98,856 research outputs found

    2-Dimensional String Problems: Data Structures and Quantum Algorithms

    Get PDF
    The field of stringology studies algorithms and data structures used for processing strings efficiently. The goal of this thesis is to investigate 2-dimensional (2D) variants of some fundamental string problems, including \textit{Exact Pattern Matching} and \textit{Longest Common Substring}. In the 2D pattern matching problem, we are given a matrix \M[1\dd n,1\dd n] that consists of N=n×nN = n \times n symbols drawn from an alphabet Σ\Sigma of size σ\sigma. The query consists of a m×m m \times m square matrix \PP[1\dd m, 1\dd m] drawn from the same alphabet, and the task is to find all the locations of \PP in \M. For such square patterns, data structures such as suffix trees and suffix arrays exist for the task of efficient pattern matching. However, a suffix tree occupies O(NlogN)O(N \log N) bits, which is significantly more than that of the original text\u27s size of NlogσN\log \sigma bits. Therefore, the design of compressed data structures, that supports pattern matching queries efficiently and occupies space close to the original text\u27s size, is imperative. In this thesis, we show an interesting result by designing a compact text index of size O(NloglogN+Nlogσ)O(N \log\log N + N \log\sigma) bits that at least supports efficient inverse suffix array queries. Although, the question of designing a compressed text index that would lead to efficient pattern matching is still evasive, this index gives a hope on the existence of a full 2D compressed text index with all functionalities similar to that of 1D case. On the other hand, the Longest Common 2D substring problem consists of two 2D strings (matrices), and the task is to report the size of the longest common 2D substring (submatrix) of these 2D strings. It is interesting to know if there exists a sublinear-time algorithm for solving this task. We answer this question positively by presenting a sublinear-time \textit{quantum} algorithm. In addition to this, we prove that any quantum algorithm requires at least Ω~(N2/3)\tilde{\Omega}(N^{2/3}) time to solve this problem

    Data Structures and Algorithms for the String Statistics Problem

    Get PDF

    Space-efficient detection of unusual words

    Full text link
    Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of O(σ2log2n)O(\sigma^2\log^2 n) bits, where nn is the length of the string and σ\sigma is the size of the alphabet. The size of the stack is o(n)o(n) except for very large values of σ\sigma. We further improve the algorithm by removing its time dependency on σ\sigma, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur\textit{do not occur} in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.Comment: arXiv admin note: text overlap with arXiv:1502.0637

    Data structures and algorithms for approximate string matching Zvi Galil, Raffaele Giancarlo

    Get PDF
    This paper surveys techniques for designing efficient sequential and parallel approximate string matching algorithms. Special attention is given to the methods for the construction of data structures that efficiently support primitive operations needed in approximate string matching

    Prospects and limitations of full-text index structures in genome analysis

    Get PDF
    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared
    corecore