98,856 research outputs found
2-Dimensional String Problems: Data Structures and Quantum Algorithms
The field of stringology studies algorithms and data structures used for processing strings efficiently. The goal of this thesis is to investigate 2-dimensional (2D) variants of some fundamental string problems, including \textit{Exact Pattern Matching} and \textit{Longest Common Substring}.
In the 2D pattern matching problem, we are given a matrix \M[1\dd n,1\dd n] that consists of symbols drawn from an alphabet of size . The query consists of a square matrix \PP[1\dd m, 1\dd m] drawn from the same alphabet, and the task is to find all the locations of \PP in \M. For such square patterns, data structures such as suffix trees and suffix arrays exist for the task of efficient pattern matching. However, a suffix tree occupies bits, which is significantly more than that of the original text\u27s size of bits. Therefore, the design of compressed data structures, that supports pattern matching queries efficiently and occupies space close to the original text\u27s size, is imperative. In this thesis, we show an interesting result by designing a compact text index of size bits that at least supports efficient inverse suffix array queries. Although, the question of designing a compressed text index that would lead to efficient pattern matching is still evasive, this index gives a hope on the existence of a full 2D compressed text index with all functionalities similar to that of 1D case.
On the other hand, the Longest Common 2D substring problem consists of two 2D strings (matrices), and the task is to report the size of the longest common 2D substring (submatrix) of these 2D strings. It is interesting to know if there exists a sublinear-time algorithm for solving this task. We answer this question positively by presenting a sublinear-time \textit{quantum} algorithm. In addition to this, we prove that any quantum algorithm requires at least time to solve this problem
Space-efficient detection of unusual words
Detecting all the strings that occur in a text more frequently or less
frequently than expected according to an IID or a Markov model is a basic
problem in string mining, yet current algorithms are based on data structures
that are either space-inefficient or incur large slowdowns, and current
implementations cannot scale to genomes or metagenomes in practice. In this
paper we engineer an algorithm based on the suffix tree of a string to use just
a small data structure built on the Burrows-Wheeler transform, and a stack of
bits, where is the length of the string and
is the size of the alphabet. The size of the stack is except for very
large values of . We further improve the algorithm by removing its time
dependency on , by reporting only a subset of the maximal repeats and
of the minimal rare words of the string, and by detecting and scoring candidate
under-represented strings that in the string. Our
algorithms are practical and work directly on the BWT, thus they can be
immediately applied to a number of existing datasets that are available in this
form, returning this string mining problem to a manageable scale.Comment: arXiv admin note: text overlap with arXiv:1502.0637
Data structures and algorithms for approximate string matching Zvi Galil, Raffaele Giancarlo
This paper surveys techniques for designing efficient sequential and parallel approximate string matching algorithms. Special attention is given to the methods for the construction of data structures that efficiently support primitive operations needed in approximate string matching
Prospects and limitations of full-text index structures in genome analysis
The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared
- …