
2-Dimensional String Problems: Data Structures and Quantum Algorithms

Author: Patel Dhrumilkumar
Publication venue: LSU Digital Commons
Publication date: 26/07/2022

The field of stringology studies algorithms and data structures used for processing strings efficiently. The goal of this thesis is to investigate 2-dimensional (2D) variants of some fundamental string problems, including Exact Pattern Matching and Longest Common Substring. In the 2D pattern matching problem, we are given a matrix $M[1..n, 1..n]$ that consists of $N = n \times n$ symbols drawn from an alphabet $\Sigma$ of size $\sigma$. The query consists of an $m \times m$ square matrix $P[1..m, 1..m]$ drawn from the same alphabet, and the task is to find all the locations of $P$ in $M$. For such square patterns, data structures such as suffix trees and suffix arrays exist for the task of efficient pattern matching. However, a suffix tree occupies $O(N \log N)$ bits, which is significantly more than the original text's size of $N \log \sigma$ bits. Therefore, the design of compressed data structures that support pattern matching queries efficiently and occupy space close to the original text's size is imperative. In this thesis, we show an interesting result by designing a compact text index of size $O(N \log\log N + N \log\sigma)$ bits that at least supports efficient inverse suffix array queries. Although the question of designing a compressed text index that would lead to efficient pattern matching remains elusive, this index gives hope for the existence of a full 2D compressed text index with all the functionality of the 1D case. On the other hand, the Longest Common 2D Substring problem takes two 2D strings (matrices) as input, and the task is to report the size of their longest common 2D substring (submatrix). It is interesting to know whether there exists a sublinear-time algorithm for solving this task. We answer this question positively by presenting a sublinear-time quantum algorithm. In addition, we prove that any quantum algorithm requires at least $\tilde{\Omega}(N^{2/3})$ time to solve this problem.
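
To make the 2D pattern matching problem statement concrete, here is a minimal brute-force sketch in Python. It is not the thesis's index or algorithm; it only illustrates the input (a square text matrix and a square pattern matrix) and the expected output (all top-left corners of occurrences). The function name `find_2d_occurrences` and the choice of representing each matrix row as a string are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def find_2d_occurrences(
    text: Sequence[str], pattern: Sequence[str]
) -> List[Tuple[int, int]]:
    """Return 0-based (row, col) top-left corners where `pattern` occurs in `text`.

    Both inputs are assumed to be square matrices given as lists of
    equal-length strings (one string per row). Runs in O(n^2 * m^2) time.
    """
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):          # candidate top row
        for j in range(n - m + 1):      # candidate left column
            # Compare the m-by-m window row by row.
            if all(text[i + r][j:j + m] == pattern[r] for r in range(m)):
                hits.append((i, j))
    return hits
```

For example, searching the 4 x 4 checkerboard `["abab", "baba", "abab", "baba"]` for the pattern `["ab", "ba"]` reports every window whose top-left symbol is `a`. An index such as a 2D suffix tree aims to answer the same query without scanning every window.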

Louisiana State University

Data Structures and Algorithms for the String Statistics Problem

Author: Apostolico Alberto
Preparata Franco P.
Publication venue: 'Purdue University (bepress)'
Publication date: 01/12/1993

Purdue E-Pubs

Space-efficient detection of unusual words

Author: A Apostolico
CAR Hoare
D Belazzougui
J Herold
J Lin
M Crochemore
S Chairungsee
Publication date: 01/01/2015

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^2\log^2 n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale. Comment: arXiv admin note: text overlap with arXiv:1502.0637
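
As a rough illustration of what "more or less frequent than expected under an IID model" means, the sketch below compares observed and expected counts of length-k words by brute force. It does not use the BWT or the paper's stack-based traversal, and it scores only words that actually occur; the function name `iid_observed_vs_expected` and the use of empirical symbol frequencies as the IID model are assumptions for illustration.

```python
from collections import Counter
from math import prod
from typing import Dict, Tuple


def iid_observed_vs_expected(text: str, k: int) -> Dict[str, Tuple[int, float]]:
    """Map each length-k word occurring in `text` to (observed, expected) counts.

    The expected count assumes symbols are drawn independently with the
    empirical symbol frequencies of `text` (an IID model).
    """
    n = len(text)
    # Empirical probability of each symbol.
    freq = {c: cnt / n for c, cnt in Counter(text).items()}
    positions = n - k + 1  # number of length-k windows
    observed = Counter(text[i:i + k] for i in range(positions))
    return {
        w: (obs, positions * prod(freq[c] for c in w))
        for w, obs in observed.items()
    }
```

A word whose observed count is well above its expected count is a candidate over-represented word, and one well below it is a candidate under-represented word. How the gap is turned into a score (for example a z-score) is left open here.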

arXiv.org e-Print Archive

Crossref

MPG.PuRe

Data structures and algorithms for approximate string matching

Author: Galil Zvi
Giancarlo Raffaele
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/1987

This paper surveys techniques for designing efficient sequential and parallel approximate string matching algorithms. Special attention is given to methods for constructing data structures that efficiently support the primitive operations needed in approximate string matching.
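
For readers unfamiliar with the problem, one classical sequential baseline for approximate string matching with at most k differences is the column-by-column dynamic program often attributed to Sellers. The sketch below is offered as background only and is not taken from the survey itself; the function name `approx_match_ends` is hypothetical.

```python
from typing import List


def approx_match_ends(pattern: str, text: str, k: int) -> List[int]:
    """Return 0-based end positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`.

    Uses O(len(pattern)) space and O(len(pattern) * len(text)) time.
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and a substring of the
    # text ending at the current position; col[0] stays 0 because a match
    # may start anywhere in the text.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(
                prev_diag + (pattern[i - 1] != c),  # match or substitution
                old + 1,                            # extra text symbol
                col[i - 1] + 1,                     # missing pattern symbol
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

With `k = 0` this degenerates to exact matching, which gives a simple sanity check for the recurrence.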

Elsevier - Publisher Connector

Columbia University Academic Commons

Prospects and limitations of full-text index structures in genome analysis

Author: Dawyndt Peter
De Baets Bernard
Fack Veerle
Vyverman Michaël
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012

The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article presents a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, and their practical limitations are explained and compared.
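
As a small illustration of what a full-text index does, the sketch below builds a suffix array naively and uses binary search to locate all occurrences of a pattern. It is a teaching example under simplifying assumptions, not one of the engineered structures the article compares: the construction sorts whole suffixes (far slower and more memory-hungry than practical builders), and the function names are hypothetical.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return starting positions of all suffixes of `text` in sorted order.

    Naive construction: sorts by the suffix strings themselves.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return sorted 0-based start positions of `pattern` in `text`."""
    m = len(pattern)
    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

The memory-time trade-off is visible even here: the array stores one integer per text position, and each query costs a logarithmic number of string comparisons instead of a full scan.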

Ghent University Academic Bibliography

PubMed Central