Prospects and limitations of full-text index structures in genome analysis

Author: Dawyndt Peter
De Baets Bernard
Fack Veerle
Vyverman Michaël
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012

The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article presents a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared.

Ghent University Academic Bibliography

PubMed Central

Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

Author: Gagie Travis
Navarro Gonzalo
Prezza Nicola
Publication date: 04/07/2019

Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log_w(σ + n/r)) space, for a text of length n over an alphabet of size σ on a RAM machine with words of w = Ω(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w / log σ), we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation.

Comment: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + σ))
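To make the measure r concrete, here is a minimal sketch that builds a BWT by sorting all rotations and counts its runs. It is for illustration only: the function names `bwt` and `count_runs` and the `$` sentinel are assumptions of this sketch, and the quadratic construction is not how the indexes in the abstract are built.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (illustration only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    s = text + sentinel
    # Sort every rotation of s and read off the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive characters (the measure r)."""
    if not s:
        return 0
    # A new run starts wherever two neighbouring characters differ.
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


if __name__ == "__main__":
    for text in ("banana", "ab" * 8):
        transformed = bwt(text)
        print(text, transformed, count_runs(transformed))
```

On a highly repetitive input such as `"ab" * 8`, the BWT collapses into a few long runs, which is the property that space bounds expressed in r exploit.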

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Optimal-Time Text Indexing in BWT-runs Bounded Space

Author: Gagie Travis
Navarro Gonzalo
Prezza Nicola
Publication date: 11/07/2017

Indexing highly repetitive texts, such as genomic databases, software repositories and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used $O(r)$ space and was able to efficiently count the number of occurrences of a pattern of length $m$ in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of $r$. Since then, a number of other indexes with space bounded by other measures of repetitiveness (the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors) have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in loglogarithmic time each), and reaching optimal time $O(m+occ)$ within $O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits. Within $O(r\log(n/r))$ space, our index can also count in optimal time $O(m)$. Raising the space to $O(r w\log_\sigma(n/r))$, we support count and locate in $O(m\log(\sigma)/w)$ and $O(m\log(\sigma)/w+occ)$ time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using $O(r\log(n/r))$ space that replaces the text and extracts any text substring of length $\ell$ in almost-optimal time $O(\log(n/r)+\ell\log(\sigma)/w)$. (...continues...)
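The counting operation mentioned in the abstract can be illustrated with plain FM-index backward search over an uncompressed BWT. This is a minimal sketch under stated assumptions, not the run-length structure of the paper: the names `bwt` and `count_occurrences` are hypothetical, and `rank` is a naive linear scan where a real index would use a compact rank structure.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (illustration only)."""
    s = text + sentinel
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of a non-empty pattern using backward search."""
    # first_row[c] = number of characters in the text strictly smaller than c.
    counts = Counter(bwt_str)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    def rank(c: str, i: int) -> int:
        # Occurrences of c in bwt_str[:i]; a naive O(n) scan in this sketch.
        return bwt_str.count(c, 0, i)

    lo, hi = 0, len(bwt_str)  # half-open range of matching sorted rotations
    for c in reversed(pattern):  # one step per pattern symbol, right to left
        if c not in first_row:
            return 0
        lo = first_row[c] + rank(c, lo)
        hi = first_row[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    print(count_occurrences(bwt("banana"), "ana"))
```

The loop performs one range update per pattern symbol, so with constant-time rank the count takes O(m) steps. Locating the positions behind the final range is the harder part that the abstract addresses.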

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology

Double String Tandem Repeats

Author: Amir Amihood
Butman Ayelet
Landau Gad M.
Marcus Shoshana
Sokol Dina
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)
Publication date: 01/01/2020

A tandem repeat is an occurrence of two adjacent identical substrings. In this paper, we introduce the notion of a double string, which consists of two parallel strings, and we study the problem of locating all tandem repeats in a double string. The problem introduced here has applications beyond actual double strings, as we illustrate by solving two different problems with the algorithm of the double string tandem repeats problem. The first problem is that of finding all corner-sharing tandems in a 2-dimensional text, defined by Apostolico and Brimkov. The second problem is that of finding all scaled tandem repeats in a 1d text, where a scaled tandem repeat is defined as a string UU' such that U' is a discrete scale of U. In addition to the algorithms for exact tandem repeats, we also present algorithms that solve the problem in the inexact sense, allowing up to k mismatches. We believe that this framework will open a new perspective for other problems in the future.
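As a concrete reading of the definition, the following brute-force sketch lists tandem repeats in an ordinary single string, with an optional budget of k mismatches between the two halves. It is a naive baseline under stated assumptions, not the double-string algorithm of the paper; the function name `tandem_repeats` and its `(start, period)` output format are choices made for this sketch.

```python
from typing import List, Tuple


def tandem_repeats(s: str, k: int = 0) -> List[Tuple[int, int]]:
    """Return (start, period) pairs where s[start:start+period] and the
    adjacent block of the same length differ in at most k positions."""
    n = len(s)
    found = []
    for period in range(1, n // 2 + 1):
        for start in range(n - 2 * period + 1):
            left = s[start:start + period]
            right = s[start + period:start + 2 * period]
            # Hamming distance between the two adjacent halves.
            mismatches = sum(1 for a, b in zip(left, right) if a != b)
            if mismatches <= k:
                found.append((start, period))
    return found


if __name__ == "__main__":
    print(tandem_repeats("abab"))
    print(tandem_repeats("abac", k=1))
```

With `k=0` this reports exact tandem repeats only. Raising `k` gives the inexact variant described in the abstract, at the same brute-force cost.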

Dagstuhl Research Online Publication Server

Approximate Hamming distance in a stream

Author: Clifford Raphael
Starikovskaya Tatiana
Publication date: 01/01/2016

We consider the problem of computing a $(1+\epsilon)$-approximation of the Hamming distance between a pattern of length $n$ and successive substrings of a stream. We first look at the one-way randomised communication complexity of this problem, giving Alice the first half of the stream and Bob the second half. We show the following: (1) If Alice and Bob both share the pattern then there is an $O(\epsilon^{-4} \log^2 n)$ bit randomised one-way communication protocol. (2) If only Alice has the pattern then there is an $O(\epsilon^{-2}\sqrt{n}\log n)$ bit randomised one-way communication protocol. We then go on to develop small space streaming algorithms for $(1+\epsilon)$-approximate Hamming distance which give worst case running time guarantees per arriving symbol. (1) For binary input alphabets there is an $O(\epsilon^{-3} \sqrt{n} \log^{2} n)$ space and $O(\epsilon^{-2} \log{n})$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm. (2) For general input alphabets there is an $O(\epsilon^{-5} \sqrt{n} \log^{4} n)$ space and $O(\epsilon^{-4} \log^3 {n})$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm.

Comment: Submitted to ICALP' 201
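For orientation, the quantity being approximated can be computed exactly with a sliding window, as in the sketch below. This is a naive baseline under stated assumptions, not the algorithm of the paper: it keeps the whole window, so it uses O(n) space and O(n) time per arriving symbol, which is the cost the streaming results above aim to beat. The name `streaming_hamming` is hypothetical.

```python
from collections import deque
from typing import Iterable, Iterator


def streaming_hamming(pattern: str, stream: Iterable[str]) -> Iterator[int]:
    """Yield the exact Hamming distance between the pattern and each
    length-n window of the stream, once n symbols have arrived."""
    n = len(pattern)
    window = deque(maxlen=n)  # the oldest symbol is dropped automatically
    for symbol in stream:
        window.append(symbol)
        if len(window) == n:
            # Compare the pattern with the current window position by position.
            yield sum(1 for p, t in zip(pattern, window) if p != t)


if __name__ == "__main__":
    print(list(streaming_hamming("ab", "abba")))
```

The generator emits one distance per arriving symbol after the first full window, matching the "successive substrings" setting in the abstract.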

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Explore Bristol Research

Pattern matching of compressed terms and contexts and polynomial rewriting

Author: Schmidt-Schauß Manfred
Publication date: 01/01/2011

A generalization of the compressed string pattern match that applies to terms with variables is investigated: Given terms s and t compressed by singleton tree grammars, the task is to find an instance of s that occurs as a subterm in t. We show that this problem is in NP and that the task can be performed in time O(n^{c|Var(s)|}), including the construction of the compressed substitution, and a representation of all occurrences. We show that the special case where s is uncompressed can be performed in polynomial time. As a nice application we show that for an equational deduction of t to t' by an equality axiom l = r (a rewrite) a single step can be performed in polynomial time in the size of compression of t and l, r if the number of variables is fixed in l. We also show that n rewriting steps can be performed in polynomial time, if the equational axioms are compressed and assumed to be constant for the rewriting sequence. Another potential application is querying mechanisms on compressed XML databases.

CiteSeerX

Hochschulschriftenserver - Universität Frankfurt am Main