Search CORE

7,533 research outputs found

Prospects and limitations of full-text index structures in genome analysis

Author: Dawyndt Peter
De Baets Bernard
Fack Veerle
Vyverman Michaël
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012
Field of study

The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared

Ghent University Academic Bibliography

PubMed Central

The k-mismatch problem revisited

Author: Clifford Raphaël
Fontaine Allyx
Porat Ely
Sach Benjamin
Starikovskaya Tatiana
Publication venue
Publication date: 27/08/2015
Field of study

We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows: 1) Our first result is a deterministic

O(n k^2\log{k} / m+n \text{polylog} m)

time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 by Amihood Amir et al. 2) We then give a randomised and online algorithm which runs in the same time complexity but requires only

O(k^2\text{polylog} {m})

space in total. 3) Next we give a randomised

(1+\epsilon)

-approximation algorithm for the streaming k-mismatch problem which uses

O(k^2\text{polylog} m / \epsilon^2)

space and runs in

O(\text{polylog} m / \epsilon^2)

worst-case time per arriving symbol. 4) Finally we combine our new results to derive a randomised

O(k^2\text{polylog} {m})

space algorithm for the streaming k-mismatch problem which runs in

O(\sqrt{k}\log{k} + \text{polylog} {m})

worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors)

arXiv.org e-Print Archive

Crossref

Explore Bristol Research

Re-Use Dynamic Programming for Sequence Alignment: An Algorithmic Toolkit

Author: Crochemore Maxime
Landau Gad M.
Schieber Baruch
Ziv-Ukelson Michal
Publication venue: King's College London Publications
Publication date: 01/01/2005
Field of study

International audienceThe problem of comparing two sequences S and T to determine their similarity is one of the fundamental problems in pattern matching. In this manuscript we will be primarily concerned with sequences as our objects and with various string comparison metrics. Our goal is to survey a methodology for utilizing repetitions in sequences in order to speed up the comparison process. Within this framework we consider various methods of parsing the sequences in order to frame their repetitions, and present a toolkit of various solutions whose time complexity depends both on the chosen parsing method as well as on the string-comparison metric used for the alignment

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Hardness of longest common subsequence for sequences with bounded run-lengths

Author: Blin Guillaume
Bulteau Laurent
Jiang Minghui
Tejada Pedro J.
Vialette Stéphane
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 03/07/2012
Field of study

International audienceThe longest common subsequence (LCS) problem is a classic and well-studied problem in computer science with extensive applications in diverse areas ranging from spelling error corrections to molecular biology. This paper focuses on LCS for fixed alphabet size and fixed run-lengths (i.e., maximum number of consecutive occurrences of the same symbol). We show that LCS is NP-complete even when restricted to (i) alphabets of size 3 and run-length at most 1, and (ii) alphabets of size 2 and run-length at most 2 (both results are tight). For the latter case, we show that the problem is approximable within ratio 3/5

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

RLZAP: Relative Lempel-Ziv with Adaptive Pointers

Author: A Farruggia
C Boucher
C Hoobin
D Belazzougui
H Ferrada
J Ziv
J Ziv
M Léonard
P Ferragina
R Raman
S Deorowicz
S Deorowicz
S Kuruppu
Publication venue
Publication date: 01/01/2016
Field of study

Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression because many of the differences between individuals' genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.'s idea to handle well also short insertions, deletions and multi-character substitutions. We show experimentally that our generalization achieves better compression than Ferrada et al.'s implementation with comparable random-access times

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

A two-base encoded DNA sequence alignment problem in computational biology

Author: Chang Jen-Mei
Drakes Chiaka
Huang Erya
Langat Deidrey
McGrath Joseph
Morabito Mark
Pacheco Jose
Rodriguez Nancy
Salazar Daniel
Vemuri Rao
Vu Man
Wadhar Hem
Wu Qin
Publication venue
Publication date: 01/01/2009
Field of study

The recent introduction of instruments capable of producing millions of DNA sequence reads in a single run is rapidly changing the landscape of genetics. The primary objective of the "sequence alignment" problem is to search for a new algorithm that facilitates the use of two-base encoded data for large-scale re-sequencing projects. This algorithm should be able to perform local sequence alignment as well as error detection and correction in a reliable and systematic manner, enabling the direct comparison of encoded DNA sequence reads to a candidate reference DNA sequence. We will first briefly review two well-known sequence alignment approaches and provide a rudimentary improvement for implementation on parallel systems. Then, we carefully examin a unique sequencing technique known as the SOLiDTM System that can be implemented, and follow by the results from the global and local sequence alignment. In this report, the team presents an explanation of the algorithms for color space sequence data from the high-throughput re-sequencing technology and a theoretical parallel approach to the dynamic programming method for global and local alignment. The combination of the di-base approach and dynamic programming provides a possible viewpoint for large-scale re-sequencing projects. We anticipate the use of distributed computing to be the next-generation engine for large-scale problems like such