
700 research outputs found

A Bloom filter based semi-index on $q$-grams

Author: Grabowski Szymon; Raniszewski Marcin; Susik Robert
Publication date: 10/07/2015

We present a simple $q$-gram based semi-index, which typically allows a pattern to be searched for in only a small fraction of text blocks. Several space-time tradeoffs are presented. Experiments on Pizza & Chili datasets show that our solution is up to three orders of magnitude faster than the Claude et al. \cite{CNPSTjda10} semi-index at a comparable space usage.
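
The block-filtering idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names, the BLAKE2b-based hashing, the filter size, and the `max_pattern_len` overlap between blocks are all assumptions made for illustration. Each block gets a Bloom filter of its q-grams; a pattern is verified only in blocks whose filter contains every q-gram of the pattern.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over strings (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting BLAKE2b with the hash index.
        data = item.encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def qgrams(s, q):
    """Return the set of all length-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def build_semi_index(text, block_size, q, max_pattern_len,
                     num_bits=1024, num_hashes=3):
    """Build one Bloom filter per block (hypothetical layout).

    Each block is extended by max_pattern_len - 1 characters so that every
    q-gram of an occurrence starting in the block is inserted into its filter.
    """
    filters = []
    for start in range(0, len(text), block_size):
        chunk = text[start:start + block_size + max_pattern_len - 1]
        bf = BloomFilter(num_bits, num_hashes)
        for g in qgrams(chunk, q):
            bf.add(g)
        filters.append(bf)
    return filters


def search(text, filters, pattern, block_size, q, max_pattern_len):
    """Return all start positions of pattern, verifying only candidate blocks."""
    m = len(pattern)
    if not q <= m <= max_pattern_len:
        raise ValueError("pattern length must lie in [q, max_pattern_len]")
    grams = qgrams(pattern, q)
    hits = []
    for b, bf in enumerate(filters):
        if not all(g in bf for g in grams):
            continue  # the filter proves no occurrence starts in this block
        start = b * block_size
        window = text[start:start + block_size + m - 1]
        pos = window.find(pattern)
        while pos != -1:
            hits.append(start + pos)
            pos = window.find(pattern, pos + 1)
    return hits


text = "abracadabra" * 5
filters = build_semi_index(text, block_size=8, q=3, max_pattern_len=8)
print(search(text, filters, "cadab", block_size=8, q=3, max_pattern_len=8))
```

Bloom filters have no false negatives, so skipping a block is always safe; false positives only cost an unnecessary verification scan of that block.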

arXiv.org e-Print Archive

RazerS - Fast Read Mapping with Sensitivity Control

Author: Döring A.; Emde A.-K.; Rausch T.; Reinert K.; Weese D.
Publication venue: 'Cold Spring Harbor Laboratory'
Publication date: 10/07/2009

Second-generation sequencing technologies deliver DNA sequence data at unprecedented high throughput. Common to most biological applications is a mapping of the reads to an almost identical or highly similar reference genome. Due to the large amounts of data, efficient algorithms and implementations are crucial for this task. We present an efficient read mapping tool called RazerS. It allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance. Our tool can work either losslessly or with a user-defined loss rate at higher speeds. Given the loss rate, we present an approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time.

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

PubMed Central

TRStalker: an Efficient Heuristic for Finding NP-Complete Tandem Repeats

Author: Pellegrini Marco; Renda Maria Elena; Vecchio Alessio

Genomic sequences in higher eukaryotic organisms contain a substantial amount of (almost) repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that originate via phenomena such as replication slippage, are characterized by close spatial contiguity, and play an important role in several molecular regulatory mechanisms. Certain types of tandem repeats are highly polymorphic and constitute a fingerprint feature of individuals. Abnormal TRs are known to be linked to several diseases. Researchers in bio-informatics in the last 20 years have proposed many formal definitions for the rather loose notion of a Tandem Repeat and have proposed exact or heuristic algorithms to detect TRs in genomic sequences. The general trend has been to use formal (implicit or explicit) definitions of TR for which verification of the solution was easy (with complexity linear, or polynomial in the TR's length and substitution+indel rates) while the effort was directed towards efficiently identifying the sub-strings of the input to submit to the verification phase (either implicitly or explicitly). In this paper we take a step forward: we use a definition of TR for which the verification step is also difficult (in effect, NP-complete) and we develop new filtering techniques for coping with high error levels. The resulting heuristic algorithm, christened TRStalker, is approximate since it cannot guarantee that all NP-Complete Tandem Repeats satisfying the target definition in the input string will be found. However, in synthetic experiments with 30% of errors allowed, TRStalker has demonstrated a very high recall (ranging from 100% to 60%, depending on motif length and repetition number) for the NP-complete TRs. TRStalker has consistently better performance than some state-of-the-art methods for a large range of parameters on the class of NP-complete Tandem Repeats.
TRStalker aims at improving the capability of TR detection for classes of TRs for which existing methods do not perform well.

PUblication MAnagement

Languages of lossless seeds

Author: Břinda Karel
Publication venue: 'Open Publishing Association'
Publication date: 21/05/2014

Several algorithms for similarity search employ seeding techniques to quickly discard very dissimilar regions. In this paper, we study theoretical properties of lossless seeds, i.e., spaced seeds having full sensitivity. We prove that lossless seeds coincide with languages of certain sofic subshifts, hence they can be recognized by finite automata. Moreover, we show that these subshifts are fully given by the number of allowed errors k and the seed margin l. We also show that for a fixed k, optimal seeds must asymptotically satisfy l ~ m^(k/(k+1)). Comment: In Proceedings AFL 2014, arXiv:1405.527
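
The notion of a lossless (fully sensitive) spaced seed can be made concrete with a small brute-force check in Python. This is only a sketch under stated assumptions, not the automata-based characterization from the paper: the function name `is_lossless`, the `#` (must match) and `-` (don't care) notation, and the Hamming-mismatch model are illustrative choices. A seed is treated as lossless for read length m and k mismatches if, for every placement of k mismatches, some offset of the seed avoids all of them.

```python
from itertools import combinations


def is_lossless(seed, m, k):
    """Brute-force check of full sensitivity (hypothetical helper).

    seed: string over '#' (position must match) and '-' (don't care).
    m:    read length; k: number of mismatches to tolerate.
    """
    care = [j for j, c in enumerate(seed) if c == "#"]
    span = len(seed)
    if span > m or k > m:
        return False
    offsets = range(m - span + 1)
    # Checking exactly k mismatches suffices: fewer mismatches only make
    # it easier to find a seed placement that avoids all of them.
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        if not any(all(i + j not in bad for j in care) for i in offsets):
            return False  # this mismatch pattern defeats every placement
    return True


print(is_lossless("#-#", 5, 1))
print(is_lossless("#-#", 5, 2))
```

The check enumerates all mismatch placements, so it is only practical for small m and k; it is meant to illustrate the definition rather than to design seeds.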

arXiv.org e-Print Archive

Directory of Open Access Journals

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Anatomy of a hash-based long read sequence mapping algorithm for next generation DNA sequencing

Author: Alok Choudhary; Altschul; Ankit Agrawal; Campagna; Kent; Langmead; Li; Li; Li; Lupski; Misra; Needleman; Ning; Patrick; Pearson; Pevzner; Rasmussen; Roach; Rothberg; Rumble; Sanchit Misra; Smith; Smith; Wei-keng Liao
Publication venue: 'Oxford University Press (OUP)'

Crossref

Compressed Spaced Suffix Arrays

Author: Gagie Travis; Manzini Giovanni; Valenzuela Daniel
Publication date: 01/01/2014

Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still support fast random access to them. We first prove a theoretical upper bound on the space needed to store an SSA when we already have the SA. We then present experiments indicating that our approach works even better in practice.

arXiv.org e-Print Archive

CiteSeerX

Archivio della Ricerca - Università di Pisa

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale