    Optimizing a Massive Parallel Sequencing Workflow for Quantitative miRNA Expression Analysis

    BACKGROUND: Massive Parallel Sequencing (MPS) methods can extend and improve the knowledge obtained by conventional microarray technology, both for mRNAs and for short non-coding RNAs such as miRNAs. The processing methods used to extract and interpret the information are an important aspect of dealing with the vast amounts of data generated by short read sequencing. Although the number of computational tools for MPS data analysis is constantly growing, their strengths and weaknesses as parts of a complex analytical pipeline have not yet been well investigated. PRIMARY FINDINGS: A benchmark MPS miRNA dataset, resembling a situation in which miRNAs are spiked into biological replication experiments, was assembled by merging a publicly available MPS spike-in miRNA data set with MPS data derived from healthy donor peripheral blood mononuclear cells. Using this data set, we observed that short read counts are strongly underestimated for duplicated miRNAs if the whole genome is used as the reference. Furthermore, the sensitivity of miRNA detection depends strongly on the primary tool used in the analysis. Among the six aligners tested, specifically devoted to miRNA detection, SHRiMP and MicroRazerS show the highest sensitivity. Differential expression estimation is quite efficient: among the five tools investigated, two (DESeq, baySeq) show very good specificity and sensitivity in the detection of differential expression. CONCLUSIONS: The results of our analysis allow the definition of a clear, simple and optimized analytical workflow for digital quantitative miRNA analysis.
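
    To make the duplicated-miRNA undercounting concrete, here is a minimal Python sketch (illustrative names only, not the workflow's actual code): a read aligned against the whole genome can hit every identical copy of a duplicated miRNA, and a counter that discards such multi-mapped reads underestimates exactly those miRNAs.

```python
from collections import Counter

def count_mirnas(alignments, multimap="drop"):
    """Toy read counter illustrating the duplicated-miRNA pitfall.

    alignments: list of (read_id, [miRNA hits]) -- a read aligned against
    the whole genome can hit every identical copy of a duplicated miRNA.
    Dropping such multi-mappers ("drop") underestimates those miRNAs;
    splitting the count across hits ("fraction") avoids the bias.
    """
    counts = Counter()
    for read_id, hits in alignments:
        if len(hits) == 1:
            counts[hits[0]] += 1.0
        elif multimap == "fraction":
            for h in hits:
                counts[h] += 1.0 / len(hits)
    return counts

# mir-a has two identical genomic copies; every mir-a read hits both.
alignments = [("r1", ["mir-a/1", "mir-a/2"]),
              ("r2", ["mir-a/1", "mir-a/2"]),
              ("r3", ["mir-b"])]
print(count_mirnas(alignments, "drop"))      # mir-a copies get 0 reads
print(count_mirnas(alignments, "fraction"))  # 0.5 + 0.5 per mir-a read
```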

    Circular sequence comparison: algorithms and applications

    Background: Sequence comparison is a fundamental step in many important tasks in bioinformatics, from phylogenetic reconstruction to the reconstruction of genomes. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. As circular molecular structure is a common phenomenon in nature, alignment techniques have been adapted to circular sequence comparison; a caveat is that they are computationally expensive, requiring super-quadratic to cubic time in the length of the sequences. Results: In this paper, we introduce a new distance measure based on q-grams, and show how it can be applied effectively and computed efficiently for circular sequence comparison. Experimental results, using real DNA, RNA, and protein sequences as well as synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining accuracy very competitive with the state of the art.
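
    As a minimal illustration of the q-gram idea (a sketch, not the paper's algorithm or implementation): the q-gram profile of a circular sequence of length n can be read off a single linearization s + s[:q-1], which makes the profile invariant under rotation, so a simple L1 distance between profiles compares two circular sequences without aligning over all rotations.

```python
from collections import Counter

def circular_qgram_profile(s: str, q: int) -> Counter:
    """Count all n q-grams of a circular sequence by wrapping q-1 characters."""
    t = s + s[:q - 1]                      # linearize the circle once
    return Counter(t[i:i + q] for i in range(len(s)))

def circular_qgram_distance(a: str, b: str, q: int) -> int:
    """L1 distance between the circular q-gram profiles of a and b.

    Rotating either sequence leaves its profile unchanged, so no alignment
    over all rotations is needed -- this runs in O(|a| + |b|) for fixed q.
    """
    pa, pb = circular_qgram_profile(a, q), circular_qgram_profile(b, q)
    return sum(abs(pa[g] - pb[g]) for g in set(pa) | set(pb))

# Rotations of the same circular sequence have distance 0:
assert circular_qgram_distance("ACGTAC", "TACACG", 3) == 0
```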

    Longest Common Prefixes with k-Errors and Applications

    Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length $n$ over a constant-sized alphabet that occurs elsewhere in the string with $k$ errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant $k$, using only linear space, under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in $\mathcal{O}(n \log^k n \log \log n)$ time on average using $\mathcal{O}(n)$ space. We show that our technique is applicable to several algorithmic problems in computational biology and elsewhere.
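
    For intuition, here is a naive cubic-time Python baseline for the Hamming-distance variant of the problem (a correctness reference only; the paper's contribution is the far faster average-case algorithm above).

```python
def longest_prefix_with_k_errors(s: str, k: int) -> list[int]:
    """For each suffix s[i:], length of its longest prefix that occurs
    elsewhere in s (starting at some j != i) with at most k mismatches.

    Naive O(n^3) baseline under the Hamming distance model -- a correctness
    reference, not the paper's O(n log^k n log log n) average-case method.
    """
    n = len(s)
    ans = [0] * n
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            mismatches, length = 0, 0
            while i + length < n and j + length < n:
                if s[i + length] != s[j + length]:
                    mismatches += 1
                    if mismatches > k:
                        break
                length += 1
            ans[i] = max(ans[i], length)
    return ans

print(longest_prefix_with_k_errors("abaab", 1))
```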

    The Full Landscape of Robust Mean Testing: Sharp Separations between Oblivious and Adaptive Contamination

    We consider the question of Gaussian mean testing, a fundamental task in high-dimensional distribution testing and signal processing, subject to adversarial corruptions of the samples. We focus on the relative power of different adversaries, and show that, in contrast to the common wisdom in robust statistics, there exists a strict separation between adaptive adversaries (strong contamination) and oblivious ones (weak contamination) for this task. Specifically, we resolve both the information-theoretic and computational landscapes for robust mean testing. In the exponential-time setting, we establish the tight sample complexity of testing $\mathcal{N}(0,I)$ against $\mathcal{N}(\alpha v, I)$, where $\|v\|_2 = 1$, with an $\varepsilon$-fraction of adversarial corruptions, to be $\tilde{\Theta}\!\left(\max\left(\frac{\sqrt{d}}{\alpha^2}, \frac{d\varepsilon^3}{\alpha^4}, \min\left(\frac{d^{2/3}\varepsilon^{2/3}}{\alpha^{8/3}}, \frac{d\varepsilon}{\alpha^2}\right)\right)\right)$, while the complexity against adaptive adversaries is $\tilde{\Theta}\!\left(\max\left(\frac{\sqrt{d}}{\alpha^2}, \frac{d\varepsilon^2}{\alpha^4}\right)\right)$, which is strictly worse for a large range of vanishing $\varepsilon, \alpha$. To the best of our knowledge, ours is the first separation in sample complexity between the strong and weak contamination models. In the polynomial-time setting, we close a gap in the literature by providing a polynomial-time algorithm against adaptive adversaries achieving the above sample complexity $\tilde{\Theta}(\max(\sqrt{d}/\alpha^2, d\varepsilon^2/\alpha^4))$, and a low-degree lower bound (which complements an existing reduction from planted clique) suggesting that all efficient algorithms require this many samples, even in the oblivious-adversary setting. Comment: To appear in FOCS 202
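
    For intuition on the uncorrupted $\sqrt{d}/\alpha^2$ term common to both bounds, here is a toy, non-robust Python sketch of Gaussian mean testing via the squared norm of the sample mean (illustrative names and thresholds; this is not the paper's algorithm, and it fails under even mild contamination).

```python
import numpy as np

def mean_test(samples: np.ndarray, alpha: float) -> bool:
    """Toy Gaussian mean tester (no robustness): reject H0: mu = 0 when
    ||sample mean||^2 exceeds its null expectation d/m by a margin.

    Under H1 with ||mu|| = alpha, E||xbar||^2 = alpha^2 + d/m, so the two
    hypotheses separate once m >> sqrt(d)/alpha^2 -- the first term of the
    sample-complexity bounds above. A single adversarially placed sample
    can already flip this test, hence the need for robust procedures.
    """
    m, d = samples.shape
    stat = np.sum(samples.mean(axis=0) ** 2)
    return stat > d / m + alpha ** 2 / 2    # reject = "mean is nonzero"

rng = np.random.default_rng(0)
d, alpha = 1000, 0.5
m = int(8 * np.sqrt(d) / alpha ** 2)        # ~ sqrt(d)/alpha^2 samples
null = rng.standard_normal((m, d))
v = np.zeros(d); v[0] = 1.0                 # unit direction v
alt = rng.standard_normal((m, d)) + alpha * v
print(mean_test(null, alpha), mean_test(alt, alpha))   # expect False, True
```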

    Detecting genomic indel variants with exact breakpoints in single- and paired-end sequencing data using SplazerS

    Motivation: The reliable detection of genomic variation in resequencing data is still a major challenge, especially for variants larger than a few base pairs. Sequencing reads crossing boundaries of structural variation carry the potential for their identification, but are difficult to map. Results: Here we present a method for ‘split’ read mapping, where prefix and suffix match of a read may be interrupted by a longer gap in the read-to-reference alignment. We use this method to accurately detect medium-sized insertions and long deletions with precise breakpoints in genomic resequencing data. Compared with alternative split mapping methods, SplazerS significantly improves sensitivity for detecting large indel events, especially in variant-rich regions. Our method is robust in the presence of sequencing errors as well as alignment errors due to genomic mutations/divergence, and can be used on reads of variable lengths. Our analysis shows that SplazerS is a versatile tool applicable to unanchored or single-end as well as anchored paired-end reads. In addition, application of SplazerS to targeted resequencing data led to the interesting discovery of a complete, possibly functional gene retrocopy variant. Availability: SplazerS is available from http://www.seqan.de/projects/splazers.
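
    To illustrate the split-read idea, here is a toy Python sketch (not SplazerS itself, which additionally tolerates errors in the matched parts and scores candidate splits): anchor a read's prefix and suffix exactly in the reference, and read off the intervening gap as a deletion with an exact breakpoint.

```python
def split_map(read: str, ref: str, min_part: int = 5):
    """Toy split-read mapper: anchor the read's prefix and suffix exactly
    in ref and interpret the gap between the two anchors as a deletion.

    Illustrates the split-mapping idea only: both anchors must be at least
    min_part long, and the first (longest-prefix) split found is returned.
    """
    n = len(read)
    for plen in range(n - min_part, min_part - 1, -1):   # longest prefix first
        p_pos = ref.find(read[:plen])
        if p_pos < 0:
            continue
        s_pos = ref.find(read[plen:], p_pos + plen)      # suffix right of prefix
        if s_pos >= 0:
            deletion = s_pos - (p_pos + plen)
            return {"prefix_at": p_pos, "suffix_at": s_pos, "deleted_bp": deletion}
    return None

ref  = "ACGTACGT" + "T" * 9 + "GGCCGGCC"
read = "ACGTACGT" + "GGCCGGCC"        # read spans a 9 bp deletion of the T-run
print(split_map(read, ref))           # {'prefix_at': 0, 'suffix_at': 17, 'deleted_bp': 9}
```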

    Phylogenetic comparative assembly

    Husemann P, Stoye J. Phylogenetic Comparative Assembly. Algorithms for Molecular Biology. 2010;5(1):3. BACKGROUND: Recent high-throughput sequencing technologies are capable of generating a huge amount of data for bacterial genome sequencing projects. Although current sequence assemblers successfully merge the overlapping reads, often several contigs remain which cannot be assembled any further. It is still costly and time consuming to close all the gaps in order to acquire the whole genomic sequence. RESULTS: Here we propose an algorithm that takes several related genomes and their phylogenetic relationships into account to create a graph that contains the likelihood for each pair of contigs to be adjacent. Subsequently, this graph can be used to compute a layout graph that shows the most promising contig adjacencies in order to aid biologists in finishing the complete genomic sequence. The layout graph shows unique contig orderings where possible, and the best alternatives where necessary. CONCLUSIONS: Our new algorithm for contig ordering uses sequence similarity as well as phylogenetic information to estimate adjacencies of contigs. An evaluation of our implementation shows that it performs better than recent approaches while being much faster at the same time.
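
    A toy Python sketch of the underlying idea (hypothetical names and weights; not the authors' exact likelihood model): each related genome votes for the contig pairs it places adjacently, votes are weighted by phylogenetic closeness, and the highest-scoring edges form the layout graph.

```python
from collections import defaultdict

def adjacency_graph(evidence, phylo_weight):
    """Toy phylogeny-weighted contig ordering.

    evidence: {genome: [(contigA, contigB), ...]} adjacencies suggested by
              mapping contigs onto that related genome.
    phylo_weight: {genome: w} with larger w for closer relatives.
    Returns a weighted adjacency graph over unordered contig pairs.
    """
    graph = defaultdict(float)
    for genome, pairs in evidence.items():
        for a, b in pairs:
            graph[frozenset((a, b))] += phylo_weight[genome]
    return graph

evidence = {
    "close_relative":   [("c1", "c2"), ("c2", "c3")],
    "distant_relative": [("c1", "c3")],
}
weights = {"close_relative": 0.9, "distant_relative": 0.2}
g = adjacency_graph(evidence, weights)
for pair, score in sorted(g.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), score)   # highest-scoring adjacencies first
```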

    Parallel Natural Language Parsing: From Analysis to Speedup

    Electrical Engineering, Mathematics and Computer Science

    Effective Instance Matching for Heterogeneous Structured Data

    A main obstacle to the effective usage of structured data is instance matching, where the goal is to find instance representations that refer to the same real-world entity. In this book we investigate how to effectively match heterogeneous structured data. We evaluate our approaches against the latest baselines, and the results show advances beyond the state of the art.
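
    As a minimal sketch of one common instance-matching baseline (illustrative only; the book's methods are more sophisticated): compare instances by the Jaccard similarity of their value tokens, ignoring attribute names so that heterogeneous schemas can still be matched.

```python
def tokens(record: dict) -> set[str]:
    """Bag of lower-cased value tokens, ignoring which attribute they came
    from -- a deliberately schema-agnostic view for heterogeneous data."""
    return {tok for value in record.values() for tok in str(value).lower().split()}

def match(r1: dict, r2: dict, threshold: float = 0.5) -> bool:
    """Declare two instance representations as the same real-world thing
    when the Jaccard similarity of their value tokens passes a threshold."""
    t1, t2 = tokens(r1), tokens(r2)
    return len(t1 & t2) / len(t1 | t2) >= threshold

a = {"name": "J. S. Bach", "born": "1685 Eisenach"}
b = {"label": "Johann Sebastian Bach", "birthplace": "Eisenach"}
print(match(a, b, threshold=0.25))   # True: 2 shared tokens out of 7
```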