Alignment-free phylogenetic reconstruction: Sample complexity via a branching process analysis

Author: Daskalakis Constantinos
Roch Sebastien
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/02/2012

We present an efficient phylogenetic reconstruction algorithm allowing insertions and deletions which provably achieves a sequence-length requirement (or sample complexity) growing polynomially in the number of taxa. Our algorithm is distance-based, that is, it relies on pairwise sequence comparisons. More importantly, our approach largely bypasses the difficult problem of multiple sequence alignment. Comment: Published at http://dx.doi.org/10.1214/12-AAP852 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org).

arXiv.org e-Print Archive

DSpace@MIT

Crossref

RasBhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison

Author: Hahn Lars
Leimeister Chris-André
Lonardi Stefano
Morgenstern Burkhard
Ounit Rachid
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 20/07/2016

Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de
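The optimization described in this abstract can be illustrated with a small sketch. It assumes the commonly cited form of the Ilie and Ilie overlap complexity (for every pair of patterns and every relative shift, add 2 raised to the number of coinciding match positions) and uses a plain hill climb that swaps one match position with one don't-care position. This is not the rasbhari implementation, whose modified hill climbing is not specified in the abstract; the function names `overlap_complexity`, `swap_neighbor` and `hill_climb` are hypothetical.

```python
import random


def overlap(p, q, shift):
    """Count positions where p and q both have a match ('1') when q is shifted by `shift`."""
    return sum(
        1
        for i, c in enumerate(p)
        if c == "1" and 0 <= i - shift < len(q) and q[i - shift] == "1"
    )


def overlap_complexity(patterns):
    """Sum 2**overlap over all unordered pattern pairs (including self-pairs) and all shifts.

    Assumed definition; check against the original paper before relying on it.
    """
    total = 0
    for a in range(len(patterns)):
        for b in range(a, len(patterns)):
            p, q = patterns[a], patterns[b]
            for shift in range(-(len(q) - 1), len(p)):
                total += 2 ** overlap(p, q, shift)
    return total


def swap_neighbor(pattern, rng):
    """Swap one interior '1' with one interior '0'; weight, length and end positions are kept."""
    inner = range(1, len(pattern) - 1)
    ones = [i for i in inner if pattern[i] == "1"]
    zeros = [i for i in inner if pattern[i] == "0"]
    if not ones or not zeros:
        return pattern
    i, j = rng.choice(ones), rng.choice(zeros)
    chars = list(pattern)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def hill_climb(patterns, steps=2000, seed=0):
    """Accept a random single-pattern swap only if it lowers the overlap complexity."""
    rng = random.Random(seed)
    best = list(patterns)
    best_score = overlap_complexity(best)
    for _ in range(steps):
        k = rng.randrange(len(best))
        candidate = list(best)
        candidate[k] = swap_neighbor(best[k], rng)
        score = overlap_complexity(candidate)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    start = ["1110001", "1100111"]
    print(overlap_complexity(start), hill_climb(start, steps=200))
```

Swapping a match with a don't-care position keeps the weight of each pattern fixed, so the search only rearranges where the match positions fall. Minimizing sensitivity-related or variance-related objectives, as the abstract mentions, would need a different scoring function in place of `overlap_complexity`.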

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Genome sequence-based species delimitation with confidence intervals and improved distance functions

Author: Auch Alexander F.
Göker Markus
Klenk Hans-Peter
Meier-Kolthoff Jan P.
Publication date: 01/01/2013

Background: For the last 25 years, species delimitation in prokaryotes (Archaea and Bacteria) was to a large extent based on DNA-DNA hybridization (DDH), a tedious lab procedure designed in the early 1970s that served its purpose astonishingly well in the absence of deciphered genome sequences. With the rapid progress in genome sequencing, the time has come to directly use the now available and easy-to-generate genome sequences for delimitation of species. GBDP (Genome Blast Distance Phylogeny) infers genome-to-genome distances between pairs of entirely or partially sequenced genomes, a digital, highly reliable estimator for the relatedness of genomes. Its application as an in-silico replacement for DDH was recently introduced. The main challenge in the implementation of such an application is to produce digital DDH values that mimic the wet-lab DDH values as closely as possible to ensure consistency in the prokaryotic species concept. Results: Correlation and regression analyses were used to determine the best-performing methods and the most influential parameters. GBDP was further enriched with a set of new features such as confidence intervals for intergenomic distances obtained via resampling or via the statistical models for DDH prediction and an additional family of distance functions. As in previous analyses, GBDP obtained the highest agreement with wet-lab DDH among all tested methods, but improved models led to a further increase in the accuracy of DDH prediction. Confidence intervals yielded stable results when inferred from the statistical models, whereas those obtained via resampling showed marked differences between the underlying distance functions. Conclusions: Despite the high accuracy of GBDP-based DDH prediction, inferences from limited empirical data are always associated with a certain degree of uncertainty. It is thus crucial to enrich in-silico DDH replacements with confidence-interval estimation, enabling the user to statistically evaluate the outcomes. Such methodological advancements, easily accessible through the web service at http://ggdc.dsmz.de, are crucial steps towards a consistent and truly genome sequence-based classification of microorganisms.

OPUS Augsburg

Springer - Publisher Connector

PubMed Central

Alignment-Free Phylogenetic Reconstruction

Author: A. Loytynoja
B.D. Thatte
C. Daskalakis
C. Semple
D. Graur
D. Metzler
D.G. Higgins
E. Mossel
I. Elias
I. Gronau
I. Miklos
J. Felsenstein
J.L. Thorne
K. Atteson
K. Katoh
K. Liu
K.B. Athreya
K.M. Wong
L. Wang
M. Csurös
M. Hohl
M.A. Steel
M.A. Suchard
M.R. Lacey
P. Buneman
P.L. Erdös
R.C. Edgar
S. Karlin
V. King
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

14th Annual International Conference, RECOMB 2010, Lisbon, Portugal, April 25-28, 2010. Proceedings. We introduce the first polynomial-time phylogenetic reconstruction algorithm under a model of sequence evolution allowing insertions and deletions (or indels). Given appropriate assumptions, our algorithm requires sequence lengths growing polynomially in the number of leaf taxa. Our techniques are distance-based and largely bypass the problem of multiple alignment.

CiteSeerX

DSpace@MIT

Crossref

Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

Author: A Dress
A Godzik
A Löytynoja
A Novák
A Sali
A Siepel
A Tramontano
Adrienn Szabó
AS Schwartz
B Dwivedi
B Knudsen
B Larget
B Misof
B Schwikowski
BD Redelings
BJM Webb
BP Blackburne
C Dessimoz
C Notredame
CB Do
CJ Challis
D Altschuh
D Chivian
D DeBlasio
D Lupyan
D Metzler
D Robinson
DA Morrison
DF Feng
E Levy Karin
G Jordan
G Landan
G Lunter
G Raghava
G Talavera
GA Churchill
GA Lunter
Hall B G
HT Mevissen
I Holmes
I Miklós
IL Dryden
IM Wallace
István Miklós
J Castresana
J Felsenstein
J Gatesy
J Hein
J Kim
J Zhu
JA Lake
JD Thompson
JL Thorne
Joseph L Herman
Jotun Hein
K Bucka-Lassen
K Liu
KM Wong
L Wang
L Yu
LE Carvalho
LS Wang
M Hamada
M Höhl
M Vingron
M Wu
M Zuker
MA Suchard
MJ Wise
MO Dayhoff
MP Simmons
MS Waterman
MSY Lee
O Gotoh
O Penn
P Ajawatanawong
P Arunapuram
P Collingridge
PJ Green
PP Gardner
R Durbin
R Satija
R Schwarzenbacher
RA Cartwright
RC Edgar
RJ Dickson
RK Bradley
Rune Lyngsø
S Capella-Gutiérrez
S Karlin
S Miyazawa
S Needleman
S Sinha
Silla-Martínez Capella-Gutiérrez S
SME Sahraeian
TA Hopf
TH Ogden
TL Blundell
U Roshan
V Ahola
W Fletcher
WC Wheeler
Y Liu
Y Ruffieux
Ádám Novák
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Background: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. Results: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. Conclusions: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign
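The alignment DAG idea in this abstract can be sketched in a few lines. The sketch below is not the WeaveAlign implementation. It assumes a particular column encoding (each cell records whether it holds a residue and how many residues of that row have been seen so far), assumes no sampled alignment contains an all-gap column, and uses hypothetical function names. It builds the graph from sampled alignments, estimates column probabilities from empirical frequencies, and counts how many alignments the graph encodes.

```python
from collections import Counter, defaultdict

START, END = "START", "END"


def columns(alignment):
    """Encode each column as a tuple of (is_residue, residues_seen_so_far), one pair per row."""
    seen = [0] * len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        cell = []
        for r, row in enumerate(alignment):
            is_res = row[j] != "-"
            seen[r] += is_res
            cell.append((is_res, seen[r]))
        cols.append(tuple(cell))
    return cols


def build_dag(samples):
    """Return (edges, freqs): successor sets per node and empirical column frequencies."""
    col_counts = Counter()
    edges = defaultdict(set)
    for aln in samples:
        path = [START] + columns(aln) + [END]
        col_counts.update(path[1:-1])
        for u, v in zip(path, path[1:]):
            edges[u].add(v)
    freqs = {c: n / len(samples) for c, n in col_counts.items()}
    return edges, freqs


def count_paths(edges):
    """Count START-to-END paths, i.e. alignments encoded by the DAG, with memoized recursion."""
    memo = {END: 1}

    def visit(node):
        if node not in memo:
            memo[node] = sum(visit(nxt) for nxt in edges[node])
        return memo[node]

    return visit(START)


if __name__ == "__main__":
    # Two sampled alignments of the same pair of sequences; they share the X/X column.
    samples = [
        ["AXG", "AXG"],
        ["A-XG-", "-AX-G"],
    ]
    edges, freqs = build_dag(samples)
    print(len(freqs), count_paths(edges))
```

In this toy input the two samples share only the middle column, yet the graph contains four paths, because either treatment of the first residue can be combined with either treatment of the last. This is the sense in which the DAG encodes more alignments than were sampled.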

Crossref

SZTAKI Publication Repository

Springer - Publisher Connector

PubMed Central

Oxford University Research Archive