
Genome sequence-based species delimitation with confidence intervals and improved distance functions

Author: Auch Alexander F.
Göker Markus
Klenk Hans-Peter
Meier-Kolthoff Jan P.
Publication date: 01/01/2013

Background: For the last 25 years, species delimitation in prokaryotes (Archaea and Bacteria) was to a large extent based on DNA-DNA hybridization (DDH), a tedious lab procedure designed in the early 1970s that served its purpose astonishingly well in the absence of deciphered genome sequences. With the rapid progress in genome sequencing, the time has come to directly use the now available and easy-to-generate genome sequences for the delimitation of species. GBDP (Genome Blast Distance Phylogeny) infers genome-to-genome distances between pairs of entirely or partially sequenced genomes, a digital, highly reliable estimator for the relatedness of genomes. Its application as an in-silico replacement for DDH was recently introduced. The main challenge in implementing such an application is to produce digital DDH values that mimic the wet-lab DDH values as closely as possible, to ensure consistency in the prokaryotic species concept.
Results: Correlation and regression analyses were used to determine the best-performing methods and the most influential parameters. GBDP was further enriched with a set of new features, such as confidence intervals for intergenomic distances (obtained via resampling or via the statistical models for DDH prediction) and an additional family of distance functions. As in previous analyses, GBDP obtained the highest agreement with wet-lab DDH among all tested methods, and improved models led to a further increase in the accuracy of DDH prediction. Confidence intervals yielded stable results when inferred from the statistical models, whereas those obtained via resampling showed marked differences between the underlying distance functions.
Conclusions: Despite the high accuracy of GBDP-based DDH prediction, inferences from limited empirical data are always associated with a certain degree of uncertainty. It is thus crucial to enrich in-silico DDH replacements with confidence-interval estimation, enabling the user to statistically evaluate the outcomes. Such methodological advancements, easily accessible through the web service at http://ggdc.dsmz.de, are crucial steps towards a consistent and truly genome sequence-based classification of microorganisms.
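The idea of attaching a resampling-based confidence interval to an intergenomic distance can be illustrated with a minimal sketch. This is not the GBDP implementation: the distance formula (one minus the share of identical positions within HSPs), the `(identities, length)` record layout, the function names, and the example numbers are all assumptions made for illustration, and a plain percentile bootstrap stands in for whatever resampling scheme the authors used.

```python
import random

def distance(hsps):
    """Hypothetical distance: 1 - identical positions / total HSP length."""
    total_len = sum(length for _, length in hsps)
    if total_len == 0:
        return 1.0  # no matches at all: treat as maximally distant
    identities = sum(ident for ident, _ in hsps)
    return 1.0 - identities / total_len

def bootstrap_ci(hsps, replicates=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling HSPs with replacement."""
    rng = random.Random(seed)
    n = len(hsps)
    estimates = sorted(
        distance([hsps[rng.randrange(n)] for _ in range(n)])
        for _ in range(replicates)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (replicates - 1))]
    upper = estimates[int((1.0 - tail) * (replicates - 1))]
    return lower, upper

# Made-up HSPs as (identical positions, HSP length); not real data.
example_hsps = [(950, 1000), (430, 500), (1800, 2000), (270, 300), (640, 800)]
point = distance(example_hsps)
low, high = bootstrap_ci(example_hsps)
print(f"distance = {point:.4f}, 95% interval = [{low:.4f}, {high:.4f}]")
```

A wide interval signals that the point estimate rests on few or heterogeneous matches, which is the kind of uncertainty the abstract argues users should be able to evaluate.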

OPUS Augsburg

Methods for comparative metagenomics

Author: A Bernal
Alexander F Auch
B Rodriguez-Brito
C Lozupone
C von Mering
CL Wells
CR Woese
D Benson
D Bentley
Daniel C Richter
Daniel H Huson
DB Rusch
DH Huson
FD Ciccarelli
GW Tyson
HN Poinar
J Raes
JC Venter
L Krause
M Ashburner
M Margulies
N Lang-Unnasch
PB Eckburg
PJ Turnbaugh
R Lambert
R Overbeek
RL Tatusov
SF Altschul
SG Tringe
SR Gill
Stephan C Schuster
Suparna Mitra
T Urich
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Metagenomics is a rapidly growing field of research that aims at studying uncultured organisms to understand the true diversity of microbes, their functions, cooperation and evolution, in environments such as soil, water, ancient remains of animals, or the digestive system of animals and humans. The recent development of ultra-high throughput sequencing technologies, which do not require cloning or PCR amplification and can produce huge numbers of DNA reads at an affordable cost, has boosted the number and scope of metagenomic sequencing projects. Increasingly, there is a need for new ways of comparing multiple metagenomic datasets, and for fast and user-friendly implementations of such approaches.
Results: This paper introduces a number of new methods for interactively exploring, analyzing and comparing multiple metagenomic datasets, which will be made freely available in a new, comparative version 2.0 of the stand-alone metagenome analysis tool MEGAN.
Conclusion: There is a great need for powerful and user-friendly tools for comparative analysis of metagenomic data, and MEGAN 2.0 will help to fill this gap.

Directory of Open Access Journals

Infoscience - École polytechnique fédérale de Lausanne

ScholarBank@NUS

AxPcoords & parallel AxParafit: statistical co-phylogenetic analyses on thousands of taxa

Author: A Jackson
A Stamatakis
Alexander F Auch
Alexandros Stamatakis
CE Robertson
D Begerow
D Zwickl
DA Griffith
DE Soltis
F Ronquist
GW Grimm
H Hansen
J Nannfeldt
J Stevens
Jan Meier-Kolthoff
JP Meier-Kolthoff
K Vanky
M Bollhöfer
M Hendrichs
M Meinilä
M Piepenbring
Markus Göker
MM McMahon
ORP Bininda-Emonds
P Goloboff
P Legendre
P Thomas
R Bauer
R Palese
R Ricklefs
RDM Page
RE Ley
T Huyse
TZ DeSantis
WH Press
Z Bai
Z de Beer
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Current tools for co-phylogenetic analyses are not able to cope with the continuous accumulation of phylogenetic data. The sophisticated statistical test for host-parasite co-phylogenetic analyses implemented in Parafit cannot handle large datasets in reasonable time. The Parafit and DistPCoA programs are by far the most compute-intensive components of the Parafit analysis pipeline. We present AxParafit and AxPcoords (Ax stands for Accelerated), which are highly optimized versions of Parafit and DistPCoA, respectively.
Results: Both programs have been entirely re-written in C. Via optimization of the algorithm and the C code, as well as integration of highly tuned BLAS and LAPACK methods, AxParafit runs 5–61 times faster than Parafit with a lower memory footprint (up to 35% reduction), and the performance benefit increases with growing dataset size. The MPI-based parallel implementation of AxParafit shows good scalability on up to 128 processors, even on medium-sized datasets. The parallel analysis with AxParafit on 128 CPUs for a medium-sized dataset with a 512 by 512 association matrix is more than 1,200/128 times faster per processor than the sequential Parafit run. AxPcoords is 8–26 times faster than DistPCoA and numerically stable on large datasets. We outline the substantial benefits of using parallel AxParafit by example of a large-scale empirical study on smut fungi and their host plants. To the best of our knowledge, this study represents the largest co-phylogenetic analysis to date.
Conclusion: The highly efficient AxPcoords and AxParafit programs allow for large-scale co-phylogenetic analyses on several thousands of taxa for the first time. In addition, AxParafit and AxPcoords have been integrated into the easy-to-use CopyCat tool.
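The "1,200/128 times faster per processor" figure can be unpacked with a short calculation. The reading assumed here, which the abstract does not spell out, is that the 128-CPU run was more than 1,200 times faster overall than sequential Parafit, so the gain attributable to each processor is that total divided by the processor count. The function name is hypothetical.

```python
def per_processor_speedup(total_speedup, processors):
    """Overall speedup divided evenly across the processors used."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return total_speedup / processors

# Figures taken from the abstract; both are lower bounds ("more than").
total_speedup = 1200
cpus = 128
ratio = per_processor_speedup(total_speedup, cpus)
print(f"per-processor speedup: more than {ratio:.3f}x")
```

Under this reading the per-processor gain of roughly 9.4 falls inside the 5–61 times range reported for sequential AxParafit, which suggests little of the sequential gain is lost to parallel overhead at this dataset size.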

OPUS Augsburg

Genome BLAST distance phylogenies inferred from whole plastid and whole mitochondrion genome sequences

Author: A Rokas
Alexander F Auch
B Snel
Barbara R Holland
BM Moret
BR Holland
D Bryant
D Posada
D Sankoff
DH Huson
DL Swofford
FJ Lapointe
GDP Clarke
HH Otu
HJ Bandelt
HS Yoon
J Felsenstein
J Leebens-Mack
J Lin
JA Studier
JF Pombert
JJ Faraway
JL Thorne
JT Harper
L Lefkovitch
LS Vinh
LS Wang
M Källersjö
M Li
M Thines
M Wilkinson
Markus Göker
MSY Lee
N Saitou
NA Moran
NN Fast
O Gascuel
P Buneman
P Legendre
R
R Desper
RL Charlebois
RR Sokal
S Gribaldo
S Guindon
S Köhler
S Neuvonen
S Vinga
SF Altschul
SM Adl
SR Henz
ST Fitz-Gibbon
Stefan R Henz
T Nishiyama
TD Pham
TR Bachvaroff
V Kunin
V Savolainen
VV Goremykin
W Martin
WB Zomlefer
WC Wheeler
WF Doolittle
WJ Murphy
YI Wolf
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Phylogenetic methods which do not rely on multiple sequence alignments are important tools in inferring trees directly from completely sequenced genomes. Here, we extend the recently described Genome BLAST Distance Phylogeny (GBDP) strategy to compute phylogenetic trees from all completely sequenced plastid genomes currently available and from a selection of mitochondrial genomes representing the major eukaryotic lineages. BLASTN, TBLASTX, or combinations of both are used to locate high-scoring segment pairs (HSPs) between two sequences, from which pairwise similarities and distances are computed in different ways, resulting in a total of 96 GBDP variants. The suitability of these distance formulae for phylogeny reconstruction is directly estimated by computing a recently described measure of "treelikeness", the so-called δ value, from the respective distance matrices. Additionally, we compare the trees inferred from these matrices using UPGMA, NJ, BIONJ, FastME, or STC, respectively, with the NCBI taxonomy tree of the taxa under study.
RESULTS: Our results indicate that, at this taxonomic level, plastid genomes are much more valuable for inferring phylogenies than are mitochondrial genomes, and that distances based on breakpoints are of little use. Distances based on the proportion of "matched" HSP length to average genome length were best for tree estimation. Additionally, we found that using TBLASTX instead of BLASTN and, particularly, combining TBLASTX and BLASTN leads to a small but significant increase in accuracy. Other factors do not significantly affect the phylogenetic outcome. The BIONJ algorithm results in phylogenies most in accordance with the current NCBI taxonomy, with NJ and FastME performing insignificantly worse, and STC performing as well if applied to high-quality distance matrices. δ values are found to be a reliable predictor of phylogenetic accuracy.
CONCLUSION: Using the most treelike distance matrices, as judged by their δ values, distance methods are able to recover all major plant lineages, and are more in accordance with Apicomplexa organelles being derived from "green" plastids than from plastids of the "red" type. GBDP-like methods can be used to reliably infer phylogenies from different kinds of genomic data. A framework is established to further develop and improve such methods. δ values are a topology-independent tool of general use for the development and assessment of distance methods for phylogenetic inference.
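As a rough illustration of how a quartet-based δ value can be computed from a distance matrix, the sketch below follows the commonly cited definition: for each quartet, sort the three pairwise distance sums and take (largest - middle) / (largest - smallest), which is 0 for perfectly treelike distances. Averaging over all quartets, the function names, and the toy matrices are assumptions of this sketch; the abstract does not give the formula or the averaging scheme the authors used.

```python
from itertools import combinations

def quartet_delta(d, x, y, u, v):
    """Delta for one quartet: 0 means treelike, values near 1 mean conflict."""
    sums = sorted([
        d[x][y] + d[u][v],
        d[x][u] + d[y][v],
        d[x][v] + d[y][u],
    ])
    smallest, middle, largest = sums
    if largest == smallest:
        return 0.0  # all three sums equal (star-like quartet)
    return (largest - middle) / (largest - smallest)

def mean_delta(d):
    """Average quartet delta over all quartets of a symmetric matrix."""
    values = [quartet_delta(d, *q) for q in combinations(range(len(d)), 4)]
    if not values:
        raise ValueError("need at least four taxa")
    return sum(values) / len(values)

# Toy matrix that fits the tree ((A,B),(C,D)) exactly: delta is 0.
treelike = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
# Toy matrix with conflicting signal: the pair sums are 4, 6 and 8.
conflicting = [
    [0, 2, 3, 4],
    [2, 0, 4, 3],
    [3, 4, 0, 2],
    [4, 3, 2, 0],
]
print(mean_delta(treelike), mean_delta(conflicting))
```

Because the measure depends only on the distances and not on any inferred tree, it can rank distance formulae before tree building, which matches the "topology-independent" role described in the abstract.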

Directory of Open Access Journals