
MPI-PHYLIP: Parallelizing Computationally Intensive Phylogenetic Analysis Routines for the Analysis of Large Protein Families

Author: A Stamatakis
AE Darling
Alexander J. Ropelewski
AP Stamatakis
B Reva
BQ Minh
CA Russo
CJ Douady
D Durand
D Kordis
DA Janies
E Mayr
F Delsuc
FD Ciccarelli
FR Opperdoes
G Altekar
G Talavera
GJ Olsen
HA Schmidt
Hugh B. Nicholas
I. King Jordan
J Felsenstein
J Hempel
J Perozich
JA Sheps
JT Bridgham
K Hamacher
K Tamura
KB Li
MJ Sanderson
Ricardo R. Gonzalez Mendez
S Sankararaman
S Yang
SB Hedges
SL Kosakovsky-Pond
T Wymore
TL Williams
TM Keane
Publication venue: Public Library of Science
Publication date: 01/01/2010

Background: Phylogenetic study of protein sequences provides unique and valuable insights into the molecular and genetic basis of important medical and epidemiological problems as well as insights about the origins and development of physiological features in present-day organisms. Consensus phylogenies based on the bootstrap and other resampling methods play a crucial part in analyzing the robustness of the trees produced for these analyses. Methodology: Our focus was to increase the number of bootstrap replications that can be performed on large protein datasets using the maximum parsimony, distance matrix, and maximum likelihood methods. We have modified the PHYLIP package using MPI to enable large-scale phylogenetic study of protein sequences, using a statistically robust number of bootstrapped datasets, to be performed in a moderate amount of time. This paper discusses the methodology used to parallelize the PHYLIP programs and reports the performance of the parallel PHYLIP programs that are relevant to the study of protein evolution on several protein datasets. Conclusions: Calculations that currently take a few days on a state-of-the-art desktop workstation are reduced to calculations that can be performed over lunchtime on a modern parallel computer. Of the three protein methods tested, the maximum likelihood method scales the best, followed by the distance method, and then the maximum parsimony method. However, the maximum likelihood method requires significant memory resources, which limits its application to mor

CiteSeerX

Plated Cambrian Bilaterians Reveal the Earliest Stages of Echinoderm Evolution

Author: AB Smith
Andrew B. Smith
B David
BJ Swalla
CB Cameron
CD Sumrall
CRC Paul
CW Dunn
D Janies
D Pisani
DJ Bottjer
DTJ Littlewood
FA Bather
G Ubaghs
IA Rahman
Imran A. Rahman
J Esteve
J Mallatt
J Sprinkle
JJ Álvaro
JT Cannon
K Osborn
K Sdzuy
Keith A. Crandall
KJ Peterson
M Pereske
MD Sutton
N Lartillot
O Fatka
P Domínguez-Alonso
RA Robison
S Zamora
Samuel Zamora
Publication venue: Public Library of Science
Publication date: 06/06/2012

Echinoderms are unique in being pentaradiate, having diverged from the ancestral bilaterian body plan more radically than any other animal phylum. This transformation arises during ontogeny, as echinoderm larvae are initially bilateral, then pass through an asymmetric phase, before giving rise to the pentaradiate adult. Many fossil echinoderms are radial and a few are asymmetric, but until now none have been described that show the original bilaterian stage in echinoderm evolution. Here we report new fossils from the early middle Cambrian of southern Europe that are the first echinoderms with a fully bilaterian body plan as adults. Morphologically they are intermediate between two of the most basal classes, the Ctenocystoidea and Cincta. This provides a root for all echinoderms and confirms that the earliest members were deposit feeders, not suspension feeders.

The Francis Crick Institute

Explore Bristol Research

Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases

Author: A Moilanen
A Phillips
A Tehler
AR Lemmon
B Budowle
B Chang
B Grenfell
B Rannala
BD Redelings
BE Martina
C Ceron
C Scholtissek
D Earn
D Franz
D Janies
D Morrison
D Pol
D Sankoff
D Searls
DJ Zwickl
DL Swofford
DM Hillis
E Ghedin
E Holmes
E Ukkonen
EM Rubin
G Laver
H Song
J Antonovics
J Felsenstein
J Huelsenbeck
J Plotkin
J Silvertown
J Thornton
JD Thompson
JK Taubenberger
JL Thorne
JP Carulli
JS Farris
K Li
K Ungchusak
KC Nixon
KP White
L Wang
L Watrous
LA Salter
LH Taylor
LR Foulds
M Gammelin
M Gibbs
M Koopmans
M Metzker
MA Charleston
MA Marra
MD Hendy
MJ Brauer
N Saitou
NM Ferguson
P Palese
PA Goloboff
PA Rota
PO Lewis
Q Wang
R Fleissner
RG Webster
RM Bush
RS Ross
S Lau
S Li
S Morse
S Poe
T Fanning
T Grant
T Ksiazek
The Chinese SARS Molecular Epidemiology Consortium
W Hennig
W Li
W Wheeler
WC Wheeler
WM Fitch
Y Guan
Y Lin
Y Suzuki
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2008

Microorganisms that cause infectious diseases present critical issues of national security, public health, and economic welfare. For example, in recent years, highly pathogenic strains of avian influenza have emerged in Asia, spread through Eastern Europe and threaten to become pandemic. As demonstrated by the coordinated response to Severe Acute Respiratory Syndrome (SARS) and influenza, agents of infectious disease are being addressed via large-scale genomic sequencing. The goal of genomic sequencing projects is to rapidly put large amounts of data in the public domain to accelerate research on disease surveillance, treatment, and prevention. However, our ability to derive information from large comparative genomic datasets lags far behind acquisition. Here we review the computational challenges of comparative genomic analyses, specifically sequence alignment and reconstruction of phylogenetic trees. We present novel analytical results from two important infectious diseases, Severe Acute Respiratory Syndrome (SARS) and influenza. SARS and influenza have similarities and important differences both as biological and comparative genomic analysis problems. Influenza viruses (Orthomyxoviridae) are RNA-based. Current evidence indicates that influenza viruses originate in aquatic birds from wild populations. Influenza has been studied for decades via well-coordinated international efforts. These efforts center on surveillance via antibody characterization of the hemagglutinin (HA) and neuraminidase (N) proteins of the circulating strains to inform vaccine design. However, we still do not have a clear understanding of: 1) various transmission pathways, such as the role of intermediate hosts such as swine and domestic birds, and 2) the key mutation and genomic recombination events that underlie periodic pandemics of influenza. In the past 30 years, sequence data from HA and N loci have become an important data type. In the past year, full genomic data have become prominent.
These data present exciting opportunities to address unanswered questions in influenza pandemics. SARS is caused by a previously unrecognized lineage of coronavirus, SARS-CoV, which like influenza has an RNA-based genome. Although SARS-CoV is widely believed to have originated in animals, there remains disagreement over the candidate animal source that led to the original outbreak of SARS. In contrast to the long history of the study of influenza, SARS was only recognized in late 2002 and the virus that causes SARS has been documented primarily by genomic sequencing. In the past, most studies of influenza were performed on a limited number of isolates and genes suited to a particular problem. Major goals in science today are to understand emerging diseases in broad geographic, environmental, societal, biological, and genomic contexts. Synthesizing diverse information brought together by various researchers is important to find out what can be done to prevent future outbreaks {JON03}. Thus, comprehensive means to organize and analyze large amounts of diverse information are critical. For example, the relationships of isolates and patterns of genomic change observed in large datasets might not be consistent with hypotheses formed on partial data. Moreover, when researchers rely on partial datasets, they restrict the range of possible discoveries. Phylogenetics is well suited to the complex task of understanding emerging infectious disease. Phylogenetic analyses can test many hypotheses by comparing diverse isolates collected from various hosts, environments, and points in time and organizing these data into various evolutionary scenarios. The products of a phylogenetic analysis are a graphical tree of ancestor-descendant relationships and an inferred summary of mutations, recombination events, host shifts, geographic, and temporal spread of the viruses. However, this synthesis comes at a price.
The cost of computation of phylogenetic analysis expands combinatorially as the number of isolates considered increases. Thus, large datasets like those currently produced are commonly considered intractable. We address this problem with synergistic development of heuristic tree search strategies and parallel computing.
Affiliation: Janies, D., Ohio State University; United States. Affiliation: Pol, Diego, Ohio State University; United States. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentin
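The combinatorial growth mentioned in the abstract above can be made concrete with a small worked example. The sketch below is not taken from the paper; it uses the standard count of distinct unrooted, fully bifurcating trees on n labelled tips, (2n-5)!!, as one common way to illustrate why exhaustive search becomes infeasible. The function name `unrooted_tree_count` is a hypothetical helper introduced only for illustration.

```python
def unrooted_tree_count(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating trees on n_taxa labelled tips.

    Uses the double factorial (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5).
    For fewer than 3 taxa there is only one possible tree, so return 1.
    """
    if n_taxa < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (the factor 1 is implicit).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, unrooted_tree_count(n))
```

Even at 10 isolates the count already exceeds two million, which is why heuristic tree search and parallel computing, as the abstract describes, are used in place of exhaustive enumeration.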

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

CONICET Digital

CORE: A Phylogenetically-Curated 16S rDNA Database of the Core Oral Microbiome

Author: A de Lillo
A Stamatakis
Ann L. Griffen
Bastienne Vriesendorp
BJ Paster
CA Lozupone
CE Kazor
CE Shannon
Clifford J. Beall
CM Zmasek
D Dymock
Daniel A. Janies
DH Huson
DM Hillis
Erin L. Gross
Eugene J. Leys
F Armougom
FE Dewhirst
I Letunic
JA Aas
James M. DiFranco
JG Caporaso
Jonathan H. Badger
Jori H. Hardman
JR Cole
KE Ashelford
MA Larkin
MA Munson
Noah D. Firestone
P Hugenholtz
PD Schloss
PS Kumar
Q Wang
Russell A. Faust
S Chakravorty
SF Altschul
T Chen
W Li
WJ Kent
WP Maddison
Publication venue: Public Library of Science
Publication date: 01/04/2011

Comparing bacterial 16S rDNA sequences to GenBank and other large public databases via BLAST often provides results of little use for identification and taxonomic assignment of the organisms of interest. The human microbiome, and in particular the oral microbiome, includes many taxa, and accurate identification of sequence data is essential for studies of these communities. For this purpose, a phylogenetically curated 16S rDNA database of the core oral microbiome, CORE, was developed. The goal was to include a comprehensive and minimally redundant representation of the bacteria that regularly reside in the human oral cavity with computationally robust classification at the level of species and genus. Clades of cultivated and uncultivated taxa were formed based on sequence analyses using multiple criteria, including maximum-likelihood-based topology and bootstrap support, genetic distance, and previous naming. A number of classification inconsistencies for previously named species, especially at the level of genus, were resolved. The performance of the CORE database for identifying clinical sequences was compared to that of three publicly available databases, GenBank nr/nt, RDP and HOMD, using a set of sequencing reads that had not been used in creation of the database. CORE offered improved performance compared to other public databases for identification of human oral bacterial 16S sequences by a number of criteria. In addition, the CORE database and phylogenetic tree provide a framework for measures of community divergence, and the focused size of the database offers advantages of efficiency for BLAST searching of large datasets. The CORE database is available as a searchable interface and for download at http://microbiome.osu.edu

Potential pitfalls of modelling ribosomal RNA data in phylogenetic tree reconstruction: Evidence from case studies in the Metazoa

Author: A Gibson
A Graybeal
A Keller
A Kurabayashi
A Rzhetsky
A Scouras
A Smith
A Stamatakis
B Misof
BJ Swalla
BM von Reumont
C Hudelot
C Simon
C Woese
D Erpenbeck
D Hillis
D Janies
D Littlewood
D Pollock
D Swofford
D Zwickl
DF Robinson
DM Hillis
E Tillier
G Fleck
G Tsagkogeorga
H Jow
H Philippe
H Wada
H Yang
Harald O Letsch
HO Letsch
J Cannone
J Gillespie
J Mears
J Parsch
J Sullivan
J Ware
J Wuyts
K Katoh
K Kjer
Karl M Kjer
L Zeng
M Dohrmann
M Ott
M Salemi
M Schoeniger
M Telford
M Yusupov
MJ Stanhope
MS Springer
MT Dixon
N Galtier
N Savill
O Madsen
O Niehuis
P De Rijk
P Higgs
P Lopez
R Gutell
RR Stocsits
V Gowri-Shankar
W Murphy
W Wheeler
X Xia
Y Van de Peer
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Failure to account for covariation patterns in helical regions of ribosomal RNA (rRNA) genes has the potential to misdirect the estimation of the phylogenetic signal of the data. Furthermore, the extremes of length variation among taxa, combined with regional substitution rate variation, can mislead the alignment of rRNA sequences and thus distort subsequent tree reconstructions. However, recent developments in phylogenetic methodology now allow a comprehensive integration of secondary structures in alignment and tree reconstruction analyses based on rRNA sequences, which has been shown to correct some of these problems. Here, we explore the potentials of RNA substitution models and the interactions of specific model setups with the inherent pattern of covariation in rRNA stems and substitution rate variation among loop regions. Results: We found an explicit impact of RNA substitution models on tree reconstruction analyses. The application of specific RNA models in tree reconstructions is hampered by interaction between the appropriate modelling of covarying sites in stem regions, and excessive homoplasy in some loop regions. RNA models often failed to recover reasonable trees when single-stranded regions are excessively homoplastic, because these regions contribute a greater proportion of the data when covarying sites are essentially downweighted. In this context, the RNA6A model outperformed all other models, including the more parametrized RNA7 and RNA16 models. Conclusions: Our results depict a trade-off between increased accuracy in estimation of interdependencies in helical regions and the risk of magnifying positions lacking phylogenetic signal. We can therefore conclude that caution is warranted when applying rRNA covariation models, and suggest that loop regions be independently screened for phylogenetic signal, and eliminated when they are indistinguishable from random noise.
In addition to covariation and homoplasy, other factors, like non-stationarity of substitution rates and base compositional heterogeneity, can disrupt the signal of ribosomal RNA data. All these factors dictate sophisticated estimation of evolutionary pattern in rRNA data, just as other molecular data require similarly complicated (but different) corrections.

Springer - Publisher Connector

Springer

Influenza A H5N1 Immigration Is Filtered Out at Some International Borders

Geographic spread of highly pathogenic influenza A H5N1, the bird flu strain, appears a necessary condition for accelerating the evolution of a related human-to-human infection. As H5N1 spreads the virus diversifies in response to the variety of socioecological environments encountered, increasing the chance a human infection emerges. Genetic phylogenies have for the most part provided only qualitative evidence that localities differ in H5N1 diversity. For the first time H5N1 variation is quantified across geographic space. We constructed a statistical phylogeography of 481 H5N1 hemagglutinin genetic sequences from samples collected across 28 Eurasian and African localities through 2006. The MigraPhyla protocol showed southern China was a source of multiple H5N1 strains. Nested clade analysis indicated H5N1 was widely dispersed across southern China by both limited dispersal and long distance colonization. The UniFrac metric, a measure of shared phylogenetic history, grouped H5N1 from Indonesia, Japan, Thailand and Vietnam with those from southeastern Chinese provinces engaged in intensive international trade. Finally, H5N1's accumulative phylogenetic diversity was greatest in southern China and declined beyond. The gradient was interrupted by areas of greater and lesser phylogenetic dispersion, indicating H5N1 migration was restricted at some geopolitical borders. Thailand and Vietnam, just south of China, showed significant phylogenetic clustering, suggesting newly invasive H5N1 strains have been repeatedly filtered out at their northern borders even as both countries suffered recurring outbreaks of endemic strains. In contrast, Japan, while successful in controlling outbreaks, has been subjected to multiple introductions of the virus. The analysis demonstrates phylogenies can provide local health officials with more than hypotheses about relatedness.
Pathogen dispersal, the functional relationships among disease ecologies across localities, and the efficacy of control efforts can also be inferred, all from viral genetic sequences alone.

CiteSeerX

eScholarship - University of California

Nephele: genotyping via complete composition vectors and MapReduce

Author: A Drummond
A McKenna
A Rambaut
AL Ghindilis
AS De Groot
AS Fauci
B Budowle
BJ Frey
C Macken
C Notredame
C Ranger
CA Cummings
CB Do
D Janies
D Wang
E Gabriel
EC Holmes
G Giribet
G Lin
G Lu
HL Yang
IM Wallace
J Bullard
J Dean
JC Wilgenbusch
JD Retief
JD Thompson
KH Chu
KS Li
L Campitelli
L Gao
L Stuyver
Lynette Hirschman
M Colosimo
M Li
M Lindh
Marc E Colosimo
Matthew W Peterson
MC Schatz
MW Peterson
N Saitou
RC Edgar
SA McEwen
Scott Mardis
SJ Matthews
T Hughes
TB Reddy
TZ DeSantis Jr
U Rost
V Brendel
X Wu
XF Wan
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism, mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because they increase in computational complexity quickly with the number of sequences. Results: Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware.
Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours. Conclusions: We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome-scale sequence coverage.
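The alignment-free idea in the abstract above, representing each sequence as a vector derived from its k-mers and comparing vectors with a distance measure, can be sketched as follows. This is a deliberately simplified illustration and not Nephele's code: it uses plain normalized k-mer frequencies and cosine distance, whereas the complete composition vector algorithm and the affinity propagation clustering named in the abstract involve further steps not shown here. The names `kmer_vector` and `cosine_distance`, the default `k=3`, and the DNA alphabet are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import product


def kmer_vector(seq: str, k: int = 3, alphabet: str = "ACGT") -> list[float]:
    """Return normalized k-mer frequencies in a fixed (lexicographic) order.

    k-mers containing characters outside the alphabet are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    a = kmer_vector("ACGTACGTACGTTTGA")
    b = kmer_vector("ACGTACGTACGATTGA")
    c = kmer_vector("GGGGCCCCGGGGCCCC")
    print(cosine_distance(a, b), cosine_distance(a, c))
```

Because each vector depends only on its own sequence, the vectors can be computed independently, which is the property that makes this kind of approach amenable to the MapReduce-style parallelization the abstract mentions.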

Springer - Publisher Connector

Phylogeny of Echinoderm Hemoglobins

Author: A Foettinger
A Roesner
A Stamatakis
AA Schaffer
AB Christensen
Ana B. Christensen
C Bonaventura
C Fuchs
C Lechauve
C Manwell
CB Do
CJ Challis
CP Mangum
D Bashford
D Hoogewijs
D Kugelstadt
D Wilson
Daniel Janies
David Hoogewijs
Dean C. Semmens
DJ Bottjer
DR Smith
DT Mitchell
E Sodergren
ER Lankester
F Abascal
F Mauri
F Ronquist
F Sievers
FG Hoffmann
GB Kitto
GD McDonald
Gregorio Linchangco
H Du
Hector Escriva
J Droge
J Shi
JF Storz
JL Herman
JM Lawrence
Joseph L. Herman
JT Trent 3rd
JW Murray
K Katoh
K Tamura
KM Kober
Kord M. Kober
LH Hyman
LJ Parkhurst
M Blank
M Kondo
MA Miller
Maurice R. Elphick
ML Rowe
N Kawada
O Penn
P Tommaso
Q Tu
RA Cameron
RC Edgar
RC Steinmeier
RC Terwilliger
RE Weber
S Whelan
Serge N. Vinogradov
SL Hajduk
SM Baker
SN Vinogradov
T Burmester
T Hankeln
T Suzuki
VJ Smith
X Bailly
Xavier Bailly
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2015

Recent genomic information has revealed that neuroglobin and cytoglobin are the two principal lineages of vertebrate hemoglobins, with the latter encompassing the familiar myoglobin and α-globin/β-globin tetramer hemoglobin, and several minor groups. In contrast, very little is known about hemoglobins in echinoderms, a phylum of exclusively marine organisms closely related to vertebrates, beyond the presence of coelomic hemoglobins in sea cucumbers and brittle stars. We identified about 50 hemoglobins in sea urchin, starfish and sea cucumber genomes and transcriptomes, and used Bayesian inference to carry out a molecular phylogenetic analysis of their relationship to vertebrate sequences, specifically, to assess the hypothesis that the neuroglobin and cytoglobin lineages are also present in echinoderms. The genome of the sea urchin Strongylocentrotus purpuratus encodes several hemoglobins, including a unique chimeric 14-domain globin, 2 androglobin isoforms and a unique single androglobin domain protein. Other strongylocentrotid genomes appear to have similar repertoires of globin genes. We carried out molecular phylogenetic analyses of 52 hemoglobins identified in sea urchin, brittle star and sea cucumber genomes and transcriptomes, using different multiple sequence alignment methods coupled with Bayesian and maximum likelihood approaches. The results demonstrate that there are two major globin lineages in echinoderms, which are related to the vertebrate neuroglobin and cytoglobin lineages. Furthermore, the brittle star and sea cucumber coelomic hemoglobins appear to have evolved independently from the cytoglobin lineage, similar to the evolution of erythroid oxygen binding globins in cyclostomes and vertebrates. The presence of echinoderm globins related to the vertebrate neuroglobin and cytoglobin lineages suggests that the split between neuroglobins and cytoglobins occurred in the deuterostome ancestor shared by echinoderms and vertebrates.