6,215 research outputs found

Evaluation of phylogenetic reconstruction methods using bacterial whole genomes: a simulation based study

Authors: Bentley SD, Colijn C, Harris SR, Kendall M, Lees JA, Parkhill J
Publication venue: F1000 Research Ltd
Publication date: 01/01/2018

Background: Phylogenetic reconstruction is a necessary first step in many analyses which use whole genome sequence data from bacterial populations. There are many available methods to infer phylogenies, and these have various advantages and disadvantages, but few unbiased comparisons of the range of approaches have been made.

Methods: We simulated data from a defined "true tree" using a realistic evolutionary model. We built phylogenies from this data using a range of methods, and compared reconstructed trees to the true tree using two measures, noting the computational time needed for different phylogenetic reconstructions. We also used real data from Streptococcus pneumoniae alignments to compare individual core gene trees to a core genome tree.

Results: We found that, as expected, maximum likelihood trees from good quality alignments were the most accurate, but also the most computationally intensive. Using less accurate phylogenetic reconstruction methods, we were able to obtain results of comparable accuracy; we found that approximate results can rapidly be obtained using genetic distance based methods. In real data we found that highly conserved core genes, such as those involved in translation, gave an inaccurate tree topology, whereas genes involved in recombination events gave inaccurate branch lengths. We also show a tree-of-trees, relating the results of different phylogenetic reconstructions to each other.

Conclusions: We recommend three approaches, depending on requirements for accuracy and computational time. Quicker approaches that do not perform full maximum likelihood optimisation may be useful for many analyses requiring a phylogeny, as generating a high quality input alignment is likely to be the major limiting factor of accurate tree topology. We have publicly released our simulated data and code to enable further comparisons.
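As a rough illustration of the "genetic distance based methods" mentioned in the Results, the sketch below computes a pairwise p-distance (the proportion of differing sites) for each pair of aligned sequences. It is not the authors' released code. The function names, the toy alignment, and the choice to skip sites containing a gap or N are all assumptions made for this example. A matrix like this would typically be passed to a tree-building step such as neighbour joining, which is not shown here.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') or an 'N' are skipped;
    that handling is an assumption of this sketch, not taken from the paper.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def distance_matrix(alignment: dict) -> dict:
    """Map each unordered pair of sample names to its p-distance."""
    return {
        (i, j): p_distance(alignment[i], alignment[j])
        for i, j in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment, for illustration only.
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAAC-A"}
matrix = distance_matrix(toy)
for pair, d in matrix.items():
    print(pair, round(d, 3))
```

Computing the matrix costs time proportional to the number of sequence pairs times the alignment length, with no likelihood optimisation. That is consistent with the abstract's point that approximate results can be obtained rapidly.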

Oxford University Research Archive

Spiral - Imperial College Digital Repository

MPI-PHYLIP: Parallelizing Computationally Intensive Phylogenetic Analysis Routines for the Analysis of Large Protein Families

Authors: A Stamatakis, AE Darling, Alexander J. Ropelewski, AP Stamatakis, B Reva, BQ Minh, CA Russo, CJ Douady, D Durand, D Kordis, DA Janies, E Mayr, F Delsuc, FD Ciccarelli, FR Opperdoes, G Altekar, G Talavera, GJ Olsen, HA Schmidt, Hugh B. Nicholas, I. King Jordan, J Felsenstein, J Hempel, J Perozich, JA Sheps, JT Bridgham, K Hamacher, K Tamura, KB Li, MJ Sanderson, Ricardo R. Gonzalez Mendez, S Sankararaman, S Yang, SB Hedges, SL Kosakovsky-Pond, T Wymore, TL Williams, TM Keane
Publication venue: Public Library of Science
Publication date: 01/01/2010

Background: Phylogenetic study of protein sequences provides unique and valuable insights into the molecular and genetic basis of important medical and epidemiological problems as well as insights about the origins and development of physiological features in present day organisms. Consensus phylogenies based on the bootstrap and other resampling methods play a crucial part in analyzing the robustness of the trees produced for these analyses.

Methodology: Our focus was to increase the number of bootstrap replications that can be performed on large protein datasets using the maximum parsimony, distance matrix, and maximum likelihood methods. We have modified the PHYLIP package using MPI to enable large-scale phylogenetic study of protein sequences, using a statistically robust number of bootstrapped datasets, to be performed in a moderate amount of time. This paper discusses the methodology used to parallelize the PHYLIP programs and reports the performance of the parallel PHYLIP programs that are relevant to the study of protein evolution on several protein datasets.

Conclusions: Calculations that currently take a few days on a state of the art desktop workstation are reduced to calculations that can be performed over lunchtime on a modern parallel computer. Of the three protein methods tested, the maximum likelihood method scales the best, followed by the distance method, and then the maximum parsimony method. However, the maximum likelihood method requires significant memory resources, which limits its application to mor
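The bootstrap resampling described in the Methodology can be sketched as follows: each replicate draws alignment columns with replacement, keeping the same column choice across all sequences. This is a minimal illustration only and makes no claim about how PHYLIP or MPI-PHYLIP implement it. The function names, the seed handling, and the toy protein alignment are assumptions. The replicates are generated independently of one another, which is what lets the work be split across processes; the MPI distribution itself is not shown.

```python
import random


def bootstrap_replicate(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement (one bootstrap replicate).

    The same sampled column indices are applied to every sequence so that
    each resampled column remains a genuine column of the original alignment.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def bootstrap_replicates(alignment: dict, n_replicates: int, seed: int = 0) -> list:
    """Generate n_replicates bootstrap datasets from one alignment."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]


# Hypothetical toy protein alignment, for illustration only.
toy_proteins = {"p1": "MKTAYIAK", "p2": "MKTAHIAK", "p3": "MRTAYLAK"}
replicates = bootstrap_replicates(toy_proteins, n_replicates=100)
print(len(replicates), replicates[0])
```

Each replicate would then be passed to a tree-building method, and the resulting trees summarised as a consensus.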

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central