349 research outputs found
Towards a faster and accurate supertree inference
Phylogenetic inference is one of the most challenging and important problems in computational biology. However, computing evolutionary links on data sets containing only a few thousand taxa easily becomes a daunting task. Moreover, recent advances in next-generation sequencing technologies are making this problem even harder, whether in terms of complexity or scale. Therefore, phylogenetic inference requires new algorithms and methods to handle the unprecedented growth of biological data. In this paper, we identify several types of parallelism that are available while refining a supertree. We also present four improvements that we made to SuperFine, a state-of-the-art supertree (meta)method, which add support for: i) using FastTree as the inference tool; ii) using a parallel version of FastTree, or RAxML, as the inference tool; iii) exploiting intra-polytomy parallelism within the so-called polytomy refinement phase; and iv) exploiting, at the same time, inter-polytomy and intra-polytomy parallelism within the polytomy refinement phase. Together, these improvements allow an efficient and transparent exploitation of hybrid-polytomy parallelism. Additionally, we pinpoint how future contributions should enhance the performance of such applications. Our studies show groundbreaking results in terms of the achieved speedups, especially when using biological data sets. Moreover, we show that the new parallel strategy, which exploits the hybrid-polytomy parallelism within the polytomy refinement phase, exhibits good scalability, even in the presence of asymmetric sets of tasks. Furthermore, the achieved results show that the radical improvement in performance does not impair tree accuracy, which is a key issue in phylogenetic inference.
This research was partially supported by Fundação para a Ciência e a Tecnologia (grant SFRH/BD/42634/2007). We thank Rui Gonçalves, Rui Silva, and Tandy Warnow for fruitful discussions and valuable feedback. We thank Keshav Pingali for his valuable support and sponsorship to let us execute jobs on TACC machines. We are deeply grateful to Rui Oliveira, without whom it would not be possible to present this work. We are very grateful to the anonymous reviewers for the evaluation of our paper and for the constructive criticism.
Supertree-like methods for genome-scale species tree estimation
A critical step in many biological studies is the estimation of evolutionary trees (phylogenies) from genomic data. Of particular interest is the species tree, which illustrates how a set of species evolved from a common ancestor. While species trees were previously estimated from a few regions of the genome (genes), it is now widely recognized that biological processes can cause the evolutionary histories of individual genes to differ from each other and from the species tree. This heterogeneity across the genome is phylogenetic signal that can be leveraged to estimate species evolution with greater accuracy. Hence, species tree estimation is expected to be greatly aided by current large-scale sequencing efforts, including the 5000 Insect Genomes Project, the 10000 Plant Genomes Project, the (~60000) Vertebrate Genomes Project, and the Earth BioGenome Project, which aims to assemble genomes (or at least genome-scale data) for 1.5 million eukaryotic species in the next ten years. To analyze these forthcoming datasets, species tree estimation methods must scale to thousands of species and tens of thousands of genes; however, many of the current leading methods, which are heuristics for NP-hard optimization problems, can be prohibitively expensive on datasets of this size. In this dissertation, we argue that new methods are needed to enable scalable and statistically rigorous species tree estimation pipelines; we then seek to address this challenge through the introduction of three supertree-like methods: NJMerge, TreeMerge, and FastMulRFS. For these methods, we present theoretical results (worst-case running time analyses and proofs of statistical consistency) as well as empirical results on simulated datasets (and a fungal dataset for FastMulRFS). Overall, these methods enable statistically consistent species tree estimation pipelines that achieve comparable accuracy to the dominant optimization-based approaches while dramatically reducing running time
Developing and applying supertree methods in Phylogenomics and Macroevolution
Supertrees can be used to combine partially overlapping trees and generate more inclusive phylogenies. It has been proposed that a Maximum Likelihood (ML) supertree method (SM) could be developed using an exponential probability distribution to model errors in the input trees (given a proposed supertree). When the tree-to-tree distances used in the ML computation are symmetric differences, the ML SM has been shown to be equivalent to a Majority-Rule consensus SM, and hence, exactly as the latter, it has the desirable property of being a median tree (with reference to the set of input trees). The ability to estimate the likelihood of supertrees allows implementing Bayesian (MCMC) approaches, which have the advantage of allowing the support for the clades in a supertree to be properly estimated. I present here the L.U.St software package; it contains the first implementation of a ML SM and allows, for the first time, statistical tests on supertrees. I also characterized the first implementation of the Bayesian (MCMC) SM. Both the ML and the Bayesian (MCMC) SMs have been tested for and found to be immune to biases. The Bayesian (MCMC) SM is applied to the reanalyses of a variety of datasets (i.e. the datasets for the Metazoa and the Carnivora), and I have also recovered the first Bayesian supertree-based phylogeny of the Eubacteria and the Archaebacteria. These new SMs are discussed with reference to other, well-known SMs like Matrix Representation with Parsimony. Both the ML and Bayesian SM offer multiple attractive advantages over current alternatives.
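The exponential error model mentioned in this abstract can be illustrated with a minimal sketch. It is not the L.U.St implementation. It assumes rooted trees written as nested Python tuples, measures the symmetric difference on clusters after restricting the candidate supertree to each input tree's taxa, and treats the normalizing constant of the exponential distribution as independent of the supertree. All function names and the `beta` parameter are hypothetical. Under these assumptions the log-likelihood is, up to a constant, `-beta` times the summed distance, so the ML supertree is the candidate that minimizes the total symmetric difference, i.e. a median tree.

```python
def clusters(tree):
    """Return (leaf set, clusters of size >= 2) for a rooted nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found |= child_clusters
    found.add(leaves)  # cluster induced by this internal node
    return leaves, found


def restricted_clusters(tree, taxa):
    """Clusters of `tree` after pruning it down to the leaves in `taxa`."""
    _, found = clusters(tree)
    return {c & taxa for c in found if len(c & taxa) >= 2}


def symmetric_difference(supertree, input_tree):
    """Number of clusters present in exactly one of the two trees."""
    taxa, input_clusters = clusters(input_tree)
    return len(restricted_clusters(supertree, taxa) ^ input_clusters)


def log_likelihood(supertree, input_trees, beta=1.0):
    """Log-likelihood up to an additive constant (assumed independent of the supertree)."""
    return -beta * sum(symmetric_difference(supertree, t) for t in input_trees)


def best_supertree(candidates, input_trees, beta=1.0):
    """Pick the candidate with the highest log-likelihood (hypothetical helper)."""
    return max(candidates, key=lambda s: log_likelihood(s, input_trees, beta))


# Three partially overlapping input trees and two candidate supertrees.
input_trees = [
    (("A", "B"), "C"),
    (("A", "B"), "D"),
    (("B", "C"), "D"),
]
s1 = ((("A", "B"), "C"), "D")
s2 = (("A", ("B", "C")), "D")

chosen = best_supertree([s2, s1], input_trees)
```

Here `s1` displays all three input trees, so its summed distance is zero and it is chosen over `s2`. A real method would search tree space heuristically rather than score a fixed list of candidates.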
Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches
Abstract Background Biology has increasingly recognized the necessity to build and utilize larger phylogenies to address broad evolutionary questions. Large phylogenies have facilitated the discovery of differential rates of molecular evolution between trees and herbs. They have helped us understand the diversification patterns of mammals as well as the patterns of seed evolution. In addition to these broad evolutionary questions there is increasing awareness of the importance of large phylogenies for addressing conservation issues such as biodiversity hotspots and response to global change. Two major classes of methods have been employed to accomplish the large tree-building task: supertrees and supermatrices. Although these methods are continually being developed, they have yet to be made fully accessible to comparative biologists making extremely large trees rare. Results Here we describe and demonstrate a modified supermatrix method termed mega-phylogeny that uses databased sequences as well as taxonomic hierarchies to make extremely large trees with denser matrices than supermatrices. The two major challenges facing large-scale supermatrix phylogenetics are assembling large data matrices from databases and reconstructing trees from those datasets. The mega-phylogeny approach addresses the former as the latter is accomplished by employing recently developed methods that have greatly reduced the run time of large phylogeny construction. We present an algorithm that requires relatively little human intervention. The implemented algorithm is demonstrated with a dataset and phylogeny for Asterales (within Campanulidae) containing 4954 species and 12,033 sites and an rbcL matrix for green plants (Viridiplantae) with 13,533 species and 1,401 sites. Conclusion By examining much larger phylogenies, patterns emerge that were otherwise unseen. 
The phylogeny of Viridiplantae successfully reconstructs major relationships of vascular plants that previously required many more genes. These demonstrations underscore the importance of using large phylogenies to uncover important evolutionary patterns and we present a fast and simple method for constructing these phylogenies.
- …