23 research outputs found
Estimating phylogenetic trees from genome-scale data
As researchers collect increasingly large molecular data sets to reconstruct
the Tree of Life, the heterogeneity of signals in the genomes of diverse
organisms poses challenges for traditional phylogenetic analysis. A class of
phylogenetic methods known as "species tree methods" has been proposed to
directly address one important source of gene tree heterogeneity, namely the
incomplete lineage sorting or deep coalescence that occurs when evolving
lineages radiate rapidly, resulting in a diversity of gene trees from a single
underlying species tree. Although such methods are gaining in popularity, they
are being adopted with caution in some quarters, in part because of an
increasing number of examples of strong phylogenetic conflict between
concatenation or supermatrix methods and species tree methods. Here we review
theory and empirical examples that help clarify these conflicts. Thinking of
concatenation as a special case of the more general model provided by the
multispecies coalescent can help explain a number of differences in the
behavior of the two methods on phylogenomic data sets. Recent work suggests
that species tree methods are more robust than concatenation approaches to some
of the classic challenges of phylogenetic analysis, including rapidly evolving
sites in DNA sequences, base compositional heterogeneity and long branch
attraction. We show that approaches such as binning, designed to augment the
signal in species tree analyses, can distort the distribution of gene trees and
are inconsistent. Computationally efficient species tree methods that
incorporate biological realism are a key to phylogenetic analysis of whole
genome data. Comment: 39 pages, 3 figure
Estimating phylogenetic trees from genome-scale data
The heterogeneity of signals in the genomes of diverse organisms poses challenges for traditional phylogenetic analysis. Phylogenetic methods known as “species tree” methods have been proposed to directly address one important source of gene tree heterogeneity, namely the incomplete lineage sorting that occurs when evolving lineages radiate rapidly, resulting in a diversity of gene trees from a single underlying species tree. Here we review theory and empirical examples that help clarify conflicts between species tree and concatenation methods, and misconceptions in the literature about the performance of species tree methods. Considering concatenation as a special case of the multispecies coalescent model helps explain differences in the behavior of the two methods on phylogenomic data sets. Recent work suggests that species tree methods are more robust than concatenation approaches to some of the classic challenges of phylogenetic analysis, including rapidly evolving sites in DNA sequences and long-branch attraction. We show that approaches, such as binning, designed to augment the signal in species tree analyses can distort the distribution of gene trees and are inconsistent. Computationally efficient species tree methods incorporating biological realism are a key to phylogenetic analysis of whole-genome data. Organismic and Evolutionary Biolog
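As a rough illustration of why rapid radiations produce a diversity of gene trees, the sketch below computes the standard three-taxon gene tree probabilities under the multispecies coalescent. It is not taken from the paper: the function name `gene_tree_probabilities` and the parameter `t` (internal branch length in coalescent units) are assumptions chosen for this example.

```python
import math


def gene_tree_probabilities(t):
    """Three-taxon gene tree topology probabilities under the
    multispecies coalescent.

    t is the internal branch length of the species tree in coalescent
    units (a hypothetical parameter name used only for this sketch).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # Each of the two discordant topologies has probability exp(-t) / 3.
    discordant = math.exp(-t) / 3.0
    # The topology matching the species tree takes the remainder.
    concordant = 1.0 - 2.0 * discordant
    return {
        "concordant": concordant,
        "discordant_1": discordant,
        "discordant_2": discordant,
    }


if __name__ == "__main__":
    # Shorter internal branches (rapid radiations) give more discordance.
    for t in (0.1, 1.0, 5.0):
        print(t, gene_tree_probabilities(t))
```

As the internal branch shrinks toward zero, the three topologies approach equal probability, which is the regime in which concatenation and species tree methods are most likely to disagree.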
Weighted Statistical Binning: enabling statistically consistent genome-scale phylogenetic analyses
Because biological processes can make different loci have different
evolutionary histories, species tree estimation requires multiple loci from
across the genome. While many processes can result in discord between gene
trees and species trees, incomplete lineage sorting (ILS), modeled by the
multi-species coalescent, is considered to be a dominant cause for gene tree
heterogeneity. Coalescent-based methods have been developed to estimate species
trees, many of which operate by combining estimated gene trees, and so are
called summary methods. Because summary methods are generally fast, they have
become very popular techniques for estimating species trees from multiple loci.
However, recent studies have established that summary methods can have reduced
accuracy in the presence of gene tree estimation error, and also that many
biological datasets have substantial gene tree estimation error, so that
summary methods may not be highly accurate under biologically realistic
conditions. Mirarab et al. (Science 2014) presented the statistical binning
technique to improve gene tree estimation in multi-locus analyses, and showed
that it improved the accuracy of MP-EST, one of the most popular
coalescent-based summary methods. Statistical binning, which uses a simple
statistical test for combinability and then uses the larger sets of genes to
re-calculate gene trees, has good empirical performance, but using statistical
binning within a phylogenomics pipeline does not have the desirable property of
being statistically consistent. We show that weighting the recalculated gene
trees by the bin sizes makes statistical binning statistically consistent under
the multispecies coalescent, and maintains the good empirical performance.
Thus, "weighted statistical binning" enables highly accurate genome-scale
species tree estimation, and is also statistically consistent under the
multi-species coalescent model. Comment: (1) In Press, PLoS ON
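The weighting step described in the abstract can be sketched as follows, assuming the bins and their recalculated trees are already available. This is a minimal illustration, not the authors' implementation: the function name `weight_bin_trees`, the dictionary layout, and the Newick strings are all hypothetical.

```python
def weight_bin_trees(bins, bin_trees):
    """Repeat each bin's recalculated tree once per gene in the bin.

    bins:      dict mapping a bin id to the list of gene names in that bin
    bin_trees: dict mapping a bin id to the tree (Newick string)
               re-estimated from that bin's combined genes
    Returns a list of trees in which a bin of size k contributes k copies,
    so larger bins carry proportionally more weight in the summary method.
    """
    weighted = []
    for bin_id, genes in bins.items():
        weighted.extend([bin_trees[bin_id]] * len(genes))
    return weighted


if __name__ == "__main__":
    # Hypothetical toy input: two bins covering four genes.
    bins = {"bin1": ["g1", "g2", "g3"], "bin2": ["g4"]}
    bin_trees = {"bin1": "((A,B),C);", "bin2": "((A,C),B);"}
    for tree in weight_bin_trees(bins, bin_trees):
        print(tree)
```

The resulting list has one tree per original gene and could then be passed to a summary method in place of the individually estimated gene trees.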
The evolution of complex calls in Meadow Katydids
Meadow Katydids (genera Orchelimum and Conocephalus) are a speciose group, often found in habitats where several species of the group live in sympatry. They produce complex calls with two distinct phrases, "buzzing" and "ticking". These two phrases are organized in a highly diverse way across species. This diversity of call patterns in Meadow Katydids provides an excellent opportunity to study the evolution of complex calls comparatively. We tested the function of the two call phrases in male-male interactions. We examined the structure of the male call in the context of communities to identify candidate traits (i.e. traits likely involved in reproductive isolation). We constructed a molecular phylogeny from twenty species of Meadow Katydids, and examined the phylogenetic signal within call traits. Taken together, the results suggest that ticking evolved in the context of male-male interaction, that buzzing has been important for diversification, and that in some species females have co-opted the tick to also function in reproductive isolation. Importantly, we have also designed and field-tested a plan to use Meadow Katydids as tools in primary, secondary, and post-secondary classrooms/laboratories. Includes bibliographical reference
How challenging RADseq data turned out to favor coalescent-based species tree inference. A case study in Aichryson (Crassulaceae)
Analysing multiple genomic regions while incorporating detection and qualification of discordance among regions has become standard for understanding phylogenetic relationships. In plants, which usually have comparatively large genomes, this is feasible by combining reduced-representation library (RRL) methods with high-throughput sequencing, enabling the cost-effective acquisition of genomic data for thousands of loci from hundreds of samples. One popular RRL method is RADseq. A major disadvantage of established RADseq approaches is the rather short fragment and sequencing range, leading to loci of little individual phylogenetic information. This issue hampers the application of coalescent-based species tree inference. The modified RADseq protocol presented here, which targets ca. 5,000 loci of 300-600 nt length sequenced with the latest short-read-sequencing (SRS) technology, has the potential to overcome this drawback. To illustrate the advantages of this approach we use the study group Aichryson Webb & Berthelot (Crassulaceae), a plant genus that diversified on the Canary Islands. The data analysis approach used here aims at a careful quality control of the long-loci dataset. It involves an informed selection of thresholds for accurate clustering; a thorough exploration of locus properties, such as locus length, coverage and variability, to identify potentially biased data; and a comparative phylogenetic inference of filtered datasets, accompanied by an evaluation of the resulting bootstrap (BS) support and gene and site concordance factor values, to improve overall resolution of the resulting phylogenetic trees. The final dataset contains variable loci with an average length of 373 nt and facilitates species tree estimation using a coalescent-based summary approach. Additional improvements brought by the approach are critically discussed.
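The kind of locus-level quality control described above (filtering on length, coverage and variability) might look like the following sketch. It is not the authors' pipeline: the `Locus` record, the function `filter_loci`, and the coverage and variability thresholds are hypothetical, and only the 300-600 nt length range comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    name: str
    length: int          # aligned locus length in nt
    coverage: float      # mean read depth (hypothetical measure)
    variable_sites: int  # number of variable alignment columns


def filter_loci(loci, min_length=300, max_length=600,
                min_coverage=10.0, min_variable_sites=1):
    """Keep loci inside the target length range that are well covered
    and carry at least some variation.

    The length range follows the abstract; the coverage and variability
    defaults are placeholder assumptions, not values from the study.
    """
    return [
        locus for locus in loci
        if min_length <= locus.length <= max_length
        and locus.coverage >= min_coverage
        and locus.variable_sites >= min_variable_sites
    ]


if __name__ == "__main__":
    # Hypothetical toy loci.
    loci = [
        Locus("loc1", 373, 25.0, 12),  # passes all filters
        Locus("loc2", 150, 30.0, 4),   # too short
        Locus("loc3", 420, 3.0, 9),    # low coverage
        Locus("loc4", 500, 40.0, 0),   # invariant
    ]
    print([locus.name for locus in filter_loci(loci)])
```

In practice the thresholds would be chosen after exploring the distribution of each locus property, and the filtered sets compared by the support and concordance values of the trees they produce.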