Evaluating, Accelerating and Extending the Multispecies Coalescent Model of Evolution
So much research builds on evolutionary histories of species and
genes. They are used in genomics to infer synteny, in ecology to
describe and predict biodiversity, and in molecular biology to
transfer knowledge acquired in model organisms to humans and
crops. Beyond downstream applications, expanding our knowledge of
life on Earth is important in its own right. From Naturalis
Historia to On the Origin of Species, the acquisition of this
knowledge has been a part of human development.
Evolutionary histories are commonly represented as trees, where a
common ancestor progressively splits into descendant species or
alleles. Time trees add more information by using height to
represent genetic distance or elapsed time. Species and gene
trees can be inferred from molecular sequences using methods
which are explicitly model-based, or implicitly assume or are
statistically consistent with a particular model of evolution.
One such model, the multispecies coalescent (MSC), is the topic
of my thesis. Under this model, separate trees are inferred for
the species history and for each gene’s history. Gene trees are
embedded within the species tree according to a coalescent
process.
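As a hedged illustration of the coalescent process mentioned above, the following minimal sketch simulates the time to the most recent common ancestor (TMRCA) for a sample of lineages within a single population. It is not the full MSC and is not taken from the thesis: it assumes the standard Kingman coalescent with time measured in coalescent units, and the function names `simulate_tmrca` and `expected_tmrca` are hypothetical.

```python
import random


def simulate_tmrca(n_lineages, rng):
    """Simulate the TMRCA of n_lineages under the standard coalescent.

    Assumption: a single constant-size population, time in coalescent units.
    While k lineages remain, the waiting time to the next coalescence is
    exponential with rate k * (k - 1) / 2 (the number of lineage pairs).
    """
    t = 0.0
    k = n_lineages
    while k > 1:
        rate = k * (k - 1) / 2.0
        t += rng.expovariate(rate)
        k -= 1  # two lineages merge into one
    return t


def expected_tmrca(n_lineages):
    """Analytical expectation of the TMRCA: 2 * (1 - 1/n) coalescent units."""
    return 2.0 * (1.0 - 1.0 / n_lineages)


if __name__ == "__main__":
    rng = random.Random(1)
    draws = [simulate_tmrca(5, rng) for _ in range(20000)]
    print(sum(draws) / len(draws), expected_tmrca(5))
```

Under the MSC, a process of this kind runs independently for each gene inside every branch of the species tree, with lineages only able to coalesce once they share an ancestral population.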
Researchers often avoid the MSC when reconstructing time trees
because of claims that available implementations are too
computationally demanding. Instead, the species history is
inferred using a single tree by concatenating the sequences from
each gene. I began my thesis research by evaluating the effect of
this approximation. In a realistic simulation based on parameters
inferred from empirical data, concatenation was grossly
inaccurate, especially when estimating recent species divergence
times. In a later simulation study I demonstrated that when using
concatenation, credible intervals often excluded the true
values.
To address reluctance towards using the MSC, I developed a faster
implementation of the model. StarBEAST2 is a Markov chain Monte
Carlo (MCMC) method, meaning it characterizes the probability
distribution over trees by randomly walking the parameter space.
I improved computational performance by developing more efficient
proposals used to traverse the space, and reducing the number of
parameters in the model through analytical integration of
population sizes.
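The random-walk idea behind MCMC can be sketched with a toy example. The code below is a minimal Metropolis sampler on a one-dimensional target, not StarBEAST2's algorithm: the standard normal target, the uniform proposal, and the names `log_target` and `metropolis` are all assumptions chosen for illustration. In a real analysis the state would include trees and model parameters, and the proposals would be the specialized operators described above.

```python
import math
import random


def log_target(x):
    """Unnormalized log density of a standard normal (toy stand-in for a posterior)."""
    return -0.5 * x * x


def metropolis(n_steps, step_size, rng, x0=0.0):
    """Random-walk Metropolis sampler with a symmetric uniform proposal.

    Returns the list of sampled states and the fraction of accepted proposals.
    """
    x = x0
    samples = []
    accepted = 0
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step_size, step_size)
        log_ratio = log_target(proposal) - log_target(x)
        # Accept with probability min(1, target(proposal) / target(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps


if __name__ == "__main__":
    samples, rate = metropolis(50000, 2.5, random.Random(0))
    mean = sum(samples) / len(samples)
    print(f"mean={mean:.3f} acceptance={rate:.2f}")
```

The proposal width (`step_size` here) controls how efficiently the chain explores the space, which is why better proposals can speed up convergence without changing the target distribution.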
Despite its sophistication, the MSC has theoretical limitations.
One is that the substitution rate is assumed to stay constant, or
uncorrelated between lineages of different genes. However,
substitution rates do vary and are associated with species traits
like body size. I addressed this assumption in StarBEAST2 by
extending the MSC to estimate substitution rates for each
species. Another assumption is that genetic material cannot be
transferred horizontally, but a more general model called the
multispecies network coalescent (MSNC) permits introgression of
alleles across species boundaries. My collaborators and I have
developed and evaluated an MCMC implementation of the MSNC.
My final thesis project was to combine the MSC with the
fossilized birth-death (FBD) process, which models how species
are fossilized and sampled through time. To demonstrate the
utility of the FBD-MSC model, I used it to reconstruct the
evolutionary history of Caninae (dogs and foxes) using fossil
data and molecular sequences.
Diversification of the C-TERMINALLY ENCODED PEPTIDE (CEP) gene family in angiosperms, and evolution of plant-family specific CEP genes
BACKGROUND Small, secreted signaling peptides work in parallel with phytohormones to control important aspects of plant growth and development. Genes from the C-TERMINALLY ENCODED PEPTIDE (CEP) family produce such peptides which negatively regulate plant growth, especially under stress, and affect other important developmental processes. To illuminate how the CEP gene family has evolved within the plant kingdom, including its emergence, diversification and variation between lineages, a comprehensive survey was undertaken to identify and characterize CEP genes in 106 plant genomes.
RESULTS Using a motif-based system developed for this study to identify canonical CEP peptide domains, a total of 916 CEP genes and 1,223 CEP domains were found in angiosperms and for the first time in gymnosperms. This defines a narrow band for the emergence of CEP genes in plants, from the divergence of lycophytes to the angiosperm/gymnosperm split. Both CEP genes and domains were found to have diversified in angiosperms, particularly in the Poaceae and Solanaceae plant families. Multispecies orthologous relationships were determined for 22% of identified CEP genes, and further analysis of those groups found selective constraints upon residues within the CEP peptide and within the previously little-characterized variable region. An examination of public Oryza sativa RNA-Seq datasets revealed an expression pattern that links OsCEP5 and OsCEP6 to panicle development and flowering, and CEP gene trees reveal these emerged from a duplication event associated with the Poaceae plant family.
CONCLUSIONS The characterization of the plant-family specific CEP genes OsCEP5 and OsCEP6, the association of CEP genes with angiosperm-specific development processes like panicle development, and the diversification of CEP genes in angiosperms provide further support for the hypothesis that CEP genes have been integral to the evolution of novel traits within the angiosperm lineage. Beyond these findings, the comprehensive set of CEP genes and their properties reported here will be a resource for future research on CEP genes and peptides.
We thank Jason Bragg for his input and advice on inferring gene trees. This work was supported by an Australian Research Council Discovery Project grant (DP120101893). HAO received financial support (UHS10488) to conduct this study from the Grains Research and Development Council.
StarBEAST2 Brings Faster Species Tree Inference and Accurate Estimates of Substitution Rates
Fully Bayesian multispecies coalescent (MSC) methods like *BEAST estimate species trees from multiple sequence alignments. Today thousands of genes can be sequenced for a given study, but using that many genes with *BEAST is intractably slow. An alternative is to use heuristic methods which compromise accuracy or completeness in return for speed. A common heuristic is concatenation, which assumes that the evolutionary history of each gene tree is identical to the species tree. This is an inconsistent estimator of species tree topology, a worse estimator of divergence times, and induces spurious substitution rate variation when incomplete lineage sorting is present. Another class of heuristics directly motivated by the MSC avoids many of the pitfalls of concatenation but cannot be used to estimate divergence times. To enable fuller use of available data and more accurate inference of species tree topologies, divergence times, and substitution rates, we have developed a new version of *BEAST called StarBEAST2. To improve convergence rates we add analytical integration of population sizes, novel MCMC operators and other optimizations. Computational performance improved by 13.5× and 13.8× respectively when analyzing two empirical data sets, and an average of 33.1× across 30 simulated data sets. To enable accurate estimates of per-species substitution rates, we introduce species tree relaxed clocks, and show that StarBEAST2 is a more powerful and robust estimator of rate variation than concatenation. StarBEAST2 is available through the BEAUTi package manager in BEAST 2.4 and above.
This work was supported by a Rutherford Discovery Fellowship awarded to A.J.D. by the Royal Society of New Zealand. H.A.O. was supported by an Australian Laureate Fellowship awarded to Craig Moritz by the Australian Research Council (FL110100104).
CEP-CEPR1 signalling inhibits the sucrose-dependent enhancement of lateral root growth
Lateral root (LR) proliferation is a major determinant of soil nutrient uptake. How resource allocation controls the extent of LR growth remains unresolved. We used genetic, physiological, transcriptomic, and grafting approaches to define a role for C-TERMINALLY ENCODED PEPTIDE RECEPTOR 1 (CEPR1) in controlling sucrose-dependent LR growth. CEPR1 inhibited LR growth in response to applied sucrose, other metabolizable sugars, and elevated light intensity. Pathways through CEPR1 restricted LR growth by reducing LR meristem size and the length of mature LR cells. RNA-sequencing of wild-type (WT) and cepr1-1 roots with or without sucrose treatment revealed an intersection of CEP–CEPR1 signalling with the sucrose transcriptional response. Sucrose up-regulated several CEP genes, supporting a specific role for CEP–CEPR1 in the response to sucrose. Moreover, genes with basally perturbed expression in cepr1-1 overlap with WT sucrose-responsive genes significantly. We found that exogenous CEP inhibited LR growth via CEPR1 by reducing LR meristem size and mature cell length. This result is consistent with CEP–CEPR1 acting to curtail the extent of sucrose-dependent LR growth. Reciprocal grafting indicates that LR growth inhibition requires CEPR1 in both the roots and shoots. Our results reveal a new role for CEP–CEPR1 signalling in controlling LR growth in response to sucrose.
An Australian Research Council grant to MAD (DP150104250) supported this work. KC was supported by an ANU PhD scholarship. MT was supported by an Australian Post Graduate award.
Bayesian inference of species networks from multilocus sequence data
Reticulate species evolution, such as hybridization or introgression, is relatively common in nature. In the presence of reticulation, species relationships can be captured by a rooted phylogenetic network, and orthologous gene evolution can be modeled as bifurcating gene trees embedded in the species network. We present a Bayesian approach to jointly infer species networks and gene trees from multilocus sequence data. A novel birth-hybridization process is used as the prior for the species network, and we assume a multispecies network coalescent prior for the embedded gene trees. We verify the ability of our method to correctly sample from the posterior distribution, and thus to infer a species network, through simulations. To quantify the power of our method, we reanalyze two large data sets of genes from spruces and yeasts. For the three closely related spruces, we verify the previously suggested homoploid hybridization event in this clade; for the yeast data, we find extensive hybridization events. Our method is available within the BEAST 2 add-on SpeciesNetwork, and thus provides an extensible framework for Bayesian inference of reticulate evolution.
This research was supported by the European Research Council under the Seventh Framework Programme of the European Commission (PhyPD: grant number 335529 to T.S.). C.Z. acknowledges his salary as well as a visit covered by this grant to the Centre for Computational Evolution, University of Auckland, New Zealand in mid-2016. H.A.O. was supported by an Australian Laureate Fellowship awarded to Craig Moritz by the Australian Research Council (FL110100104).
Validation and description of two new north-western Australian rainbow skinks with multispecies coalescent methods and morphology
While methods for genetic species delimitation have noticeably improved in the last decade, this remains a work in progress. Ideally, model based approaches should be applied and considered jointly with other lines of evidence, primarily morphology and geography, in an integrative taxonomy framework. Deep phylogeographic divergences have been reported for several species of Carlia skinks, but only for some eastern taxa have species boundaries been formally tested. The present study does this and revises the taxonomy for two species from northern Australia, Carlia johnstonei and C. triacantha. We introduce an approach that is based on the recently published method StarBEAST2, which uses multilocus data to explore the support for alternative species delimitation hypotheses using Bayes Factors (BFD). We apply this method, jointly with two other multispecies coalescent methods, using an extensive (from 2,163 exons) data set along with measures of 11 morphological characters. We use this integrated approach to evaluate two new candidate species previously revealed in phylogeographic analyses of rainbow skinks (genus Carlia) in Western Australia. The results based on BFD StarBEAST2, BFD* SNAPP and BPP genetic delimitation, together with morphology, support each of the four recently identified Carlia lineages as separate species. The BFD StarBEAST2 approach yielded results highly congruent with those from BFD* SNAPP and BPP. This supports use of the robust multilocus multispecies coalescent StarBEAST2 method for species delimitation, which does not require a priori resolved species or gene trees. Compared to the situation in C. triacantha, morphological divergence was greater between the two lineages within Kimberley endemic C. johnstonei, which also had deeper divergent histories. This congruence supports recognition of two species within C. johnstonei. Nevertheless, the combined evidence also supports recognition of two taxa within the more widespread C. triacantha. 
With this work, we describe two new species, Carlia insularis sp. nov. and Carlia isostriacantha sp. nov., in the northwest of Australia. This contributes to increasing recognition that this region of tropical Australia has a rich and unique fauna.
This research was supported by grants from the Australian Biological Resources Study to CM and Scott Keogh, and from the Australian Research Council to CM (ARC FL110100104). ACAS is supported by the FCT grant SFRH/BD/88740/2012.
Phylovar: toward scalable phylogeny-aware inference of single-nucleotide variations from single-cell DNA sequencing data
Motivation: Single-nucleotide variants (SNVs) are the most common variations in the human genome. Recently developed methods for SNV detection from single-cell DNA sequencing data, such as SCI and scVILP, leverage the evolutionary history of the cells to overcome the technical errors associated with single-cell sequencing protocols. Despite being accurate, these methods are not scalable to the extensive genomic breadth of single-cell whole-genome (scWGS) and whole-exome sequencing (scWES) data.
Results: Here, we report on a new scalable method, Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Through benchmarking on simulated datasets under different settings, we show that Phylovar outperforms SCI in terms of running time while being more accurate than Monovar (which is not phylogeny-aware) in terms of SNV detection. Furthermore, we applied Phylovar to two real biological datasets: an scWES triple-negative breast cancer dataset consisting of 32 cells and 3375 loci, and an scWGS dataset of neuron cells from a normal human brain containing 16 cells and approximately 2.5 million loci. For the cancer data, Phylovar detected somatic SNVs with high or moderate functional impact that were also supported by a bulk sequencing dataset, and for the neuron dataset, Phylovar identified 5745 SNVs with non-synonymous effects, some of which were associated with neurodegenerative diseases.
Availability and implementation: Phylovar is implemented in Python and is publicly available at https://github.com/NakhlehLab/Phylovar.
Funding: National Science Foundation | Ref. IIS-1812822; National Science Foundation | Ref. IIS-210683
Genome signature-based dissection of human gut metagenomes to extract subliminal viral sequences
Bacterial viruses (bacteriophages) have a key role in shaping the development and functional outputs of host microbiomes. Although metagenomic approaches have greatly expanded our understanding of the prokaryotic virosphere, additional tools are required for the phage-oriented dissection of metagenomic data sets, and host-range affiliation of recovered sequences. Here we demonstrate the application of a genome signature-based approach to interrogate conventional whole-community metagenomes and access subliminal, phylogenetically targeted, phage sequences present within. We describe a portion of the biological dark matter extant in the human gut virome, and bring to light a population of potentially gut-specific Bacteroidales-like phage, poorly represented in existing virus-like particle-derived viral metagenomes. These predominantly temperate phage were shown to encode functions of direct relevance to human health in the form of antibiotic resistance genes, and provided evidence for the existence of putative ‘viral-enterotypes’ among this fraction of the human gut virome.
The History of Chromosomal Instability in Genome-Doubled Tumors
Tumors frequently display high chromosomal instability and contain multiple copies of genomic regions. Here, we describe Gain Route Identification and Timing In Cancer (GRITIC), a generic method for timing genomic gains leading to complex copy number states, using single-sample bulk whole-genome sequencing data. By applying GRITIC to 6,091 tumors, we found that non-parsimonious evolution is frequent in the formation of complex copy number states in genome-doubled tumors. We measured chromosomal instability before and after genome duplication in human tumors and found that late genome doubling was followed by an increase in the rate of copy number gain. Copy number gains often accumulate as punctuated bursts, commonly after genome doubling. We infer that genome duplications typically affect the landscape of copy number losses, while only minimally impacting copy number gains. In summary, GRITIC is a novel copy number gain timing framework that permits the analysis of copy number evolution in chromosomally unstable tumors. Significance: Complex genomic gains are associated with whole-genome duplications, which are frequent across tumors, span a large fraction of their genomes, and are linked to poorer outcomes. GRITIC infers when these gains occur during tumor development, which will help to identify the genetic events that drive tumor evolution. See related commentary by Taylor, p. 1766