9 research outputs found
Using Phylogenomic Data to Untangle the Patterns and Timescale of Flowering Plant Evolution
Angiosperms are one of the most dominant groups on Earth, and have fundamentally changed global ecosystem patterns and function. Therefore, unravelling their evolutionary history is key to understanding how the world around us was formed, and how it might change in the future. In this thesis, I use genome-scale data to investigate the evolutionary patterns and timescale of angiosperms at multiple taxonomic levels, ranging from angiosperm-wide to genus-level data sets. I begin by using the largest combination of taxon and gene sampling thus far to provide a novel estimate for the timing of angiosperm origin in the Triassic period. Through a range of sensitivity analyses, I demonstrate that this estimate is robust to many important components of Bayesian molecular dating. I then explore tactics for phylogenomic dating using multiple molecular clocks. I evaluate methods for estimating the number and assignment of molecular clock models, and strategies for partitioning molecular clock models in analyses of multigene data sets. I also demonstrate the importance of critically evaluating the precision in age estimates from molecular dating analyses. Finally, I assess the utility of plastid data sets for resolving challenging phylogenetic relationships, focusing on Pimelea Banks & Sol. ex Gaertn. Through analysis of a multigene data set, sampled from many taxa, I provide an improved phylogeny for Pimelea and its close relatives. I then generate a plastome-scale data set for a representative sample of species to further refine the Pimelea phylogeny, and characterise discordant phylogenetic signals within their chloroplast genomes. The work in this thesis demonstrates the power of genome- scale data to address challenging phylogenetic questions, and the importance of critical evaluation of both methods and results. Future progress in our understanding of angiosperm evolution will depend on broader and denser taxon sampling, and the development of improved phylogenetic methods
Algorithms for Analysis of Heterogeneous Cancer and Viral Populations Using High-Throughput Sequencing Data
Next-generation sequencing (NGS) technologies experienced giant leaps in recent years. Short read samples reach millions of reads, and the number of samples has been growing enormously in the wake of the COVID-19 pandemic. This data can expose essential aspects of disease transmission and development and reveal the key to its treatment. At the same time, single-cell sequencing saw the progress of getting from dozens to tens of thousands of cells per sample. These technological advances bring new challenges for computational biology and require the development of scalable, robust methods to deal with a wide range of problems varying from epidemiology to cancer studies.
The first part of this work is focused on processing virus NGS data. It proposes algorithms that can facilitate the initial data analysis steps by filtering genetically related sequencing and the tool investigating intra-host virus diversity vital for biomedical research and epidemiology.
The second part addresses single-cell data in cancer studies. It develops evolutionary cancer models involving new quantitative parameters of cancer subclones to understand the underlying processes of cancer development better
Investigating Evolutionary History Using Phylogenomics
Reconstructing the Tree of Life is one of the principal aims of evolutionary biology. The development of molecular phylogenetics to elucidate evolutionary history has complemented palaeontology, biogeography, and archaeology in elucidating biological history. The development of molecular-clock analyses allowed evolutionary timescales to be estimated using nucleotide sequences and other products of the evolutionary process Until recently, the twin challenges of molecular dating were in obtaining sufficient data and developing robust methods. The former concern is now less important as high–throughput sequencing technology allows entire genomes to be sampled. Genome–scale data enhances statistical power, but accompanying this wealth of data is a new suite of analytical challenges. One of these key challenges is analysing these data in synthesis with the paleontological record without statistical overparameterisation. There are also aspects of the evolutionary process, such as among–lineage rate variation, that can affect the precision and accuracy of current methods. In this thesis, I first use the richest nucleotide sequence data set of insects available to estimate an authoritative insect evolutionary timescale that dates the origins and diversification of every major insect order. I then focus on molecular-clock methods by testing their performance in inferring evolutionary rates from time–structured data, common in the study of ancient DNA. I find that among–rate lineage variation and phylo–temporal clustering affect rate estimates. I also study data partitioning, a common technique used to optimise the analysis of multilocus data where independent parameters are applied across different subsets of the data. New data from the genomic revolution gifts biologists new opportunities to re-examine enduring questions about the evolutionary process. Here, I use phylogenetic tools to show that evolution leaves figurative fingerprints on genomes over millions of years
Recommended from our members
Probabilistic Reconstruction and Comparative Systems Biology of Microbial Metabolism
With the number of sequenced microbial species soon to be in the tens of thousands, we are in a unique position to investigate microbial function, ecology, and evolution on a large scale. In this dissertation I first describe the use of hundreds of in silico models of bacterial metabolic networks to study the long-term the evolution of growth and gene-essentiality phenotypes.
The results show that, over billions of years of evolution, the conservation of bacterial phenotypic properties drops by a similar fraction per unit time following an exponential decay. The analysis provides a framework to generate and test hypotheses related to the phenotypic evolution of different microbial groups and for comparative analyses based on phenotypic properties of species. Mapping of genome sequences to phenotypic predictions -such as used in the analysis just described- critically relies on accurate functional annotations.
In this context, I next describe GLOBUS, a probabilistic method for genome-wide biochemical annotations. GLOBUS uses Gibbs sampling to calculate probabilities for each possible assignment of genes to metabolic functions based on sequence information and both local and global genomic context data. Several important functional predictions made by GLOBUS were experimentally validated in Bacillus subtilis and hundreds more were obtained across other species. Complementary to the automated annotation method, I also describe the manual reconstruction and constraints-based analysis of the metabolic network of the malaria parasite Plasmodium falciparum. After careful reconciliation of the model with available biochemical and phenotypic data, the high-quality reconstruction allowed the prediction and in vivo validation of a novel potential antimalarial target. The model was also used to contextualize different types of genome-scale data such as gene expression and metabolomics measurements.
Finally, I present two projects related to population genetics aspects of sequence and genome evolution. The first project addresses the question of why highly expressed proteins evolve slowly, showing that, at least for Escherichia coli, this is more likely to be a consequence of selection for translational efficiency than selection to avoid misfolded protein toxicity. The second project investigates genetic robustness mediated by gene duplicates in the context of large natural microbial populations. The analysis shows that, under these conditions, the ability of duplicated yeast genes to effectively compensate for the loss of their paralogs is not a monotonic function of their sequence divergence
Universal pacemaker of genome evolution.
A fundamental observation of comparative genomics is that the distribution of evolution rates across the complete sets of orthologous genes in pairs of related genomes remains virtually unchanged throughout the evolution of life, from bacteria to mammals. The most straightforward explanation for the conservation of this distribution appears to be that the relative evolution rates of all genes remain nearly constant, or in other words, that evolutionary rates of different genes are strongly correlated within each evolving genome. This correlation could be explained by a model that we denoted Universal PaceMaker (UPM) of genome evolution. The UPM model posits that the rate of evolution changes synchronously across genome-wide sets of genes in all evolving lineages. Alternatively, however, the correlation between the evolutionary rates of genes could be a simple consequence of molecular clock (MC). We sought to differentiate between the MC and UPM models by fitting thousands of phylogenetic trees for bacterial and archaeal genes to supertrees that reflect the dominant trend of vertical descent in the evolution of archaea and bacteria and that were constrained according to the two models. The goodness of fit for the UPM model was better than the fit for the MC model, with overwhelming statistical significance, although similarly to the MC, the UPM is strongly overdispersed. Thus, the results of this analysis reveal a universal, genome-wide pacemaker of evolution that could have been in operation throughout the history of life