99 research outputs found

    Bayesian phylogenetic modelling of lateral gene transfers

    PhD Thesis. Phylogenetic trees represent the evolutionary relationships between a set of species. Inferring these trees from data can be particularly challenging, since genetic material can be transferred not only from parents to their offspring but also between organisms via lateral gene transfers (LGTs). Thus, the presence of LGTs means that genes in a genome can each have different evolutionary histories, represented by different gene trees. A few statistical approaches have been introduced to explore non-vertical evolution through collections of Markov-dependent gene trees. In 2005 Suchard described a Bayesian hierarchical model for joint inference of gene trees and an underlying species tree, where a layer in the model linked gene trees to the species tree via a sequence of unknown lateral gene transfers. In his model LGT was modelled via a random walk in the tree space derived from the subtree prune and regraft (SPR) operator on unrooted trees. However, using SPR moves to represent LGT in an unrooted tree is problematic: the transfer of DNA between two organisms implies that both exist at the same time, and unconstrained SPR moves can therefore allow unrealistic LGTs. This thesis describes a related hierarchical Bayesian phylogenetic model for reconstructing phylogenetic trees which imposes a temporal constraint on LGTs, namely that they can only occur between species which exist concurrently. This is achieved by taking into account possible time orderings of divergence events in trees, without explicitly modelling divergence times. An extended version of the SPR operator is introduced as a more adequate mechanism to represent the effect of LGT on a tree. The extended SPR operation respects the time ordering. It additionally differs from regular SPR in that it maintains a 1-to-1 correspondence between points on the species tree and points on each gene tree. 
Each point on a gene tree represents the existence of a population containing that gene at some point in time. Hierarchical phylogenetic models were used in the reconstruction of each gene tree from its corresponding gene alignment, enabling the pooling of information across genes. In an extension of Suchard's approach, we also assume variation in the rate of evolution between different sites. The species tree is assumed to be fixed. A Markov Chain Monte Carlo (MCMC) algorithm was developed to fit the model in a Bayesian framework. A novel MCMC proposal mechanism for jointly proposing the gene tree topology and branch lengths, LGT distance and LGT history has been developed, as well as a novel graphical tool to represent LGT history, the LGT Biplot. Our model was applied to simulated and experimental datasets. More specifically, we analysed the presence of LGT/reassortment in the evolution of the 2009 Swine-Origin Influenza Type A virus. Future improvements of our model and algorithm should include joint inference of the species tree, improved computational efficiency of the MCMC algorithm, and better consideration of other factors that can cause discordance between gene trees and species trees, such as gene loss.

    Structure and evolution of the mouse pregnancy-specific glycoprotein (Psg) gene locus

    BACKGROUND: The pregnancy-specific glycoprotein (Psg) genes encode proteins of unknown function, and are members of the carcinoembryonic antigen (Cea) gene family, which is a member of the immunoglobulin gene (Ig) superfamily. In rodents and primates, but not in artiodactyls (even-toed ungulates / hoofed mammals), there have been independent expansions of the Psg gene family, with all members expressed exclusively in placental trophoblast cells. For the mouse Psg genes, we sought to determine the genomic organisation of the locus, the expression profiles of the various family members, and the evolution of exon structure, to attempt to reconstruct the evolutionary history of this locus, and to determine whether expansion of the gene family has been driven by selection for increased gene dosage, or diversification of function. RESULTS: We collated the mouse Psg gene sequences currently in the public genome and expressed-sequence tag (EST) databases and used systematic BLAST searches to generate complete sequences for all known mouse Psg genes. We identified a novel family member, Psg31, which is similar to Psg30 but, uniquely amongst mouse Psg genes, has a duplicated N1 domain. We also identified a novel splice variant of Psg16 (bCEA). We show that Psg24 and Psg30 / Psg31 have independently undergone expansion of N-domain number. By mapping BAC, YAC and cosmid clones we described two clusters of Psg genes, which we linked and oriented using fluorescent in situ hybridisation (FISH). Comparison of our Psg locus map with the public mouse genome database indicates good agreement in overall structure and further elucidates gene order. Expression levels of Psg genes in placentas of different developmental stages revealed dramatic differences in the developmental expression profile of individual family members. 
CONCLUSION: We have combined existing information with new information concerning the evolution of mouse Psg exon organization, the mouse Psg genomic locus structure, and the expression patterns of individual Psg genes. This information will facilitate functional studies of this complex gene family.

    Mathematical Problems in Molecular Evolution and Next Generation Sequencing

    The focus of this work is the development of new mathematical methods for problems in phylogenetic tree inference. In the first part we solve several problems related to so-called partitioned alignments. In the second part we demonstrate how to calculate all identical subtrees of a given labeled tree. We use this to implement an efficient method for avoiding redundant likelihood operations during phylogenetic tree inference.

    UNSUPERVISED LEARNING IN PHYLOGENOMIC ANALYSIS OVER THE SPACE OF PHYLOGENETIC TREES

    A phylogenetic tree is a tree that represents the evolutionary history of species or other entities. Phylogenomics is a new field at the intersection of phylogenetics and genomics, and it is well known that statistical learning methods are needed to handle and analyze the large amounts of data that new technologies can generate relatively cheaply. Based on existing Markov models, we introduce a new method, CURatio, to identify outliers in a given gene data set. This method, intrinsically unsupervised, can find outliers among thousands of genes or more. This ability to analyze large numbers of genes (even with missing information) makes it unique among parametric methods. At the same time, the statistical analysis of the high-dimensional space of phylogenetic trees continues to be explored, and many tree metrics have been proposed for statistical methodology. The tropical metric is one of them. We implement an MCMC sampling method to estimate the principal components in a tree space with the tropical metric, achieving dimension reduction and visualizing the result in a 2-D tropical triangle.

    Transcriptome Sequences Resolve Deep Relationships of the Grape Family

    Previous phylogenetic studies of the grape family (Vitaceae) yielded poorly resolved deep relationships, thus impeding our understanding of the evolution of the family. Next-generation sequencing now offers easy, quick and cost-effective access to protein-coding sequences. To improve upon earlier work, we extracted 417 orthologous single-copy nuclear genes from the transcriptomes of 15 species of the Vitaceae, covering its phylogenetic diversity. The resulting transcriptome phylogeny provides robust support for the deep relationships, showing the phylogenetic utility of transcriptome data for plants over a time scale at least since the mid-Cretaceous. The pros and cons of transcriptome data for phylogenetic inference in plants are also evaluated.

    Phylogenomics and Historical Biogeography of the Gooseneck Barnacle Pollicipes elegans

    This dissertation explores the systematics, biogeography, and genomics of the gooseneck barnacle Pollicipes elegans, a marine crustacean of the tropical Eastern Pacific. In Chapter 1, I provide a broad framework for my research by introducing the long-standing debate over the mechanisms behind the latitudinal gradient in species diversity, which provided the initial motivation for using Pollicipes elegans as a model system to study the mechanisms leading to genetic differentiation and speciation in tropical regions. In Chapter 2, I examine the genetic structure, infer patterns of connectivity across the warm tropical waters of the eastern Pacific, and reconstruct the biogeographic history of P. elegans using a statistical phylogeographic framework. Using mitochondrial DNA sequences, I found strong evidence supporting an out-of-the-tropics model of speciation in P. elegans, with a clear phylogeographical break between populations in Mexico and all populations to the south. In Chapter 3, I added sequence data from six nuclear genes to the analysis of genetic structure and found strong evidence for two cryptic species within the nominal P. elegans that likely originated by allopatric speciation across the Central American Gap. I estimated the divergence times between peripheral and central populations, and the effective population sizes of these populations, and again found support for an out-of-the-tropics model of diversification. In Chapter 4, I used RNA sequencing of individuals of P. elegans from each cryptic species to assemble the first transcriptome for this taxon. Data mining of the transcriptome allowed me to identify microsatellite and single nucleotide polymorphism (SNP) markers to be used in future research. 
Analyses using the SNP dataset revealed evidence for 11 genes under natural selection between the two cryptic species; the genes that were identified may be influenced by spatial variation in sea surface temperature in the tropical eastern Pacific. Lastly, in Chapter 5, I provide guidelines for future studies that should be pursued to help elucidate patterns, mechanisms, and consequences of latitudinal gradients of temperature in the process of allopatric speciation. The phylogeographic and demographic reconstructions for P. elegans in this dissertation provide evidence of the role that temperature may play in population differentiation associated with speciation. The transcriptome analyses provided a large set of genetic markers and a list of candidate genes under selection, a crucial first step in the description of the genetic basis of local thermal adaptation in tropical regions. The information generated in this dissertation provides a novel empirical system that can help elucidate the evolution of tropical diversity and can potentially be used to predict the future impacts of climate change on tropical species.

    Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments

    Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy; it would therefore be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, the sequences of taxa exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. 
We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and one of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.

    Phylogenetic Systematics of the Ulvophyceae (Chlorophyta) Based on Cladistic Analyses of Ribosomal RNA Genes and Morphology.

    Cladistic analyses of non-molecular and nuclear-encoded rRNA sequence data provided the basis for hypotheses of relationships for the green algal class Ulvophyceae. Non-molecular data rooted with Chara support hypotheses which group the Chlorophyceae and Pleurastrophyceae with ulotrichalean and ulvalean Ulvophyceae. Analyses of rRNA sequence data group the siphonous and siphonocladous Ulvophyceae (i.e. Caulerpales, Siphonocladales, and Dasycladales) with the Chlorophyceae and Pleurastrophyceae. Although hypotheses supported by these independent data sets are incongruent, they suggest that the Ulvophyceae is not monophyletic. Based on rRNA sequences, pleurastrophycean taxa, which, like the Ulvophyceae, possess a counter-clockwise arrangement of flagellar basal bodies, are more closely related to the Chlorophyceae (which possess clockwise basal bodies) than to the Ulvophyceae. Thus, counter-clockwise basal body orientation does not diagnose a monophyletic group. Parsimony analyses to assess the strength of these hypotheses, including bootstrap, decay index, and character distributions, suggest that basal divergences exhibit little character support and lead to ambiguous rooting of the phylogeny. Data randomization tests, however, clearly suggest that there is considerable signal in the data. Examination of ordinal relationships within the siphonous and siphonocladous Ulvophyceae revealed that the Dasycladales is the sister group to the Caulerpales with the Siphonocladales representing a basal lineage. Although inconsistent with hypotheses based on ultrastructural features, this hypothesis is consistent with recently reported fossil evidence that extended the minimum age of the siphonocladalean lineage to ca. 700 million years (concurrent with the oldest dasycladalean fossils). 
Relative rates of evolutionary divergence between sister taxa (inferred by comparing the number of nucleotide changes along internodes leading to terminal taxa) are higher in the Caulerpales and Dasycladales clade than in the Siphonocladales. Congruence of phylogenetic hypotheses with biogeographic distributions was also explored. Two lineages are identified in the Caulerpales: one with genera of strictly tropical distribution and another with more widespread taxa. The sister group, the Dasycladales, is also restricted to the tropics, suggesting that this is the primitive distribution pattern. The Siphonocladales exhibit a similar pattern: a derived cosmopolitan clade and basal tropical genera. Thus, these data support the hypothesis that these algae originated in ancient tropical oceans.