
    The EM Algorithm and the Rise of Computational Biology

    In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma" of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    The inference of gene trees with species trees

    Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can co-exist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. In this article we review the various models that have been used to describe the relationship between gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none currently considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution. Comment: Review article in relation to the "Mathematical and Computational Evolutionary Biology" conference, Montpellier, 201

    Antigenic diversity is generated by distinct evolutionary mechanisms in African trypanosome species

    Antigenic variation enables pathogens to avoid the host immune response by continual switching of surface proteins. The protozoan blood parasite Trypanosoma brucei causes human African trypanosomiasis ("sleeping sickness") across sub-Saharan Africa and is a model system for antigenic variation, surviving by periodically replacing a monolayer of variant surface glycoproteins (VSG) that covers its cell surface. We compared the genome of Trypanosoma brucei with two closely related parasites, Trypanosoma congolense and Trypanosoma vivax, to reveal how the variant antigen repertoire has evolved and how it might affect contemporary antigenic diversity. We reconstruct VSG diversification showing that Trypanosoma congolense uses variant antigens derived from multiple ancestral VSG lineages, whereas in Trypanosoma brucei VSG have recent origins, and ancestral gene lineages have been repeatedly co-opted to novel functions. These historical differences are reflected in fundamental differences between species in the scale and mechanism of recombination. Using phylogenetic incompatibility as a metric for genetic exchange, we show that the frequency of recombination is comparable between Trypanosoma congolense and Trypanosoma brucei but is much lower in Trypanosoma vivax. Furthermore, in showing that the C-terminal domain of Trypanosoma brucei VSG plays a crucial role in facilitating exchange, we reveal substantial species differences in the mechanism of VSG diversification. Our results demonstrate how past VSG evolution indirectly determines the ability of contemporary parasites to generate novel variant antigens through recombination and suggest that the current model for antigenic variation in Trypanosoma brucei is only one means by which these parasites maintain chronic infections.

    Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments

    Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy; therefore, it would be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation.
    We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.

    There and Back Again: Exploring the Roles of Models and Natural History in Macroevolution

    Ecological diversity in nature is tremendously complex. Evolutionary biologists and ecologists have sought to understand this complexity using foundational concepts like ecological niches, guilds, and adaptive zones. The merger of these concepts with stochastic models and phylogenies helped create the field of phylogenetic comparative methods, which has made fundamental contributions to our understanding of the evolutionary history of life’s rich ecological variety and the role ecology plays in the diversification of species and phenotypes and the assembly of species-rich communities. Despite this progress, however, phylogenetic comparative methods have been slow to expand their data repertoire. There is a general rarity of comparative datasets that include primary natural history observations of organisms in nature and of comparative methods to work with such data. The main contribution of this dissertation is to address this shortfall. I do so in three main ways. First, in earlier chapters I study some simple stochastic models of ecological character state change, revealing unappreciated subtleties that complicate our ability to interpret their results in terms of historical events. Second, building off lessons learned from these early chapters, I develop a new method that uses primary natural history observations to jointly infer the phylogenetic distribution of ecological niche states for individual species and their unsampled ancestors. Third, to demonstrate the flexibility of the new method, I conduct an empirical analysis on the diversification of snake feeding habits using a new comprehensive database of observations of prey acquisition by snakes that I compiled. 
    Taken together, the research in this dissertation demonstrates how fundamental observations of organisms in nature can be used to make quantitative inferences about the macroevolution of complex ecological traits and suggests new ways of integrating natural history data into comparative biology. PhD, Ecology and Evolutionary Biology, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163161/1/mgru_1.pd

    Kernel methods in genomics and computational biology

    Support vector machines and kernel methods are increasingly popular in genomics and computational biology, due to their good performance in real-world applications and strong modularity that makes them suitable for a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work in high dimension, to process non-vectorial data, and the natural framework they provide to integrate heterogeneous data are particularly relevant to various problems arising in computational biology. In this chapter we survey some of the most prominent applications published so far, highlighting the particular developments in kernel methods triggered by problems in biology, and mention a few promising research directions likely to expand in the future.

    Spatial Guilds in the Serengeti Food Web Revealed by a Bayesian Group Model

    Food webs, networks of feeding relationships among organisms, provide fundamental insights into mechanisms that determine ecosystem stability and persistence. Despite long-standing interest in the compartmental structure of food webs, past network analyses of food webs have been constrained by a standard definition of compartments, or modules, that requires many links within compartments and few links between them. Empirical analyses have been further limited by low-resolution data for primary producers. In this paper, we present a Bayesian computational method for identifying group structure in food webs using a flexible definition of a group that can describe both functional roles and standard compartments. The Serengeti ecosystem provides an opportunity to examine structure in a newly compiled food web that includes species-level resolution among plants, allowing us to address whether groups in the food web correspond to tightly connected compartments or functional groups, and whether network structure reflects spatial or trophic organization, or a combination of the two. We have compiled the major mammalian and plant components of the Serengeti food web from published literature, and we infer its group structure using our method. We find that network structure corresponds to spatially distinct plant groups coupled at higher trophic levels by groups of herbivores, which are in turn coupled by carnivore groups. Thus the group structure of the Serengeti web represents a mixture of trophic guild structure and spatial patterns, in contrast to the standard compartments typically identified in ecological networks. From data consisting only of nodes and links, the group structure that emerges supports recent ideas on spatial coupling and energy channels in ecosystems that have been proposed as important for persistence. Comment: 28 pages, 6 figures (+ 3 supporting), 2 tables (+ 4 supporting)