2,774 research outputs found
The EM Algorithm and the Rise of Computational Biology
In the past decade computational biology has grown from a cottage industry
with a handful of researchers to an attractive interdisciplinary field,
catching the attention and imagination of many quantitatively-minded
scientists. Of interest to us is the key role played by the EM algorithm during
this transformation. We survey the use of the EM algorithm in a few important
computational biology problems surrounding the "central dogma" of molecular
biology: from DNA to RNA and then to proteins. Topics of this article include
sequence motif discovery, protein sequence alignment, population genetics,
evolutionary models and mRNA expression microarray data analysis.
Comment: Published in at http://dx.doi.org/10.1214/09-STS312 the Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Probabilistic methods in the analysis of protein interaction networks
The inference of gene trees with species trees
Molecular phylogeny has focused mainly on improving models for the
reconstruction of gene trees based on sequence alignments. Yet, most
phylogeneticists seek to reveal the history of species. Although the histories
of genes and species are tightly linked, they are seldom identical, because
genes duplicate, are lost or horizontally transferred, and because alleles can
co-exist in populations for periods that may span several speciation events.
Building models describing the relationship between gene and species trees can
thus improve the reconstruction of gene trees when a species tree is known, and
vice-versa. Several approaches have been proposed to solve the problem in one
direction or the other, but in general neither gene trees nor species trees are
known. Only a few studies have attempted to jointly infer gene trees and
species trees. In this article we review the various models that have been used
to describe the relationship between gene trees and species trees. These models
account for gene duplication and loss, transfer or incomplete lineage sorting.
Some of them consider several types of events together, but none exists
currently that considers the full repertoire of processes that generate gene
trees along the species tree. Simulations as well as empirical studies on
genomic data show that combining gene tree-species tree models with models of
sequence evolution improves gene tree reconstruction. In turn, these better
gene trees provide a better basis for studying genome evolution or
reconstructing ancestral chromosomes and ancestral gene sequences. We predict
that gene tree-species tree methods that can deal with genomic data sets will
be instrumental to advancing our understanding of genomic evolution.
Comment: Review article in relation to the "Mathematical and Computational
Evolutionary Biology" conference, Montpellier, 201
Antigenic diversity is generated by distinct evolutionary mechanisms in African trypanosome species
Antigenic variation enables pathogens to avoid the host immune response by continual switching of surface proteins. The protozoan blood parasite Trypanosoma brucei causes human African trypanosomiasis ("sleeping sickness") across sub-Saharan Africa and is a model system for antigenic variation, surviving by periodically replacing a monolayer of variant surface glycoproteins (VSG) that covers its cell surface. We compared the genome of Trypanosoma brucei with two closely related parasites, Trypanosoma congolense and Trypanosoma vivax, to reveal how the variant antigen repertoire has evolved and how it might affect contemporary antigenic diversity. We reconstruct VSG diversification showing that Trypanosoma congolense uses variant antigens derived from multiple ancestral VSG lineages, whereas in Trypanosoma brucei VSG have recent origins, and ancestral gene lineages have been repeatedly co-opted to novel functions. These historical differences are reflected in fundamental differences between species in the scale and mechanism of recombination. Using phylogenetic incompatibility as a metric for genetic exchange, we show that the frequency of recombination is comparable between Trypanosoma congolense and Trypanosoma brucei but is much lower in Trypanosoma vivax. Furthermore, in showing that the C-terminal domain of Trypanosoma brucei VSG plays a crucial role in facilitating exchange, we reveal substantial species differences in the mechanism of VSG diversification. Our results demonstrate how past VSG evolution indirectly determines the ability of contemporary parasites to generate novel variant antigens through recombination and suggest that the current model for antigenic variation in Trypanosoma brucei is only one means by which these parasites maintain chronic infections.
Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments
Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes
and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage
display are expensive and intrinsically noisy, so it would be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative
approach for predicting PRM-mediated protein-protein interactions from sequence data. The
model suffered from over-fitting, so Laplacian regularisation was found to be important in
achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative
model. We also propose another discriminative model which can be applied to all sequences
present in the organism at a significantly lower computational cost. This is due to its additional
assumption that the underlying binding sites tend to be similar.
It is difficult to distinguish between the binding site motifs of the PRM due to the small
number of instances of each binding site motif. However, closely related species are expected
to share similar binding sites, which would be expected to be highly conserved. We investigated
rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic
tree can represent the relationships and divergences between the taxa. However, taxa sequences
exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites,
and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. We examined
the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments:
one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried
out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.
There and Back Again: Exploring the Roles of Models and Natural History in Macroevolution
Ecological diversity in nature is tremendously complex. Evolutionary biologists and ecologists have sought to understand this complexity using foundational concepts like ecological niches, guilds, and adaptive zones. The merger of these concepts with stochastic models and phylogenies helped create the field of phylogenetic comparative methods, which has made fundamental contributions to our understanding of the evolutionary history of life’s rich ecological variety and the role ecology plays in the diversification of species and phenotypes and the assembly of species-rich communities. Despite this progress, however, phylogenetic comparative methods have been slow to expand their data repertoire. There is a general rarity of comparative datasets that include primary natural history observations of organisms in nature and of comparative methods to work with such data. The main contribution of this dissertation is to address this shortfall. I do so in three main ways. First, in earlier chapters I study some simple stochastic models of ecological character state change, revealing unappreciated subtleties that complicate our ability to interpret their results in terms of historical events. Second, building off lessons learned from these early chapters, I develop a new method that uses primary natural history observations to jointly infer the phylogenetic distribution of ecological niche states for individual species and their unsampled ancestors. Third, to demonstrate the flexibility of the new method, I conduct an empirical analysis on the diversification of snake feeding habits using a new comprehensive database of observations of prey acquisition by snakes that I compiled. 
Taken together, the research in this dissertation demonstrates how fundamental observations of organisms in nature can be used to make quantitative inferences about the macroevolution of complex ecological traits and suggests new ways of integrating natural history data into comparative biology.
PHD
Ecology and Evolutionary Biology
University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/163161/1/mgru_1.pd
Kernel methods in genomics and computational biology
Support vector machines and kernel methods are increasingly popular in
genomics and computational biology, due to their good performance in real-world
applications and strong modularity that makes them suitable to a wide range of
problems, from the classification of tumors to the automatic annotation of
proteins. Their ability to work in high dimension, to process non-vectorial
data, and the natural framework they provide to integrate heterogeneous data
are particularly relevant to various problems arising in computational biology.
In this chapter we survey some of the most prominent applications published so
far, highlighting the particular developments in kernel methods triggered by
problems in biology, and mention a few promising research directions likely to
expand in the future.
Spatial Guilds in the Serengeti Food Web Revealed by a Bayesian Group Model
Food webs, networks of feeding relationships among organisms, provide
fundamental insights into mechanisms that determine ecosystem stability and
persistence. Despite long-standing interest in the compartmental structure of
food webs, past network analyses of food webs have been constrained by a
standard definition of compartments, or modules, that requires many links
within compartments and few links between them. Empirical analyses have been
further limited by low-resolution data for primary producers. In this paper, we
present a Bayesian computational method for identifying group structure in food
webs using a flexible definition of a group that can describe both functional
roles and standard compartments. The Serengeti ecosystem provides an
opportunity to examine structure in a newly compiled food web that includes
species-level resolution among plants, allowing us to address whether groups in
the food web correspond to tightly-connected compartments or functional groups,
and whether network structure reflects spatial or trophic organization, or a
combination of the two. We have compiled the major mammalian and plant
components of the Serengeti food web from published literature, and we infer
its group structure using our method. We find that network structure
corresponds to spatially distinct plant groups coupled at higher trophic levels
by groups of herbivores, which are in turn coupled by carnivore groups. Thus
the group structure of the Serengeti web represents a mixture of trophic guild
structure and spatial patterns, in contrast to the standard compartments
typically identified in ecological networks. From data consisting only of nodes
and links, the group structure that emerges supports recent ideas on spatial
coupling and energy channels in ecosystems that have been proposed as important
for persistence.
Comment: 28 pages, 6 figures (+ 3 supporting), 2 tables (+ 4 supporting)