Evolutionary models for insertions and deletions in a probabilistic modeling framework
BACKGROUND: Probabilistic models for sequence comparison (such as hidden Markov models and pair hidden Markov models for proteins and mRNAs, or their context-free grammar counterparts for structural RNAs) often assume a fixed degree of divergence. Ideally we would like these models to be conditional on evolutionary divergence time. Probabilistic models of substitution events are well established, but there has not been a completely satisfactory theoretical framework for modeling insertion and deletion events.
RESULTS: I have developed a method for extending standard Markov substitution models to include gap characters, and another method for the evolution of state transition probabilities in a probabilistic model. These methods use instantaneous rate matrices in a way that is more general than those used for substitution processes, and are sufficient to provide time-dependent models for standard linear and affine gap penalties, respectively. Given a probabilistic model, we can make all of its emission probabilities (including gap characters) and all its transition probabilities conditional on a chosen divergence time. To do this, we only need to know the parameters of the model at one particular divergence time instance, as well as the parameters of the model at the two extremes of zero and infinite divergence. I have implemented these methods in a new generation of the RNA genefinder QRNA (eQRNA).
CONCLUSION: These methods can be applied to incorporate evolutionary models of insertions and deletions into any hidden Markov model or stochastic context-free grammar, in a pair or profile form, for sequence modeling.
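To make "conditional on evolutionary divergence time" concrete, here is a minimal sketch of a time-dependent substitution model. It uses the textbook Jukes-Cantor (JC69) nucleotide model, which is an assumption chosen for illustration and is not the method of the abstract above; the function name `jc69_transition_probs` and the rate parameter `mu` are hypothetical. It shows the two extremes the abstract mentions: at zero divergence nothing has changed, and at infinite divergence the probabilities reach the uniform stationary distribution.

```python
import math


def jc69_transition_probs(t, mu=1.0):
    """Jukes-Cantor substitution probabilities at divergence time t.

    Illustrative assumption: mu is the expected number of substitutions
    per site per unit time. Returns (p_same, p_diff), where p_same is the
    probability that a nucleotide is unchanged and p_diff is the
    probability of observing one specific different nucleotide.
    """
    if t < 0:
        raise ValueError("divergence time must be non-negative")
    decay = math.exp(-4.0 * mu * t / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return p_same, p_diff


if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0, 10.0):
        p_same, p_diff = jc69_transition_probs(t)
        print(f"t={t:5.1f}  p_same={p_same:.4f}  p_diff={p_diff:.4f}")
```

Each row of the implied 4x4 matrix sums to one (`p_same + 3 * p_diff`). The sketch covers substitutions only; it does not model gap characters or state transition probabilities.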
An Alternative Model of Amino Acid Replacement
The observed correlations between pairs of homologous protein sequences are
typically explained in terms of a Markovian dynamic of amino acid substitution.
This model assumes that every location on the protein sequence has the same
background distribution of amino acids, an assumption that is incompatible with
the observed heterogeneity of protein amino acid profiles and with the success
of profile multiple sequence alignment. We propose an alternative model of
amino acid replacement during protein evolution based upon the assumption that
the variation of the amino acid background distribution from one residue to the
next is sufficient to explain the observed sequence correlations of homologs.
The resulting dynamical model of independent replacements drawn from
heterogeneous backgrounds is simple and consistent, and provides a unified
homology match score for sequence-sequence, sequence-profile and
profile-profile alignment.
Comment: Minor improvements. Added figure and reference
The identifiability of tree topology for phylogenetic models, including covarion and mixture models
For a model of molecular evolution to be useful for phylogenetic inference,
the topology of evolutionary trees must be identifiable. That is, from a joint
distribution the model predicts, it must be possible to recover the tree
parameter. We establish tree identifiability for a number of phylogenetic
models, including a covarion model and a variety of mixture models with a
limited number of classes. The proof is based on the introduction of a more
general model, allowing more states at internal nodes of the tree than at
leaves, and the study of the algebraic variety formed by the joint
distributions to which it gives rise. Tree identifiability is first established
for this general model through the use of certain phylogenetic invariants.
Comment: 20 pages, 1 figure
Developing and applying heterogeneous phylogenetic models with XRate
Modeling sequence evolution on phylogenetic trees is a useful technique in
computational biology. Especially powerful are models which take account of the
heterogeneous nature of sequence evolution according to the "grammar" of the
encoded gene features. However, beyond a modest level of model complexity,
manual coding of models becomes prohibitively labor-intensive. We demonstrate,
via a set of case studies, the new built-in model-prototyping capabilities of
XRate (macros and Scheme extensions). These features allow rapid implementation
of phylogenetic models which would have previously been far more
labor-intensive. XRate's new capabilities for lineage-specific models,
ancestral sequence reconstruction, and improved annotation output are also
discussed. XRate's flexible model-specification capabilities and computational
efficiency make it well-suited to developing and prototyping phylogenetic
grammar models. XRate is available as part of the DART software package:
http://biowiki.org/DART .
Comment: 34 pages, 3 figures, glossary of XRate model terminology
Phylogenetic Algebraic Geometry
Phylogenetic algebraic geometry is concerned with certain complex projective
algebraic varieties derived from finite trees. Real positive points on these
varieties represent probabilistic models of evolution. For small trees, we
recover classical geometric objects, such as toric and determinantal varieties
and their secant varieties, but larger trees lead to new and largely unexplored
territory. This paper gives a self-contained introduction to this subject and
offers numerous open problems for algebraic geometers.
Comment: 15 pages, 7 figures
A Mutagenetic Tree Hidden Markov Model for Longitudinal Clonal HIV Sequence Data
RNA viruses provide prominent examples of measurably evolving populations. In
HIV infection, the development of drug resistance is of particular interest,
because precise predictions of the outcome of this evolutionary process are a
prerequisite for the rational design of antiretroviral treatment protocols. We
present a mutagenetic tree hidden Markov model for the analysis of longitudinal
clonal sequence data. Using HIV mutation data from clinical trials, we estimate
the order and rate of occurrence of seven amino acid changes that are
associated with resistance to the reverse transcriptase inhibitor efavirenz.
Comment: 20 pages, 6 figures
Binary hidden Markov models and varieties
The technological applications of hidden Markov models have been extremely
diverse and successful, including natural language processing, gesture
recognition, gene sequencing, and Kalman filtering of physical measurements.
HMMs are highly non-linear statistical models, and just as linear models are
amenable to linear algebraic techniques, non-linear models are amenable to
commutative algebra and algebraic geometry.
This paper closely examines HMMs in which all the hidden random variables are
binary. Its main contributions are (1) a birational parametrization for every
such HMM, with an explicit inverse for recovering the hidden parameters in
terms of observables, (2) a semialgebraic model membership test for every such
HMM, and (3) minimal defining equations for the 4-node fully binary model,
comprising 21 quadrics and 29 cubics, which were computed using Gröbner bases
in the cumulant coordinates of Sturmfels and Zwiernik. The new model parameters
in (1) are rationally identifiable in the sense of Sullivant, Garcia-Puente,
and Spielvogel, and each model's Zariski closure is therefore a rational
projective variety of dimension 5. Gröbner basis computations for the model and
its graph are found to be considerably faster using these parameters. In the
case of two hidden states, item (2) supersedes a previous algorithm of
Schönhuth which is only generically defined, and the defining equations (3)
yield new invariants for HMMs of all lengths. Such invariants have
been used successfully in model selection problems in phylogenetics, and one
can hope for similar applications in the case of HMMs.