Correlation-Compressed Direct Coupling Analysis
Learning Ising or Potts models from data has become an important topic in
statistical physics and computational biology, with applications to predictions
of structural contacts in proteins and other areas of biological data analysis.
The corresponding inference problems are challenging since the normalization
constant (partition function) of the Ising/Potts distributions cannot be
computed efficiently on large instances. Different ways to address this issue
have hence given rise to a substantial methodological literature. In this paper
we investigate how these methods could be used on much larger datasets than
studied previously. We focus on a central aspect, that in practice these
inference problems are almost always severely under-sampled, and the
operational result is almost always a small set of leading (largest)
predictions. We therefore explore an approach where the data is pre-filtered
based on empirical correlations, which can be computed directly even for very
large problems. Inference is only used on the much smaller instance in a
subsequent step of the analysis. We show that in several relevant model classes
such a combined approach gives results of almost the same quality as the
computationally much more demanding inference on the whole dataset. We also
show that results on whole-genome epistatic couplings that were obtained in a
recent computation-intensive study can be retrieved by the new approach. The
method of this paper hence opens up the possibility to learn parameters
describing pair-wise dependencies in whole genomes in a computationally
feasible and expedient manner.
Comment: 15 pages, including 11 figures
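The pre-filtering step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `prefilter_sites`, the ±1 encoding, the scoring rule (largest absolute connected correlation per site) and the toy data are all assumptions made for illustration. The idea shown is only that cheap empirical correlations select a small subset of sites, and a more expensive inference method would then run on the reduced data.

```python
from itertools import combinations


def prefilter_sites(samples, keep):
    """Return the indices of the `keep` sites with the strongest correlations.

    `samples` is a list of equally long lists of +1/-1 values (hypothetical
    Ising-like encoding). Each site is scored by the largest absolute
    connected correlation <s_i s_j> - <s_i><s_j> it has with any other site.
    """
    n_samples = len(samples)
    n_sites = len(samples[0])
    means = [sum(s[i] for s in samples) / n_samples for i in range(n_sites)]
    score = [0.0] * n_sites
    for i, j in combinations(range(n_sites), 2):
        pair_mean = sum(s[i] * s[j] for s in samples) / n_samples
        c_ij = pair_mean - means[i] * means[j]
        score[i] = max(score[i], abs(c_ij))
        score[j] = max(score[j], abs(c_ij))
    ranked = sorted(range(n_sites), key=lambda i: score[i], reverse=True)
    return sorted(ranked[:keep])


# Toy data (assumed): sites 0 and 1 always agree, sites 2 and 3 are unrelated.
samples = [
    [1, 1, 1, -1],
    [-1, -1, 1, 1],
    [1, 1, -1, 1],
    [-1, -1, -1, -1],
]
selected = prefilter_sites(samples, keep=2)
# Reduced data set on which a costlier inference method would be run.
reduced = [[s[i] for i in selected] for s in samples]
```

In this toy case only the two perfectly correlated sites survive the filter; the choice of score and of `keep` would need to be tuned for real data.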
Hidden Markov Models for Gene Sequence Classification: Classifying the VSG genes in the Trypanosoma brucei Genome
The article presents an application of Hidden Markov Models (HMMs) for
pattern recognition on genome sequences. We apply HMMs to identify genes
encoding the Variant Surface Glycoprotein (VSG) in the genomes of Trypanosoma
brucei (T. brucei) and other African trypanosomes. These parasitic protozoa are
the causative agents of sleeping sickness and several diseases in domestic and wild
animals. These parasites have a peculiar strategy to evade the host's immune
system that consists in periodically changing their predominant cellular
surface protein (VSG). The motivation for using pattern recognition methods to
identify these genes, instead of traditional homology-based ones, is that the
levels of sequence identity (amino acid and DNA sequence) amongst these genes
are often below what is considered reliable for homology-based methods. Among pattern
recognition approaches, HMMs are particularly suitable to tackle this problem
because they can handle more naturally the determination of gene edges. We
evaluate the performance of the model using different numbers of states in the
Markov model, as well as several performance metrics. The model is applied
using public genomic data. Our empirical results show that the VSG genes in T.
brucei can be reliably identified (high sensitivity and low rate of false
positives) using HMMs.
Comment: Accepted article in July, 2015 in Pattern Analysis and Applications,
Springer. The article contains 23 pages, 4 figures, 8 tables and 51
references
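To illustrate how an HMM can mark gene edges, as the abstract above mentions, here is a minimal Viterbi decoding sketch. It is not the model from the article: the two states (`gene`, `other`), the nucleotide emission probabilities, the transition probabilities and the test sequence are all invented for illustration. The sketch only shows how the most likely state path assigns a start and an end to a region.

```python
from math import log


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state path for `obs` (log-space Viterbi)."""
    score = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor state for being in state s at position t.
            prev = max(states, key=lambda r: score[t - 1][r] + log(trans_p[r][s]))
            score[t][s] = (score[t - 1][prev] + log(trans_p[prev][s])
                           + log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy parameters: "gene" regions are G/C-rich, "other" A/T-rich.
states = ["gene", "other"]
start_p = {"gene": 0.5, "other": 0.5}
trans_p = {
    "gene": {"gene": 0.9, "other": 0.1},
    "other": {"gene": 0.1, "other": 0.9},
}
emit_p = {
    "gene": {"A": 0.05, "T": 0.05, "G": 0.45, "C": 0.45},
    "other": {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05},
}
path = viterbi("AAAAGGGGGAAAA", states, start_p, trans_p, emit_p)
# Positions where the decoded state changes mark the predicted edges.
edges = [t for t in range(1, len(path)) if path[t] != path[t - 1]]
```

With these assumed parameters the decoded path switches state at positions 4 and 9, which bracket the G-rich stretch. A realistic gene model would use more states and parameters estimated from annotated sequences.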
Developing and applying heterogeneous phylogenetic models with XRate
Modeling sequence evolution on phylogenetic trees is a useful technique in
computational biology. Especially powerful are models which take account of the
heterogeneous nature of sequence evolution according to the "grammar" of the
encoded gene features. However, beyond a modest level of model complexity,
manual coding of models becomes prohibitively labor-intensive. We demonstrate,
via a set of case studies, the new built-in model-prototyping capabilities of
XRate (macros and Scheme extensions). These features allow rapid implementation
of phylogenetic models which would have previously been far more
labor-intensive. XRate's new capabilities for lineage-specific models,
ancestral sequence reconstruction, and improved annotation output are also
discussed. XRate's flexible model-specification capabilities and computational
efficiency make it well-suited to developing and prototyping phylogenetic
grammar models. XRate is available as part of the DART software package:
http://biowiki.org/DART .
Comment: 34 pages, 3 figures, glossary of XRate model terminology
Covariance models for RNA structure prediction
Many non-coding RNAs are known to play a role in the cell that is directly linked to their structure. Structure prediction based on sequence alone is, however, a challenging task. On the other hand, thanks to the low cost of sequencing technologies, a very large number of homologous sequences are becoming available for many RNA families. In the protein community, the idea of exploiting the covariance of mutations within a family to predict protein structure using the direct-coupling-analysis (DCA) method has emerged in the last decade. The application of DCA to RNA systems has been limited so far. Here we assess the DCA method on 17 riboswitch families, comparing it with the commonly used mutual information analysis. We also compare different flavors of DCA, including mean-field, pseudo-likelihood, and a proposed stochastic procedure (Boltzmann learning) for solving the DCA inverse problem exactly. Boltzmann learning outperforms the other methods in predicting contacts observed in high-resolution crystal structures. In order to enhance the prediction of both RNA secondary and tertiary contacts, we discuss the possibility of including a number of informed priors in the estimation of the couplings for the DCA statistical model. We observe a systematic improvement of the DCA performance by embedding in the prior distribution the pairing probability matrices calculated using secondary-structure prediction algorithms.
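The mutual information analysis used as the baseline in the abstract above can be sketched as follows. This is a generic textbook computation, not the authors' code: the function name `mutual_information` and the four-sequence toy alignment are assumptions, and no pseudocounts, gap handling or sequence reweighting are included.

```python
from collections import Counter
from math import log


def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p_ab * log(p_ab / (p_a * p_b)), written with raw counts.
        mi += (c / n) * log(c * n / (count_a[a] * count_b[b]))
    return mi


# Hypothetical toy alignment: columns 0 and 2 covary like a base pair
# (G-C, C-G, A-U, U-A), column 1 is fully conserved.
alignment = ["GAC", "CAG", "AAU", "UAA"]
columns = list(zip(*alignment))
mi_paired = mutual_information(columns[0], columns[2])
mi_conserved = mutual_information(columns[0], columns[1])
```

The covarying pair of columns gets the highest possible score for four equally frequent symbols (log 4 nats), while the pair involving the conserved column scores zero. Methods such as DCA aim to go beyond this pairwise score by separating direct from indirect correlations.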