Protein structural class prediction based on an improved statistical strategy
Background: Protein structural class (PSC) is among the most basic yet important classifications of protein structure, and techniques for predicting it have been developed for decades. Two popular indices, proposed in many works, are the amino-acid-frequency (AAF) index and the amino-acid-arrangement (AAA) index with long-term correlation (LTC). Both have their pros and cons: the AAF index focuses on statistical analysis, while the AAA-LTC index emphasizes long-term biological significance. Unfortunately, the datasets used in previous work were not very reliable, because they contained a small number of sequences with high sequence similarity. Results: By modifying a statistical strategy, we propose a new index method that combines probability and information theory with a long-term correlation. We also propose a numerically and biologically reliable dataset of more than 5700 sequences with low sequence similarity. The results show that the proposed approach achieves high accuracy. Compared with the amino acid composition (AAC) index using a distance method, our approach improves accuracy by 16–20% in the re-substitution test and by about 6–11% in the cross-validation test. The corresponding values for the component coupled method (CCM) were about 23% and 15%. Conclusion: This paper proposes a new index method that combines probability and information theory with a long-term correlation. The statistical method improves significantly with the new index, and the cross-validation test shows that the proposed method yields a substantial improvement.
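The AAC baseline mentioned in this abstract can be illustrated with a minimal sketch. It assumes a Euclidean distance to per-class mean composition vectors; the abstract does not say which distance the authors used, and the function names and toy sequences below are hypothetical, not taken from the paper.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Return the 20-dimensional amino acid composition (fractions) of a sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def class_centroids(labelled):
    """Average the composition vectors of the training sequences in each class."""
    sums, sizes = {}, Counter()
    for seq, label in labelled:
        acc = sums.setdefault(label, [0.0] * len(AMINO_ACIDS))
        for i, value in enumerate(aac(seq)):
            acc[i] += value
        sizes[label] += 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}


def predict(sequence, centroids):
    """Assign the class whose centroid is closest (Euclidean distance, an assumption)."""
    vec = aac(sequence)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Toy, made-up training data purely for illustration.
training = [("AAAALLLLEEEE", "all-alpha"), ("VVVVTTTTIIII", "all-beta")]
centroids = class_centroids(training)
print(predict("AALLEEAK", centroids))
```

A re-substitution test would call `predict` on the training sequences themselves, whereas a cross-validation test would rebuild the centroids without the held-out sequences.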
Applicability of semi-supervised learning assumptions for gene ontology terms prediction
Gene Ontology (GO) is one of the most important resources in bioinformatics, aiming to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task for bioinformatics, but the number of available labelled proteins is in several cases insufficient for training reliable machine learning classifiers. Semi-supervised learning methods arise as a powerful solution that exploits the information contained in unlabelled data in order to improve the estimations of traditional supervised approaches. However, semi-supervised learning methods have to make strong assumptions about the nature of the training data, and thus the performance of the predictor is highly dependent on these assumptions. This paper presents an analysis of the applicability of semi-supervised learning assumptions to the specific task of GO term prediction, focused on providing criteria for choosing the most suitable tools for specific GO terms. The results show that semi-supervised approaches significantly outperform the traditional supervised methods and that the highest performances are reached when applying the cluster assumption. In addition, it is experimentally demonstrated that the cluster and manifold assumptions are complementary to each other, and an analysis is provided of which GO terms are more likely to be correctly predicted with each assumption.
Multi-class protein fold classification using a new ensemble machine learning approach.
Protein structure classification represents an important process in understanding the associations
between sequence and structure as well as possible functional and evolutionary relationships.
Recent structural genomics initiatives and other high-throughput experiments have populated the
biological databases at a rapid pace. The amount of structural data has made traditional methods
such as manual inspection of protein structures impossible. Machine learning has been
widely applied to bioinformatics and has had considerable success in this research area. This work
proposes a novel ensemble machine learning method that improves the coverage of the classifiers
under the multi-class imbalanced sample sets by integrating knowledge induced from different base
classifiers, and we illustrate this idea in classifying multi-class SCOP protein fold data. We have
compared our approach with PART and show that our method improves the sensitivity of the
classifier in protein fold classification. Furthermore, we have extended this method to learning over
multiple data types, preserving the independence of their corresponding data sources, and show
that our new approach performs at least as well as the traditional technique over a single joined
data source. These experimental results are encouraging, and the approach can be applied to other bioinformatics
problems similarly characterised by multi-class imbalanced data sets held in multiple data
sources.
Inter-protein sequence co-evolution predicts known physical interactions in bacterial ribosomes and the trp operon
Interaction between proteins is a fundamental mechanism that underlies
virtually all biological processes. Many important interactions are conserved
across a large variety of species. The need to maintain interaction leads to a
high degree of co-evolution between residues in the interface between partner
proteins. The inference of protein-protein interaction networks from the
rapidly growing sequence databases is one of the most formidable tasks in
systems biology today. We propose here a novel approach based on the
Direct-Coupling Analysis of the co-evolution between inter-protein residue
pairs. We use ribosomal and trp operon proteins as test cases. For the small
and large ribosomal subunits, our approach predicts protein-interaction
partners at true-positive rates of 70% and 90%, respectively, within the first 10
predictions, with areas under the ROC curves of 0.69 and 0.81 for all
predictions. In the trp operon, it assigns the two largest interaction scores
to the only two interactions experimentally known. On the level of residue
interactions we show that for both the small and the large ribosomal subunit
our approach predicts interacting residues in the system with true-positive
rates of 60% and 85%, respectively, in the first 20 predictions. We use artificial data to show
that the performance of our approach depends crucially on the size of the joint
multiple sequence alignments and analyze how many sequences would be necessary
for a perfect prediction if the sequences were sampled from the same model that
we use for prediction. Given the performance of our approach on the test data
we speculate that it can be used to detect new interactions, especially in the
light of the rapid growth of available sequence data.
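The two evaluation measures quoted in this abstract can be sketched as follows. The sketch assumes that the "true-positive rate within the first k predictions" means the fraction of the k highest-scoring predictions that are known interactions, which is one reading of the abstract. The function names and the toy scores are hypothetical.

```python
def top_k_true_positive_rate(scored, k):
    """Fraction of the k highest-scoring predictions that are true interactions.

    `scored` is a list of (score, is_true) pairs.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
    if not ranked:
        raise ValueError("no predictions to rank")
    return sum(1 for _, is_true in ranked if is_true) / len(ranked)


def roc_auc(scored):
    """Area under the ROC curve, computed as the probability that a random
    positive outscores a random negative (ties count as one half)."""
    positives = [score for score, is_true in scored if is_true]
    negatives = [score for score, is_true in scored if not is_true]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up interaction scores for illustration only.
predictions = [(0.9, True), (0.8, False), (0.7, True), (0.1, False)]
print(top_k_true_positive_rate(predictions, 2))
print(roc_auc(predictions))
```

The pairwise formulation of the AUC is quadratic in the number of predictions, which is acceptable for a sketch but would be replaced by a rank-based computation on large prediction lists.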
A flexible integrative approach based on random forest improves prediction of transcription factor binding sites
Transcription factor binding sites (TFBSs) are DNA sequences of 6-15 base pairs. Interaction of these TFBSs with transcription factors (TFs) is largely responsible for most spatiotemporal gene expression patterns. Here, we evaluate to what extent sequence-based prediction of TFBSs can be improved by taking into account the positional dependencies of nucleotides (NPDs) and the nucleotide sequence-dependent structure of DNA. We make use of the random forest algorithm to flexibly exploit both types of information. Results in this study show that both the structural method and the NPD method can be valuable for the prediction of TFBSs. Moreover, their predictive values seem to be complementary, even to the widely used position weight matrix (PWM) method. This led us to combine all three methods. Results obtained for five eukaryotic TFs with different DNA-binding domains show that our method improves classification accuracy for all five compared with other approaches. Additionally, we contrast these results with those for seven smaller prokaryotic sets with high-quality data and show that with the use of high-quality data we can significantly improve prediction performance. Models developed in this study can be of great use for gaining insight into the mechanisms of TF binding.
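For readers unfamiliar with the PWM baseline named in this abstract, a minimal sketch follows. It illustrates only a generic log-odds PWM, not the authors' combined random forest method. The uniform background of 0.25, the pseudocount, the function names, and the example sites are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length binding sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + len(BASES) * pseudocount
    pwm = []
    for pos in range(width):
        column = [site[pos].upper() for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds for a window of the PWM's width (A/C/G/T only)."""
    return sum(column[base] for column, base in zip(pwm, window.upper()))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PWM")
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Made-up aligned sites for illustration only.
pwm = build_pwm(["TATAAT", "TATAAT", "TATGAT"])
print(best_hit(pwm, "GGCTATAATGC"))
```

Because each position contributes independently to the score, a PWM cannot capture the positional dependencies between nucleotides that the abstract describes as complementary information.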
Automated Protein Structure Classification: A Survey
Classification of proteins based on their structure provides a valuable
resource for studying protein structure, function and evolutionary
relationships. With the rapidly increasing number of known protein structures,
manual and semi-automatic classification is becoming ever more difficult and
prohibitively slow. Therefore, there is a growing need for automated, accurate
and efficient classification methods to generate classification databases or
increase the speed and accuracy of semi-automatic techniques. Recognizing this
need, several automated classification methods have been developed. In this
survey, we overview recent developments in this area. We classify different
methods based on their characteristics and compare their methodology, accuracy
and efficiency. We then present a few open problems and explain future
directions. Comment: 14 pages, Technical Report CSRG-589, University of Toront
Protein Structure Prediction: The Next Generation
Over the last 10-15 years a general understanding of the chemical reaction of
protein folding has emerged from statistical mechanics. The lessons learned
from protein folding kinetics based on energy landscape ideas have benefited
protein structure prediction, in particular the development of coarse grained
models. We survey results from blind structure prediction. We explore how
second generation prediction energy functions can be developed by introducing
information from an ensemble of previously simulated structures. This procedure
relies on the assumption of a funnelled energy landscape, in keeping with the
principle of minimal frustration. First generation simulated structures provide
an improved input for associative memory energy functions in comparison to the
experimental protein structures chosen on the basis of sequence alignment
- …