10,050 research outputs found
Automated Protein Structure Classification: A Survey
Classification of proteins based on their structure provides a valuable
resource for studying protein structure, function and evolutionary
relationships. With the rapidly increasing number of known protein structures,
manual and semi-automatic classification is becoming ever more difficult and
prohibitively slow. Therefore, there is a growing need for automated, accurate
and efficient classification methods to generate classification databases or
increase the speed and accuracy of semi-automatic techniques. Recognizing this
need, several automated classification methods have been developed. In this
survey, we overview recent developments in this area. We classify different
methods based on their characteristics and compare their methodology, accuracy
and efficiency. We then present a few open problems and explain future
directions.Comment: 14 pages, Technical Report CSRG-589, University of Toront
Recommended from our members
Multi-class protein fold classification using a new ensemble machine learning approach.
Protein structure classification represents an important process in understanding the associations
between sequence and structure as well as possible functional and evolutionary relationships.
Recent structural genomics initiatives and other high-throughput experiments have populated the
biological databases at a rapid pace. The amount of structural data has made traditional methods
such as manual inspection of the protein structure become impossible. Machine learning has been
widely applied to bioinformatics and has gained a lot of success in this research area. This work
proposes a novel ensemble machine learning method that improves the coverage of the classifiers
under the multi-class imbalanced sample sets by integrating knowledge induced from different base
classifiers, and we illustrate this idea in classifying multi-class SCOP protein fold data. We have
compared our approach with PART and show that our method improves the sensitivity of the
classifier in protein fold classification. Furthermore, we have extended this method to learning over
multiple data types, preserving the independence of their corresponding data sources, and show
that our new approach performs at least as well as the traditional technique over a single joined
data source. These experimental results are encouraging, and can be applied to other bioinformatics
problems similarly characterised by multi-class imbalanced data sets held in multiple data
sources
EFICAz²: enzyme function inference by a combined approach enhanced by machine learning
©2009 Arakaki et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/10/107doi:10.1186/1471-2105-10-107Background: We previously developed EFICAz, an enzyme function inference approach that combines predictions from non-completely overlapping component methods. Two of the four components in the original EFICAz are based on the detection of functionally discriminating residues (FDRs). FDRs distinguish between member of an enzyme family that are homofunctional (classified under the EC number of interest) or heterofunctional (annotated with another EC number or lacking enzymatic activity). Each of the two FDR-based components is associated to one of two specific kinds of enzyme families. EFICAz exhibits high precision performance, except when the maximal test to training sequence identity (MTTSI) is lower than 30%. To improve EFICAz's performance in this regime, we: i) increased the number of predictive components and ii) took advantage of consensual information from the different components to make the final EC number assignment. Results: We have developed two new EFICAz components, analogs to the two FDR-based components, where the discrimination between homo and heterofunctional members is based on the evaluation, via Support Vector Machine models, of all the aligned positions between the query sequence and the multiple sequence alignments associated to the enzyme families. Benchmark results indicate that: i) the new SVM-based components outperform their FDR-based counterparts, and ii) both SVM-based and FDR-based components generate unique predictions. We developed classification tree models to optimally combine the results from the six EFICAz components into a final EC number prediction. The new implementation of our approach, EFICAz², exhibits a highly improved prediction precision at MTTSI < 30% compared to the original EFICAz, with only a slight decrease in prediction recall. A comparative analysis of enzyme function annotation of the human proteome by EFICAz² and KEGG shows that: i) when both sources make EC number assignments for the same protein sequence, the assignments tend to be consistent and ii) EFICAz² generates considerably more unique assignments than KEGG. Conclusion: Performance benchmarks and the comparison with KEGG demonstrate that EFICAz² is a powerful and precise tool for enzyme function annotation, with multiple applications in genome analysis and metabolic pathway reconstruction. The EFICAz² web service is available at: http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.htm
Functional classification of G-Protein coupled receptors, based on their specific ligand coupling patterns
Functional identification of G-Protein Coupled Receptors (GPCRs) is one of the current focus areas of pharmaceutical research. Although thousands of GPCR sequences are known, many of them re- main as orphan sequences (the activating ligand is unknown). Therefore, classification methods for automated characterization of orphan GPCRs are imperative. In this study, for predicting Level 2 subfamilies of Amine GPCRs, a novel method for obtaining fixed-length feature vectors, based on the existence of activating ligand specific patterns, has been developed and utilized for a Support Vector Machine (SVM)-based classification. Exploiting the fact that there is a non-promiscuous relationship between the specific binding of GPCRs into their ligands and their functional classification, our method classifies Level 2 subfamilies of Amine GPCRs with a high predictive accuracy of 97.02% in a ten-fold cross validation test. The presented machine learning approach, bridges the gulf between the excess amount of GPCR sequence data and their poor functional characterization
Protein Tertiary Model Assessment Using Granular Machine Learning Techniques
The automatic prediction of protein three dimensional structures from its amino acid sequence has become one of the most important and researched fields in bioinformatics. As models are not experimental structures determined with known accuracy but rather with prediction it’s vital to determine estimates of models quality. We attempt to solve this problem using machine learning techniques and information from both the sequence and structure of the protein. The goal is to generate a machine that understands structures from PDB and when given a new model, predicts whether it belongs to the same class as the PDB structures (correct or incorrect protein models). Different subsets of PDB (protein data bank) are considered for evaluating the prediction potential of the machine learning methods. Here we show two such machines, one using SVM (support vector machines) and another using fuzzy decision trees (FDT). First using a preliminary encoding style SVM could get around 70% in protein model quality assessment accuracy, and improved Fuzzy Decision Tree (IFDT) could reach above 80% accuracy. For the purpose of reducing computational overhead multiprocessor environment and basic feature selection method is used in machine learning algorithm using SVM.
Next an enhanced scheme is introduced using new encoding style. In the new style, information like amino acid substitution matrix, polarity, secondary structure information and relative distance between alpha carbon atoms etc is collected through spatial traversing of the 3D structure to form training vectors. This guarantees that the properties of alpha carbon atoms that are close together in 3D space and thus interacting are used in vector formation. With the use of fuzzy decision tree, we obtained a training accuracy around 90%. There is significant improvement compared to previous encoding technique in prediction accuracy and execution time. This outcome motivates to continue to explore effective machine learning algorithms for accurate protein model quality assessment.
Finally these machines are tested using CASP8 and CASP9 templates and compared with other CASP competitors, with promising results. We further discuss the importance of model quality assessment and other information from proteins that could be considered for the same
Automatic structure classification of small proteins using random forest
<p>Abstract</p> <p><b>Background</b></p> <p>Random forest, an ensemble based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs.</p> <p><b>Result</b>s</p> <p>Random forest, trained on SCOP version 1.69, achieves a predictive accuracy of up to 94% on an independent and non-overlapping test set derived from SCOP version 1.73. For classification to the SCOP <it>Class, Fold, Super-family </it>or <it>Family </it>levels, the predictive quality of the model in terms of Matthew's correlation coefficient (MCC) ranged from 0.61 to 0.83. As the number of constituent SSEs increases the MCC for classification to different structural levels decreases.</p> <p>Conclusions</p> <p>The utility of random forest in classifying domains from the place-holder classes of SCOP to the true <it>Class, Fold, Super-family </it>or <it>Family </it>levels is demonstrated. Issues such as introduction of a new structural level in SCOP and the merger of singleton levels can also be addressed using random forest. A real-world scenario is mimicked by predicting the classification for those protein structures from the PDB, which are yet to be assigned to the SCOP classification hierarchy.</p
Machine learning-guided directed evolution for protein engineering
Machine learning (ML)-guided directed evolution is a new paradigm for
biological design that enables optimization of complex functions. ML methods
use data to predict how sequence maps to function without requiring a detailed
model of the underlying physics or biological pathways. To demonstrate
ML-guided directed evolution, we introduce the steps required to build ML
sequence-function models and use them to guide engineering, making
recommendations at each stage. This review covers basic concepts relevant to
using ML for protein engineering as well as the current literature and
applications of this new engineering paradigm. ML methods accelerate directed
evolution by learning from information contained in all measured variants and
using that information to select sequences that are likely to be improved. We
then provide two case studies that demonstrate the ML-guided directed evolution
process. We also look to future opportunities where ML will enable discovery of
new protein functions and uncover the relationship between protein sequence and
function.Comment: Made significant revisions to focus on aspects most relevant to
applying machine learning to speed up directed evolutio
- …