Consensus Fold Recognition by Predicted Model Quality
Protein structure prediction has been a fundamental challenge in the biological field. In this post-genomic era, the need for automated protein structure prediction has never been more evident and researchers are now focusing on developing computational techniques to predict three-dimensional structures with high throughput.
Consensus-based methods are the state of the art in automatic protein structure prediction. A consensus-based server combines the outputs of several individual servers and tends to generate better predictions than any individual server. Consensus-based methods have proved successful in recent CASP (Critical Assessment of Structure Prediction) experiments.
In this thesis, a Support Vector Machine (SVM) regression-based consensus method is proposed for protein fold recognition, a key component for high throughput protein structure prediction and protein function annotation. The SVM first extracts the features of a structural model by comparing the model to the other models produced by all the individual servers. Then, the SVM predicts the quality of each model. The experimental results from several LiveBench data sets confirm that our proposed consensus method, SVM regression, consistently performs better than any individual server. Based on this method, we developed a meta server, the Alignment by Consensus Estimation (ACE)
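The feature-extraction step described above (comparing each model with the models produced by the other servers) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the pairwise similarity matrix is assumed to be precomputed, the function names are hypothetical, and ranking by mean similarity is a simple stand-in for the SVM regression that the abstract uses to predict model quality.

```python
from statistics import mean


def consensus_features(similarity):
    """Return per-model consensus features.

    similarity[i][j] is an assumed, precomputed structural similarity
    between model i and model j (higher means more alike).
    """
    n = len(similarity)
    if n < 2:
        raise ValueError("need at least two models to compare")
    features = []
    for i in range(n):
        # Compare model i with every other model, never with itself.
        others = [similarity[i][j] for j in range(n) if j != i]
        features.append({"mean": mean(others), "max": max(others)})
    return features


def rank_by_consensus(similarity):
    """Rank model indices by mean similarity to the other models.

    Stand-in for a trained regressor: a real system would feed the
    features to a model that predicts quality.
    """
    feats = consensus_features(similarity)
    return sorted(range(len(feats)), key=lambda i: feats[i]["mean"], reverse=True)
```

A model that agrees with many other servers' models scores high on these features, which is the intuition behind consensus selection.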
CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures
We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification
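The iterative "verify, excise, repeat" loop described in this abstract can be sketched as a greedy assignment over candidate fold matches. This is only an illustration under stated assumptions: the candidate tuples, the `min_score` threshold, and the function name are hypothetical, and the candidates are assumed to be precomputed rather than produced by the graph-based search and double-dynamic programming alignment that CATHEDRAL uses.

```python
def assign_domains(chain_length, candidates, min_score=0.5):
    """Greedily assign non-overlapping domains to a chain.

    candidates: list of (fold_name, start, end, score) tuples, with
    residues start..end-1 covered by the match (hypothetical format).
    """
    unassigned = set(range(chain_length))
    assigned = []
    while True:
        # Keep only confident, non-empty matches that fit entirely
        # inside the residues that have not been excised yet.
        viable = [
            c for c in candidates
            if c[3] >= min_score
            and c[2] > c[1]
            and set(range(c[1], c[2])) <= unassigned
        ]
        if not viable:
            break
        best = max(viable, key=lambda c: c[3])
        assigned.append(best)
        # Excise the verified domain before searching again.
        unassigned -= set(range(best[1], best[2]))
    return sorted(assigned, key=lambda c: c[1])
```

Each pass removes the residues of the accepted domain, so later passes can only place folds in the remaining part of the chain.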
DeepSF: deep convolutional neural network for mapping protein sequences to folds
Motivation
Protein fold recognition is an important problem in structural
bioinformatics. Almost all traditional fold recognition methods use sequence
(homology) comparison to indirectly predict the fold of a target protein based
on the fold of a template protein with known structure, which cannot explain
the relationship between sequence and fold. Only a few methods have been
developed to classify protein sequences into a small number of folds due to
methodological limitations, which are not generally useful in practice.
Results
We develop a deep 1D-convolution neural network (DeepSF) to directly classify
any protein sequence into one of 1195 known folds, which is useful for both
fold recognition and the study of sequence-structure relationships. Unlike
traditional sequence alignment (comparison) based methods, our method
automatically extracts fold-related features from a protein sequence of any
length and maps it to the fold space. We train and test our method on the
datasets curated from SCOP1.75, yielding a classification accuracy of 80.4%. On
the independent testing dataset curated from SCOP2.06, the classification
accuracy is 77.0%. We compare our method with a top profile-profile alignment
method, HHSearch, on hard template-based and template-free modeling targets of
CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is
14.5%-29.1% higher than HHSearch on template-free modeling targets and
4.5%-16.7% higher on hard template-based modeling targets for top 1, 5, and 10
predicted folds. The hidden features extracted from sequence by our method are
robust against sequence mutation, insertion, deletion and truncation, and can
be used for other protein pattern recognition problems such as protein
clustering, comparison and ranking.
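One way a 1D convolutional network can turn a sequence of any length into a fixed-length feature vector is to follow the convolution with a global pooling step. The sketch below illustrates that idea only. It is not DeepSF's architecture: the one-hot encoding, the ReLU activation, the use of global max pooling, and all names are assumptions made for the example, and the filters here are hand-set rather than learned.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Encode each residue as a 20-dimensional one-hot row.

    Letters outside AMINO_ACIDS become all-zero rows.
    """
    return [[1.0 if aa == a else 0.0 for a in AMINO_ACIDS] for aa in seq]


def conv1d_global_max(encoded, filters):
    """Apply 1D filters along the sequence, then global max pooling.

    filters: list of filters; each filter is a list of `width` rows
    holding len(AMINO_ACIDS) weights. The output has one value per
    filter, regardless of sequence length.
    """
    pooled = []
    for f in filters:
        width = len(f)
        activations = []
        for start in range(len(encoded) - width + 1):
            s = sum(
                f[k][c] * encoded[start + k][c]
                for k in range(width)
                for c in range(len(AMINO_ACIDS))
            )
            activations.append(max(0.0, s))  # ReLU
        # default covers sequences shorter than the filter width
        pooled.append(max(activations, default=0.0))
    return pooled
```

Because pooling collapses the position axis, sequences of 6 and 600 residues both yield a vector with one entry per filter, which a classifier layer can then map to fold labels.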
Structure and functional motifs of GCR1, the only plant protein with a GPCR fold?
Whether GPCRs exist in plants is a fundamental biological question. Interest in deorphanizing new G
protein coupled receptors (GPCRs) arises because of their importance in signaling. Within plants, this
is controversial, as genome analysis has identified 56 putative GPCRs, including GCR1, which is
reportedly a remote homologue to class A, B and E GPCRs. Of these, GCR2 is not a GPCR; more
recently it has been proposed that none are, not even GCR1. We have addressed this disparity
between genome analysis and biological evidence through a structural bioinformatics study, involving
fold recognition methods, from which only GCR1 emerges as a strong candidate. To further probe
GCR1, we have developed a novel helix alignment method, which has been benchmarked against
the class A – class B – class F GPCR alignments. In addition, we have presented a mutually consistent
set of alignments of GCR1 homologues to class A, class B and class F GPCRs, and shown that GCR1
is closer to class A and/or class B GPCRs than class A, class B or class F GPCRs are to each other.
To further probe GCR1, we have aligned transmembrane helix 3 of GCR1 to each of the 6 GPCR
classes. Variability comparisons provide additional evidence that GCR1 homologues have the GPCR
fold. From the alignments and a GCR1 comparative model we have identified motifs that are common
to GCR1, class A, B and E GPCRs. We discuss the possibilities that emerge from this controversial
evidence that GCR1 has a GPCR fol
Human pol II promoter prediction: time series descriptors and machine learning
Although several in silico promoter prediction methods have been developed to date, they are still limited in predictive performance. The limitations stem from the difficulty of selecting features that distinguish promoters from non-promoters and from the limited generalization ability of the machine-learning algorithms. In this paper we attempt to define a novel approach by using unique descriptors and machine-learning methods for the recognition of eukaryotic polymerase II promoters. In this study, non-linear time series descriptors along with non-linear machine-learning algorithms, such as the support vector machine (SVM), are used to discriminate between promoter and non-promoter regions. The basic idea here is to use descriptors that do not depend on the primary DNA sequence and provide a clear distinction between promoter and non-promoter regions. The classification model, built on a set of 1000 promoter and 1500 non-promoter sequences, showed a 10-fold cross-validation accuracy of 87%, and on an independent test set its accuracy was >85% in both promoter and non-promoter identification. This approach correctly identified all 20 experimentally verified promoters of human chromosome 22. The high sensitivity and selectivity indicate that n-mer frequencies along with non-linear time series descriptors, such as Lyapunov component stability and Tsallis entropy, and supervised machine-learning methods, such as SVMs, can be useful in the identification of pol II promoters
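Two of the descriptors named above, n-mer frequencies and Tsallis entropy, can be illustrated with a short sketch. The Tsallis entropy formula, S_q = (1 - Σ p_i^q) / (q - 1), is standard. Applying it to the n-mer frequency distribution of a sequence, the choice q = 2, and the function names are assumptions made for this example; the abstract does not specify how the paper computes its descriptors.

```python
from collections import Counter


def nmer_frequencies(seq, n):
    """Relative frequencies of overlapping n-mers in a sequence."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than n: no n-mers to count.
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p_i ** q)) / (q - 1), for q != 1."""
    if q == 1.0:
        raise ValueError("q = 1 is the Shannon limit; use q != 1 here")
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)
```

For example, `tsallis_entropy(nmer_frequencies(seq, 3).values())` gives one scalar per sequence, which could serve as one entry in a feature vector passed to a classifier such as an SVM.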
Protein Structure Prediction: The Next Generation
Over the last 10-15 years a general understanding of the chemical reaction of
protein folding has emerged from statistical mechanics. The lessons learned
from protein folding kinetics based on energy landscape ideas have benefited
protein structure prediction, in particular the development of coarse grained
models. We survey results from blind structure prediction. We explore how
second generation prediction energy functions can be developed by introducing
information from an ensemble of previously simulated structures. This procedure
relies on the assumption of a funnelled energy landscape, in keeping with the
principle of minimal frustration. First generation simulated structures provide
an improved input for associative memory energy functions in comparison to the
experimental protein structures chosen on the basis of sequence alignment
- …