CATHEDRAL: A Fast and Effective Algorithm to Predict Folds and Domain Boundaries from Multidomain Protein Structures
We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.
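The excise-and-repeat loop described in the CATHEDRAL abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the CATHEDRAL implementation: `find_best_domain` is a hypothetical stand-in for the graph-based scan, the double-dynamic-programming alignment and the SVM scoring, and `toy_scanner` with its two made-up folds exists only to make the example runnable.

```python
from typing import Callable, List, Optional, Set, Tuple

# A domain hit: (fold name, set of residue indices assigned to the domain).
Hit = Tuple[str, Set[int]]


def assign_domains(
    n_residues: int,
    find_best_domain: Callable[[Set[int]], Optional[Hit]],
) -> List[Hit]:
    """Find a domain in the unassigned residues, excise it, and search again."""
    remaining = set(range(n_residues))
    domains: List[Hit] = []
    while remaining:
        hit = find_best_domain(remaining)
        if hit is None:  # no recognisable domain is left
            break
        fold, residues = hit
        residues = residues & remaining
        if not residues:  # guard against a scanner that makes no progress
            break
        domains.append((fold, residues))
        remaining -= residues  # excise the verified domain
    return domains


def toy_scanner(known: List[Hit]) -> Callable[[Set[int]], Optional[Hit]]:
    """Hypothetical scanner: first known domain lying fully inside `remaining`."""
    def scan(remaining: Set[int]) -> Optional[Hit]:
        for fold, residues in known:
            if residues <= remaining:
                return fold, set(residues)
        return None
    return scan


# Made-up example: a 300-residue chain with two recognisable domains.
known_hits = [("fold_A", set(range(0, 120))), ("fold_B", set(range(120, 250)))]
result = assign_domains(300, toy_scanner(known_hits))
```

In this toy run the last 50 residues stay unassigned because no known fold matches them, which mirrors the stopping condition "until all recognisable domains have been identified".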
DeepSF: deep convolutional neural network for mapping protein sequences to folds
Motivation
Protein fold recognition is an important problem in structural
bioinformatics. Almost all traditional fold recognition methods use sequence
(homology) comparison to indirectly predict the fold of a target protein based
on the fold of a template protein with known structure, which cannot explain
the relationship between sequence and fold. Only a few methods have been
developed to classify protein sequences into a small number of folds due to
methodological limitations, which are not generally useful in practice.
Results
We develop a deep 1D-convolution neural network (DeepSF) to directly classify
any protein sequence into one of 1195 known folds, which is useful for both
fold recognition and the study of the sequence-structure relationship. Different
from traditional sequence alignment (comparison) based methods, our method
automatically extracts fold-related features from a protein sequence of any
length and maps it to the fold space. We train and test our method on the
datasets curated from SCOP1.75, yielding a classification accuracy of 80.4%. On
the independent testing dataset curated from SCOP2.06, the classification
accuracy is 77.0%. We compare our method with a top profile-profile alignment
method - HHSearch on hard template-based and template-free modeling targets of
CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is
14.5%-29.1% higher than HHSearch on template-free modeling targets and
4.5%-16.7% higher on hard template-based modeling targets for top 1, 5, and 10
predicted folds. The hidden features extracted from sequence by our method are
robust against sequence mutation, insertion, deletion and truncation, and can
be used for other protein pattern recognition problems such as protein
clustering, comparison and ranking.
Comment: 28 pages, 13 figures
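To illustrate how a 1D convolutional model can turn a sequence of any length into a fixed-size feature vector, here is a minimal pure-Python sketch. It is not the DeepSF architecture: the one-hot encoding, the two hand-made motif filters and the use of global max pooling are all illustrative assumptions.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq: str) -> List[List[float]]:
    """Encode each residue as a 20-dimensional indicator vector."""
    return [[1.0 if aa == a else 0.0 for a in AMINO_ACIDS] for aa in seq]


def conv1d_global_max(
    x: List[List[float]], filters: List[List[List[float]]]
) -> List[float]:
    """Slide each filter (width x channels) along x and keep its maximum response."""
    n_channels = len(x[0])
    features = []
    for w in filters:
        width = len(w)
        responses = [
            sum(w[k][c] * x[i + k][c] for k in range(width) for c in range(n_channels))
            for i in range(len(x) - width + 1)
        ]
        features.append(max(responses))  # pooling removes the length dimension
    return features


def motif_filter(motif: str) -> List[List[float]]:
    """Toy filter whose response counts positions matching `motif` (hypothetical weights)."""
    return one_hot(motif)


filters = [motif_filter("GA"), motif_filter("WW")]
short_feats = conv1d_global_max(one_hot("MGAK"), filters)
long_feats = conv1d_global_max(one_hot("MKKLLGAVVGAPPLW"), filters)
```

Both inputs yield a vector with one entry per filter, regardless of sequence length, which is what allows a classifier over a fixed number of folds to sit on top. The sketch assumes every sequence is at least as long as the widest filter.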
Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model
Recently exciting progress has been made on protein contact prediction, but
the predicted contacts for proteins without many sequence homologs are still of
low quality and not very useful for de novo structure prediction. This paper
presents a new deep learning method that predicts contacts by integrating both
evolutionary coupling (EC) and sequence conservation information through an
ultra-deep neural network formed by two deep residual networks. This deep
neural network allows us to model very complex sequence-contact relationship as
well as long-range inter-contact correlation. Our method greatly outperforms
existing contact prediction methods and leads to much more accurate
contact-assisted protein folding. Tested on three datasets of 579 proteins, the
average top L long-range prediction accuracy obtained by our method, the
representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21
and 0.30, respectively; the average top L/10 long-range accuracy of our method,
CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding
using our predicted contacts as restraints can yield correct folds (i.e.,
TMscore>0.6) for 203 test proteins, while that using MetaPSICOV- and
CCMpred-predicted contacts can do so for only 79 and 62 proteins, respectively.
Further, our contact-assisted models have much better quality than
template-based models. Using our predicted contacts as restraints, we can (ab
initio) fold 208 of the 398 membrane proteins with TMscore>0.5. By contrast,
when the training proteins of our method are used as templates, homology
modeling can only do so for 10 of them. One interesting finding is that even if
we do not train our prediction models with any membrane proteins, our method
works very well on membrane protein prediction. Finally, in the recent blind CAMEO
benchmark our method successfully folded 5 test proteins with a novel fold.
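The "top L" and "top L/10" long-range accuracies quoted above can be computed along the following lines. This is a sketch under assumptions that the abstract does not state: long-range is taken to mean a sequence separation of at least 24 residues (a common convention, exposed here as `min_sep`), the accuracy is divided by the number of pairs actually selected, and the scores and true contacts are made-up toy values.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]


def top_l_over_k_accuracy(
    scores: Dict[Pair, float],
    true_contacts: Set[Pair],
    seq_len: int,
    k: int = 1,
    min_sep: int = 24,
) -> float:
    """Fraction of the top L/k scored long-range pairs that are true contacts."""
    # Keep only pairs whose sequence separation qualifies as long-range.
    long_range = [(p, s) for p, s in scores.items() if abs(p[0] - p[1]) >= min_sep]
    # Rank by predicted score, highest first.
    long_range.sort(key=lambda item: item[1], reverse=True)
    n_top = max(1, seq_len // k)
    top = long_range[:n_top]
    if not top:
        return 0.0
    return sum(1 for p, _ in top if p in true_contacts) / len(top)


# Toy data for a hypothetical 50-residue protein.
toy_scores = {
    (0, 30): 0.9, (1, 40): 0.8, (2, 45): 0.7, (3, 35): 0.6,
    (4, 49): 0.5, (6, 48): 0.1,
    (5, 10): 0.99,  # short-range pair, ignored by the long-range filter
}
toy_true = {(0, 30), (2, 45), (4, 49), (5, 10)}
acc = top_l_over_k_accuracy(toy_scores, toy_true, seq_len=50, k=10)
```

With L = 50 and k = 10, the five highest-scoring long-range pairs are evaluated and three of them are true contacts.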
The Phyre2 web portal for protein modeling, prediction and analysis
Phyre2 is a suite of tools available on the web to predict and analyze protein structure, function and mutations. The focus of Phyre2 is to provide biologists with a simple and intuitive interface to state-of-the-art protein bioinformatics tools. Phyre2 replaces Phyre, the original version of the server for which we previously published a paper in Nature Protocols. In this updated protocol, we describe Phyre2, which uses advanced remote homology detection methods to build 3D models, predict ligand binding sites and analyze the effect of amino acid variants (e.g., nonsynonymous SNPs (nsSNPs)) for a user's protein sequence. Users are guided through results by a simple interface at a level of detail they determine. This protocol will guide users from submitting a protein sequence to interpreting the secondary and tertiary structure of their models, their domain composition and model quality. A range of additional available tools is described to find a protein structure in a genome, to submit a large number of sequences at once and to automatically run weekly searches for proteins that are difficult to model. The server is available at http://www.sbg.bio.ic.ac.uk/phyre2. A typical structure prediction will be returned between 30 min and 2 h after submission.
Introduction to Protein Structure Prediction
This chapter gives a graceful introduction to the problem of protein
three-dimensional structure prediction, and focuses on how to make structural
sense out of a single input sequence with unknown structure, the 'query' or
'target' sequence. We give an overview of the different classes of modelling
techniques, notably template-based and template-free. We also discuss the way
in which structural predictions are validated within the global community, and
elaborate on the extent to which predicted structures may be trusted and used
in practice. Finally, we discuss whether the concept of a single fold
pertaining to a protein structure is sustainable given recent insights. In
short, we conclude that the general protein three-dimensional structure
prediction problem remains unsolved, especially if we desire quantitative
predictions. However, if a homologous structural template is available in the
PDB, a model of reasonable to high accuracy may be generated.
Protein secondary structure: Entropy, correlations and prediction
Is protein secondary structure primarily determined by local interactions
between residues closely spaced along the amino acid backbone, or by non-local
tertiary interactions? To answer this question we have measured the entropy
densities of primary structure and secondary structure sequences, and the local
inter-sequence mutual information density. We find that the important
inter-sequence interactions are short ranged, that correlations between
neighboring amino acids are essentially uninformative, and that only 1/4 of the
total information needed to determine the secondary structure is available from
local inter-sequence correlations. Since the remaining information must come
from non-local interactions, this observation supports the view that most
proteins fold via a cooperative process where secondary and
tertiary structure form concurrently. To provide a more direct comparison to
existing secondary structure prediction methods, we construct a simple hidden
Markov model (HMM) of the sequences. This HMM achieves a prediction accuracy
comparable to other single sequence secondary structure prediction algorithms,
and can extract almost all of the inter-sequence mutual information. This
suggests that these algorithms are almost optimal, and that we should not
expect a dramatic improvement in prediction accuracy. However, local
correlations between secondary and primary structure are probably of
under-appreciated importance in many tertiary structure prediction methods,
such as threading.
Comment: 8 pages, 5 figures
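The information-theoretic quantities in this abstract rest on entropy and mutual information. The sketch below shows a simple plug-in estimate of both for paired residue and secondary-structure symbols. It is an illustration only: the authors measure entropy densities over sequences, which is more involved, and the toy strings here are made up.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def entropy(symbols: Sequence[Hashable]) -> float:
    """Plug-in Shannon entropy (in bits) of the empirical symbol distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired observations."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Toy case 1: the secondary-structure label is fully determined by the residue.
mi_dependent = mutual_information("AAGGAAGG", "HHCCHHCC")
# Toy case 2: residue and label vary independently.
mi_independent = mutual_information("AGAG", "HHCC")
```

In the first toy case the residue carries the full one bit needed to fix the label; in the second it carries none. The abstract's finding that about 1/4 of the required information is available locally lies between these two extremes.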
Distance-based Protein Folding Powered by Deep Learning
Contact-assisted protein folding has made very good progress, but two
challenges remain. One is accurate contact prediction for proteins lacking many
sequence homologs, and the other is that time-consuming folding simulation is
often needed to predict good 3D models from predicted contacts. We show that
the protein distance matrix can be predicted well by deep learning and then
directly used to construct 3D models without folding simulation at all. Using
distance geometry to construct 3D models from our predicted distance matrices,
we successfully folded 21 of the 37 CASP12 hard targets with a median family
size of 58 effective sequence homologs within 4 hours on a Linux computer of 20
CPUs. In contrast, contacts predicted by direct coupling analysis (DCA) cannot
fold any of them in the absence of folding simulation and the best CASP12 group
folded 11 of them by integrating predicted contacts into complex,
fragment-based folding simulation. The rigorous experimental validation on 15
CASP13 targets shows that among the 3 hardest targets of new fold our
distance-based folding servers successfully folded 2 large ones with <150
sequence homologs while the other servers failed on all three, and that our ab
initio folding server also predicted the best, high-quality 3D model for a
large homology modeling target. Further experimental validation in CAMEO shows
that our ab initio folding server predicted the correct fold for a membrane protein
of new fold with 200 residues and 229 sequence homologs while all the other
servers failed. These results imply that deep learning offers an efficient and
accurate solution for ab initio folding on a personal computer.
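One standard way to go from a distance matrix to coordinates is classical multidimensional scaling, whose first step converts squared distances into a Gram matrix of inner products by double centering. The abstract does not say which distance-geometry procedure the authors use, so the sketch below is an assumption-laden illustration of that first step only, on three made-up points.

```python
from math import dist
from typing import List

Matrix = List[List[float]]


def gram_from_distances(d: Matrix) -> Matrix:
    """Double-center squared distances: G = -1/2 * J * D^2 * J, J = I - (1/n) 11^T."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    # D is symmetric, so row means and column means coincide.
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean) for j in range(n)]
        for i in range(n)
    ]


# Made-up points forming a 3-4-5 right triangle.
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
d = [[dist(p, q) for q in points] for p in points]
gram = gram_from_distances(d)
```

Each entry of `gram` equals the dot product of the corresponding two points after subtracting their centroid. For exact Euclidean distances, the top three eigenvectors of this matrix, scaled by the square roots of their eigenvalues, recover 3D coordinates up to rotation and reflection. Predicted distances are noisy, so a practical pipeline would need further refinement beyond this step.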