Empirical Potential Function for Simplified Protein Models: Combining Contact and Local Sequence-Structure Descriptors
An effective potential function is critical for protein structure prediction
and folding simulation. Simplified protein models such as those requiring only
Cα or backbone atoms are attractive because they enable efficient
search of the conformational space. We show that residue-specific reduced
discrete-state models can represent the backbone conformations of proteins with small
RMSD values. However, no potential functions exist that are designed for such
simplified protein models. In this study, we develop optimal potential
functions by combining contact interaction descriptors and local
sequence-structure descriptors. The form of the potential function is a
weighted linear sum of all descriptors, and the optimal weight coefficients are
obtained through optimization using both native and decoy structures. The
performance of the potential function in tests of discriminating native protein
structures from decoys is evaluated using several benchmark decoy sets. Our
potential function, which requires only backbone atoms or Cα atoms, has
comparable or better performance than several residue-based potential functions
that require additional coordinates of side chain centers or coordinates of all
side chain atoms. Reducing the residue alphabet to size 5 for the local
structure-sequence relationship further improves the performance of the
potential function. Our results also suggest that local sequence-structure
correlation may play an important role in reducing the entropic cost of protein
folding.
Comment: 20 pages, 5 figures, 4 tables. In press, Protein
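The weighted linear sum described in this abstract can be illustrated with a minimal sketch. The descriptor values, the weights, and the helper names `potential` and `rank_of_native` are hypothetical and chosen only for illustration; the abstract does not specify the actual descriptors or fitted coefficients.

```python
from typing import Sequence


def potential(weights: Sequence[float], descriptors: Sequence[float]) -> float:
    """Weighted linear sum H(s) = sum_i w_i * c_i(s) over all descriptors."""
    if len(weights) != len(descriptors):
        raise ValueError("weights and descriptors must have equal length")
    return sum(w * c for w, c in zip(weights, descriptors))


def rank_of_native(weights: Sequence[float],
                   native: Sequence[float],
                   decoys: Sequence[Sequence[float]]) -> int:
    """1-based rank of the native structure when sorted by increasing energy."""
    e_native = potential(weights, native)
    return 1 + sum(1 for d in decoys if potential(weights, d) < e_native)


# Toy descriptor vectors (assumed values, not taken from the paper).
weights = [-1.0, 0.5, -0.25]
native = [4.0, 1.0, 2.0]
decoys = [[2.0, 1.0, 2.0], [3.0, 3.0, 0.0]]

print(potential(weights, native))
print(rank_of_native(weights, native, decoys))
```

A rank of 1 means the native structure has the lowest energy among all candidates, which is the discrimination criterion the abstract refers to.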
The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases
One of the most intriguing groups of enzymes, the feruloyl esterases (FAEs), is ubiquitous in both simple and complex organisms. FAEs have gained importance in the biofuel, medicine and food industries due to their capability of acting on a large range of substrates for cleaving ester bonds and synthesizing high added-value molecules through esterification and transesterification reactions. During the past two decades extensive studies have been carried out on the production and partial characterization of FAEs from fungi, while much less is known about FAEs of bacterial or plant origin. Initial classification studies on FAEs were restricted to sequence similarity and substrate specificity on just four model substrates and considered only a handful of FAEs belonging to the fungal kingdom. This study centers on the descriptor-based classification and structural analysis of experimentally verified and putative FAEs; nevertheless, the framework presented here is applicable to every poorly characterized enzyme family. A total of 365 FAE-related sequences of fungal, bacterial and plant origin were collected and clustered using Self Organizing Maps followed by k-means clustering into distinct groups based on amino acid composition and physico-chemical composition descriptors derived from the respective amino acid sequence. A Support Vector Machine model was subsequently constructed for the classification of new FAEs into the pre-assigned clusters. The model successfully recognized 98.2% of the training sequences and all the sequences of the blind test. The underlying functionality of the 12 proposed FAE families was validated against a combination of prediction tools and published experimental data. Another important aspect of the present work involves the development of pharmacophore models for the new FAE families, for which sufficient information on known substrates existed.
Knowing the pharmacophoric features of a small molecule that are essential for binding to the members of a certain family opens a window of opportunities for tailored applications of FAEs.
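As a rough illustration of the amino acid composition descriptor mentioned in this abstract, the sketch below turns a sequence into a 20-dimensional frequency vector. The function name `aa_composition` and the handling of non-standard residues (they are ignored) are assumptions; the clustering and Support Vector Machine steps are not reproduced here.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence: str) -> List[float]:
    """Fraction of each standard amino acid in the sequence.

    Characters outside the 20 standard residues are ignored
    (an assumption made for this sketch).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


vector = aa_composition("AACD")
print(vector[:3])
```

Vectors of this kind could then be fed to any clustering or classification routine; which physico-chemical descriptors the study adds on top is not stated in the abstract.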
DeepSF: deep convolutional neural network for mapping protein sequences to folds
Motivation
Protein fold recognition is an important problem in structural
bioinformatics. Almost all traditional fold recognition methods use sequence
(homology) comparison to indirectly predict the fold of a target protein based
on the fold of a template protein with known structure, which cannot explain
the relationship between sequence and fold. Only a few methods have been
developed to classify protein sequences into a small number of folds due to
methodological limitations, and these are not generally useful in practice.
Results
We develop a deep 1D-convolution neural network (DeepSF) to directly classify
any protein sequence into one of 1195 known folds, which is useful for both
fold recognition and the study of the sequence-structure relationship. Unlike
traditional sequence alignment (comparison) based methods, our method
automatically extracts fold-related features from a protein sequence of any
length and maps it to the fold space. We train and test our method on the
datasets curated from SCOP1.75, yielding a classification accuracy of 80.4%. On
the independent testing dataset curated from SCOP2.06, the classification
accuracy is 77.0%. We compare our method with a top profile-profile alignment
method, HHSearch, on hard template-based and template-free modeling targets of
CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is
14.5%-29.1% higher than HHSearch on template-free modeling targets and
4.5%-16.7% higher on hard template-based modeling targets for top 1, 5, and 10
predicted folds. The hidden features extracted from sequence by our method are
robust against sequence mutation, insertion, deletion and truncation, and can
be used for other protein pattern recognition problems such as protein
clustering, comparison and ranking.
Comment: 28 pages, 13 figures
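To show how a 1D convolution can map a sequence of any length to a fixed-size feature vector, here is a minimal pure-Python sketch. It is not DeepSF's architecture: the one-hot encoding, the hand-written motif filters, the ReLU, and the global max pooling are all assumptions chosen to keep the example small.

```python
from typing import List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
Matrix = List[List[float]]


def one_hot(sequence: str) -> Matrix:
    """Encode a sequence as an L x 20 matrix of 0/1 values."""
    return [[1.0 if aa == ch else 0.0 for aa in ALPHABET] for ch in sequence]


def conv1d_global_max(x: Matrix, filters: List[Matrix]) -> List[float]:
    """Slide each K x 20 filter along the sequence, apply ReLU, and keep
    the maximum activation, giving one feature per filter."""
    features = []
    for kernel in filters:
        k = len(kernel)
        if len(x) < k:
            raise ValueError("sequence shorter than filter width")
        best = 0.0  # ReLU floor: activations below zero are clipped
        for start in range(len(x) - k + 1):
            act = sum(kernel[i][c] * x[start + i][c]
                      for i in range(k) for c in range(len(ALPHABET)))
            best = max(best, act)
        features.append(best)
    return features


# Hypothetical filters that respond to the motifs "AC" and "GG".
filters = [one_hot("AC"), one_hot("GG")]

short = conv1d_global_max(one_hot("MKACDE"), filters)
longer = conv1d_global_max(one_hot("ACGGACGGW"), filters)
print(short, longer)
```

Both inputs yield a vector with one entry per filter even though the sequences differ in length. In a trained network the filters would be learned rather than written by hand.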
Distances and classification of amino acids for different protein secondary structures
Window profiles of amino acids in protein sequences are taken as a
description of the amino acid environment. The relative entropy or
Kullback-Leibler distance derived from profiles is used as a measure of
dissimilarity for comparison of amino acids and secondary structure
conformations. Distance matrices of amino acid pairs at different conformations
are obtained, which display a non-negligible dependence of amino acid
similarity on conformations. Based on the conformation-specific distances,
a clustering analysis of amino acids is conducted.
Comment: 15 pages, 8 figures
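The relative entropy named in this abstract can be computed as in the sketch below. Summing the divergence over window positions in `profile_distance` is an assumption made for illustration; the abstract does not say how per-position values are combined or whether the measure is symmetrized.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Relative entropy D(p || q) in bits for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have equal length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf  # p has mass where q has none
            total += pi * math.log2(pi / qi)
    return total


def profile_distance(profile_a: Sequence[Sequence[float]],
                     profile_b: Sequence[Sequence[float]]) -> float:
    """Sum of per-position divergences across a window profile
    (assumed combination rule, not taken from the paper)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same window positions")
    return sum(kl_divergence(a, b) for a, b in zip(profile_a, profile_b))


# Toy two-letter alphabet, two window positions.
a = [[1.0, 0.0], [0.5, 0.5]]
b = [[0.5, 0.5], [0.5, 0.5]]
print(profile_distance(a, b))
```

Because relative entropy is not symmetric, `profile_distance(a, b)` and `profile_distance(b, a)` generally differ.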
Recurrent oligomers in proteins - an optimal scheme reconciling accurate and concise backbone representations in automated folding and design studies
A novel scheme is introduced to capture the spatial correlations of
consecutive amino acids in naturally occurring proteins. This knowledge-based
strategy is able to carry out optimally automated subdivisions of protein
fragments into classes of similarity. The goal is to provide the minimal set of
protein oligomers (termed "oligons" for brevity) that is able to represent
any other fragment. At variance with previous studies where recurrent local
motifs were classified, our concern is to provide simplified protein
representations that have been optimised for use in automated folding and/or
design attempts. In such contexts it is paramount to limit the number of
degrees of freedom per amino acid without incurring a loss of accuracy in
structural representations. The suggested method finds, by construction, the
optimal compromise between these needs. Several possible oligon lengths are
considered. It is shown that meaningful classifications cannot be done for
lengths greater than 6 or smaller than 4. Different contexts are considered
in which oligons of length 5 or 6 are recommended. With only a few dozen
oligons of such length, virtually any protein can be reproduced within typical
experimental uncertainties. Structural data for the oligons are made publicly
available.
Comment: 19 pages, 13 postscript figures
Potential function of simplified protein models for discriminating native proteins from decoys: Combining contact interaction and local sequence-dependent geometry
An effective potential function is critical for protein structure prediction
and folding simulation. For simplified models of proteins where coordinates of
only Cα atoms need to be specified, an accurate potential function is
important. Such a simplified model is essential for efficient search of
conformational space. In this work, we present a formulation of potential
function for simplified representations of protein structures. It is based on
the combination of descriptors derived from residue-residue contact and
sequence-dependent local geometry. The optimal weight coefficients for contact
and local geometry are obtained through optimization by maximizing margins between
native and decoy structures. The latter are generated by chain growth and by
gapless threading. The performance of the potential function in a blind test of
discriminating native protein structures from decoys is evaluated using several
benchmark decoy sets. This potential function has comparable or better
performance than several residue-based potential functions that additionally
require coordinates of side chain centers or coordinates of all side chain
atoms.
Comment: 4 pages, 2 figures, Accepted by 26th IEEE-EMBS Conference, San
Francisco
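The idea of fitting weights so that native structures score below decoys by a margin can be sketched with a simple perceptron-style update. This is a stand-in, not the optimizer used in the paper, which the abstract does not describe; the function names, learning rate, margin, and toy descriptor vectors are all hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (native, decoy) descriptors


def energy(w: Sequence[float], c: Sequence[float]) -> float:
    """Linear potential: weighted sum of descriptor values."""
    return sum(wi * ci for wi, ci in zip(w, c))


def train_weights(pairs: Sequence[Pair], n_features: int,
                  margin: float = 1.0, lr: float = 0.1,
                  epochs: int = 100) -> List[float]:
    """Adjust w until energy(decoy) - energy(native) >= margin for every
    pair, or until the epoch limit is reached."""
    w = [0.0] * n_features
    for _ in range(epochs):
        updated = False
        for native, decoy in pairs:
            diff = [d - n for n, d in zip(native, decoy)]
            gap = sum(wi * di for wi, di in zip(w, diff))
            if gap < margin:
                # Push the decoy energy up relative to the native.
                w = [wi + lr * di for wi, di in zip(w, diff)]
                updated = True
        if not updated:
            break
    return w


# Toy data: one native structure paired with two decoys (assumed values).
native = [4.0, 1.0]
pairs = [(native, [2.0, 1.0]), (native, [3.0, 3.0])]
w = train_weights(pairs, n_features=2)
print(w)
```

After training on this toy set, the native structure has lower energy than both decoys. A true maximum-margin formulation would instead solve a constrained optimization problem over all pairs at once.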
CLP-based protein fragment assembly
The paper investigates a novel approach, based on Constraint Logic
Programming (CLP), to predict the 3D conformation of a protein via fragments
assembly. The fragments are extracted from a database of known protein
structures by a preprocessor, also developed for this work, that clusters and
classifies the fragments according to similarity and frequency. The problem of assembling
fragments into a complete conformation is mapped to a constraint solving
problem and solved using CLP. The constraint-based model uses a medium
discretization degree Cα-side chain centroid protein model that offers
efficiency and a good approximation for space filling. The approach adapts
existing energy models to the protein representation used and applies a large
neighboring search strategy. The results show the feasibility and efficiency
of the method. The declarative nature of the solution allows future
extensions to be included, e.g., different size fragments for better accuracy.
Comment: special issue dedicated to ICLP 201