18 research outputs found
MRFalign: Protein Homology Detection through Alignment of Markov Random Fields
Sequence-based protein homology detection has been extensively studied and so
far the most sensitive method is based upon comparison of protein sequence
profiles, which are derived from multiple sequence alignment (MSA) of sequence
homologs in a protein family. A sequence profile is usually represented as a
position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and
accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This
paper presents a new homology detection method MRFalign, consisting of three
key components: 1) a Markov Random Fields (MRF) representation of a protein
family; 2) a scoring function measuring similarity of two MRFs; and 3) an
efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning
two MRFs. Compared to HMM that can only model very short-range residue
correlation, MRFs can model long-range residue interaction pattern and thus,
encode information for the global 3D structure of a protein family.
Consequently, MRF-MRF comparison for remote homology detection shall be much
more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that
MRFalign outperforms several popular HMM or PSSM-based methods in terms of both
alignment accuracy and remote homology detection and that MRFalign works
particularly well for mainly beta proteins. For example, tested on the
benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM
succeed on 48% and 52% of proteins, respectively, at superfamily level, and on
15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign
succeeds on 57.3% and 42.5% of proteins at superfamily and fold level,
respectively. This study implies that long-range residue interaction patterns
are very helpful for sequence-based homology detection. The software is
available for download at http://raptorx.uchicago.edu/download/.Comment: Accepted by both RECOMB 2014 and PLOS Computational Biolog
Cooperative "folding transition" in the sequence space facilitates function-driven evolution of protein families
In the protein sequence space, natural proteins form clusters of families
which are characterized by their unique native folds whereas the great majority
of random polypeptides are neither clustered nor foldable to unique structures.
Since a given polypeptide can be either foldable or unfoldable, a kind of
"folding transition" is expected at the boundary of a protein family in the
sequence space. By Monte Carlo simulations of a statistical mechanical model of
protein sequence alignment that coherently incorporates both short-range and
long-range interactions as well as variable-length insertions to reproduce the
statistics of the multiple sequence alignment of a given protein family, we
demonstrate the existence of such transition between natural-like sequences and
random sequences in the sequence subspaces for 15 domain families of various
folds. The transition was found to be highly cooperative and two-state-like.
Furthermore, enforcing or suppressing consensus residues on a few of the
well-conserved sites enhanced or diminished, respectively, the natural-like
pattern formation over the entire sequence. In most families, the key sites
included ligand binding sites. These results suggest some selective pressure on
the key residues, such as ligand binding activity, may cooperatively facilitate
the emergence of a protein family during evolution. From a more practical
aspect, the present results highlight an essential role of long-range effects
in precisely defining protein families, which are absent in conventional
sequence models.Comment: 13 pages, 7 figures, 2 tables (a new subsection added
Protein Fold Recognition from Sequences using Convolutional and Recurrent Neural Networks
The identification of a protein fold type from its amino acid sequence provides important insights about the protein 3D structure. In this paper, we propose a deep learning architecture that can process protein residue-level features to address the protein fold recognition task. Our neural network model combines 1D-convolutional layers with gated recurrent unit (GRU) layers. The GRU cells, as recurrent layers, cope with the processing issues associated to the highly variable protein sequence lengths and so extract a fold-related embedding of fixed size for each protein domain. These embeddings are then used to perform the pairwise fold recognition task, which is based on transferring the fold type of the most similar template structure. We compare our model with several template-based and deep learning-based methods from the state-of-the-art. The evaluation results over the well-known LINDAHL and SCOP_TEST sets,along with a proposed LINDAHL test set updated to SCOP 1.75, show that our embeddings perform significantly better than these methods, specially at the fold level. Supplementary material, source code and trained models are available at http://sigmat.ugr.es/~amelia/CNN-GRU-RF+/
DeepSF: deep convolutional neural network for mapping protein sequences to folds
Motivation
Protein fold recognition is an important problem in structural
bioinformatics. Almost all traditional fold recognition methods use sequence
(homology) comparison to indirectly predict the fold of a tar get protein based
on the fold of a template protein with known structure, which cannot explain
the relationship between sequence and fold. Only a few methods had been
developed to classify protein sequences into a small number of folds due to
methodological limitations, which are not generally useful in practice.
Results
We develop a deep 1D-convolution neural network (DeepSF) to directly classify
any protein se quence into one of 1195 known folds, which is useful for both
fold recognition and the study of se quence-structure relationship. Different
from traditional sequence alignment (comparison) based methods, our method
automatically extracts fold-related features from a protein sequence of any
length and map it to the fold space. We train and test our method on the
datasets curated from SCOP1.75, yielding a classification accuracy of 80.4%. On
the independent testing dataset curated from SCOP2.06, the classification
accuracy is 77.0%. We compare our method with a top profile profile alignment
method - HHSearch on hard template-based and template-free modeling targets of
CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is
14.5%-29.1% higher than HHSearch on template-free modeling targets and
4.5%-16.7% higher on hard template-based modeling targets for top 1, 5, and 10
predicted folds. The hidden features extracted from sequence by our method is
robust against sequence mutation, insertion, deletion and truncation, and can
be used for other protein pattern recognition problems such as protein
clustering, comparison and ranking.Comment: 28 pages, 13 figure
EigenTHREADER: analogous protein fold recognition by efficient contact map threading
Motivation: Protein fold recognition when appropriate, evolutionarily-related, structural templates can be identified is often trivial and may even be viewed as a solved problem. However in cases where no homologous structural templates can be detected, fold recognition is a notoriously difficult problem (Moult et al., 2014). Here we present EigenTHREADER, a novel fold recognition method capable of identifying folds where no homologous structures can be identified. EigenTHREADER takes a query amino acid sequence, generates a map of intra-residue contacts, and then searches a library of contact maps of known structures. To allow the contact maps to be compared, we use eigenvector decomposition to resolve the principal eigenvectors these can then be aligned using standard dynamic programming algorithms. The approach is similar to the Al-Eigen approach of Di Lena et al. (2010), but with improvements made both to speed and accuracy. With this search strategy, EigenTHREADER does not depend directly on sequence homology between the target protein and entries in the fold library to generate models. This in turn enables EigenTHREADER to correctly identify analogous folds where little or no sequence homology information is.
Results: EigenTHREADER outperforms well-established fold recognition methods such as pGenTHREADER and HHSearch in terms of True Positive Rate in the difficult task of analogous fold recognition. This should allow template-based modelling to be extended to many new protein families that were previously intractable to homology based fold recognition methods.
Availability and implementation: All code used to generate these results and the computational protocol can be downloaded from https://github.com/DanBuchan/eigen_scripts. EigenTHREADER, the benchmark code and the data this paper is based on can be downloaded from: http://bioinfadmin.cs.ucl.ac.uk/downloads/eigenTHREADER/
DeepSF: Deep convolutional neural network for mapping protein sequences to folds
Motivation Protein fold recognition is an important problem in structural bioinformatics. Almost all traditional fold recognition methods use sequence (homology) comparison to indirectly predict the fold of a target protein based on the fold of a template protein with known structure, which cannot explain the relationship between sequence and fold. Only a few methods had been developed to classify protein sequences into a small number of folds due to methodological limitations, which are not generally useful in practice. Results We develop a deep 1D-convolution neural network (DeepSF) to directly classify any protein sequence into one of 1195 known folds, which is useful for both fold recognition and the study of sequence-structure relationship. Different from traditional sequence alignment (comparison) based methods, our method automatically extracts fold-related features from a protein sequence of any length and maps it to the fold space. We train and test our method on the datasets curated from SCOP1.75, yielding an average classification accuracy of 75.3%. On the independent testing dataset curated from SCOP2.06, the classification accuracy is 73.0%. We compare our method with a top profile-profile alignment method -HHSearch on hard template-based and template-free modeling targets of CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is 12.63-26.32% higher than HHSearch on template-free modeling targets and 3.39-17.09% higher on hard template-based modeling targets for top 1, 5 and 10 predicted folds. The hidden features extracted from sequence by our method is robust against sequence mutation, insertion, deletion and truncation, and can be used for other protein pattern recognition problems such as protein clustering, comparison and ranking