Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks
Protein secondary structure prediction is an important problem in
bioinformatics. Inspired by the recent successes of deep neural networks, in
this paper, we propose an end-to-end deep network that predicts protein
secondary structures from integrated local and global contextual features. Our
deep architecture leverages convolutional neural networks with different kernel
sizes to extract multiscale local contextual features. In addition, considering
long-range dependencies existing in amino acid sequences, we set up a
bidirectional recurrent neural network consisting of gated recurrent units to capture
global contextual features. Furthermore, multi-task learning is utilized to
predict secondary structure labels and amino-acid solvent accessibility
simultaneously. Our proposed deep network demonstrates its effectiveness by
achieving state-of-the-art performance, i.e., 69.7% Q8 accuracy on the public
benchmark CB513, 76.9% Q8 accuracy on CASP10 and 73.1% Q8 accuracy on CASP11.
Our model and results are publicly available.
Comment: 8 pages, 3 figures. Accepted by the International Joint Conference on Artificial Intelligence (IJCAI).
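The multiscale convolutional stage described in the abstract can be illustrated with a minimal NumPy sketch: convolutions with several kernel sizes run over the same one-hot-encoded sequence and their outputs are concatenated into one local-context feature matrix. This is not the authors' implementation; the kernel sizes, random weights, and feature dimensions here are illustrative assumptions.

```python
import numpy as np

def one_hot(seq, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Encode an amino-acid sequence as an (L, 20) one-hot matrix."""
    idx = {aa: i for i, aa in enumerate(alphabet)}
    out = np.zeros((len(seq), len(alphabet)))
    for pos, aa in enumerate(seq):
        out[pos, idx[aa]] = 1.0
    return out

def conv1d(x, kernel):
    """'Same'-padded 1D convolution: x is (L, C_in), kernel is (k, C_in, C_out)."""
    k, c_in, c_out = kernel.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    L = x.shape[0]
    y = np.zeros((L, c_out))
    for i in range(L):
        window = xp[i:i + k]  # (k, C_in) slice centred on position i
        y[i] = np.tensordot(window, kernel, axes=([0, 1], [0, 1]))
    return y

def multiscale_features(x, kernel_sizes=(3, 7, 11), c_out=8, seed=0):
    """Concatenate conv outputs of several kernel sizes along the feature axis."""
    rng = np.random.default_rng(seed)
    feats = [conv1d(x, rng.normal(size=(k, x.shape[1], c_out)))
             for k in kernel_sizes]
    return np.concatenate(feats, axis=1)  # (L, len(kernel_sizes) * c_out)

x = one_hot("MKTAYIAKQR")          # toy 10-residue sequence
f = multiscale_features(x)
print(f.shape)  # (10, 24): length 10, 3 kernel sizes x 8 channels each
```

In the paper's pipeline, a feature matrix of this shape would then be fed to the bidirectional GRU stage to add global context before the final per-residue classification.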
PEvoLM: Protein Sequence Evolutionary Information Language Model
With the exponential increase of the protein sequence databases over time,
multiple-sequence alignment (MSA) methods, like PSI-BLAST, perform exhaustive
and time-consuming database search to retrieve evolutionary information. The
resulting position-specific scoring matrices (PSSMs) of such search engines
represent a crucial input to many machine learning (ML) models in the field of
bioinformatics and computational biology. A protein sequence is a collection of
contiguous tokens or characters called amino acids (AAs). The analogy to
natural language allowed us to exploit the recent advancements in the field of
Natural Language Processing (NLP) and therefore transfer NLP state-of-the-art
algorithms to bioinformatics. This research presents an Embedding Language Model (ELMo) that converts a protein sequence into a numerical vector representation. Whereas the original ELMo trained a 2-layer bidirectional Long Short-Term Memory (LSTM) network with a two-path architecture, one path for the forward pass and one for the backward pass, this work merges the idea of PSSMs with the concept of transfer learning and introduces a novel bidirectional language model (bi-LM) with four times fewer free parameters that uses a single path for both passes. The model was trained via multi-task learning to predict not only the next AA but also the probability distribution of the next AA derived from similar, yet different, sequences as summarized in a PSSM, and hence also learns evolutionary information of protein sequences. The network architecture and the pre-trained
model are made available as open source under the permissive MIT license on
GitHub at https://github.com/issararab/PEvoLM.
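The multi-task objective sketched in the abstract, predicting both the observed next AA and the PSSM-derived distribution over the next AA, can be expressed as a blend of a hard and a soft cross-entropy. A minimal NumPy sketch follows; the blending weight `alpha` and the uniform toy inputs are assumptions of ours, not details from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def multitask_loss(logits, next_aa, pssm_probs, alpha=0.5):
    """Blend the hard next-token objective with a soft PSSM target.

    logits     : (20,) raw scores for the next amino acid
    next_aa    : int, index of the amino acid that actually follows
    pssm_probs : (20,) probability distribution from the PSSM column
    alpha      : assumed weight of the hard target vs. the soft one
    """
    p = softmax(logits)
    hard = -np.log(p[next_aa])              # usual language-model objective
    soft = -np.sum(pssm_probs * np.log(p))  # cross-entropy vs. PSSM distribution
    return alpha * hard + (1.0 - alpha) * soft

# Toy check: uniform predictions against a uniform PSSM column.
loss = multitask_loss(np.zeros(20), next_aa=3, pssm_probs=np.full(20, 0.05))
print(round(loss, 4))  # 2.9957, i.e. log(20) for fully uniform inputs
```

Minimising the soft term pushes the model toward the evolutionary profile of the position rather than only the single residue observed in the training sequence.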
Protein Fold Recognition from Sequences using Convolutional and Recurrent Neural Networks
The identification of a protein fold type from its amino acid sequence provides important insights into the protein 3D structure. In this paper, we propose a deep learning architecture that can process protein residue-level features to address the protein fold recognition task. Our neural network model combines 1D-convolutional layers with gated recurrent unit (GRU) layers. The GRU cells, as recurrent layers, cope with the processing issues associated with the highly variable protein sequence lengths and so extract a fold-related embedding of fixed size for each protein domain. These embeddings are then used to perform the pairwise fold recognition task, which is based on transferring the fold type of the most similar template structure. We compare our model with several template-based and deep learning-based methods from the state of the art. The evaluation results over the well-known LINDAHL and SCOP_TEST sets, along with a proposed LINDAHL test set updated to SCOP 1.75, show that our embeddings perform significantly better than these methods, especially at the fold level. Supplementary material, source code and trained models are available at http://sigmat.ugr.es/~amelia/CNN-GRU-RF+/
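The role of the GRU layers, folding sequences of very different lengths into fixed-size embeddings that can then be compared pairwise, can be sketched with a bare-bones NumPy GRU cell. The hidden size, random weights, and cosine comparison below are illustrative assumptions, not the published model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: folds a variable-length sequence into a fixed vector."""
    def __init__(self, d_in, d_hid, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        # Weight pairs (input, recurrent) for update gate z, reset gate r,
        # and the candidate hidden state.
        self.Wz, self.Uz = rng.normal(0, s, (d_in, d_hid)), rng.normal(0, s, (d_hid, d_hid))
        self.Wr, self.Ur = rng.normal(0, s, (d_in, d_hid)), rng.normal(0, s, (d_hid, d_hid))
        self.Wh, self.Uh = rng.normal(0, s, (d_in, d_hid)), rng.normal(0, s, (d_hid, d_hid))
        self.d_hid = d_hid

    def embed(self, xs):
        """xs: (L, d_in) residue features; returns the final hidden state (d_hid,)."""
        h = np.zeros(self.d_hid)
        for x in xs:
            z = sigmoid(x @ self.Wz + h @ self.Uz)          # update gate
            r = sigmoid(x @ self.Wr + h @ self.Ur)          # reset gate
            h_cand = np.tanh(x @ self.Wh + (r * h) @ self.Uh)
            h = (1 - z) * h + z * h_cand
        return h

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cell = GRUCell(d_in=20, d_hid=16)
rng = np.random.default_rng(1)
emb_a = cell.embed(rng.normal(size=(50, 20)))   # toy 50-residue "domain"
emb_b = cell.embed(rng.normal(size=(120, 20)))  # toy 120-residue "domain"
print(emb_a.shape, emb_b.shape)  # both (16,), regardless of sequence length
```

Because both embeddings have the same size no matter how long the input domains are, a similarity score such as `cosine(emb_a, emb_b)` can rank candidate templates for the fold-transfer step.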
Deep learning methods for protein torsion angle prediction
Background: Deep learning is one of the most powerful machine learning methods and has achieved state-of-the-art performance in many domains. Since deep learning was introduced to the field of bioinformatics in 2012, it has achieved success in a number of areas such as protein residue-residue contact prediction, secondary structure prediction, and fold recognition. In this work, we developed deep learning methods to improve the prediction of torsion (dihedral) angles of proteins.
Results: We design four different deep learning architectures to predict protein torsion angles. The architectures include a deep neural network (DNN), a deep restricted Boltzmann machine (DRBM), a deep recurrent neural network (DRNN), and a deep recurrent restricted Boltzmann machine (DReRBM), since protein torsion angle prediction is a sequence-related problem. In addition to existing protein features, two new features (the predicted residue contact number and the error distribution of torsion angles extracted from sequence fragments) are used as input to each of the four deep learning architectures to predict the phi and psi angles of the protein backbone. The mean absolute error (MAE) of the phi and psi angles predicted by DRNN, DReRBM, DRBM and DNN is about 20-21° and 29-30° on an independent dataset. The MAE of the phi angle is comparable to that of existing methods, but the MAE of the psi angle, at 29°, is 2° lower than that of existing methods. On the latest CASP12 targets, our methods also achieved performance better than or comparable to a state-of-the-art method.
Conclusions: Our experiment demonstrates that deep learning is a valuable method for predicting protein torsion angles. The deep recurrent network architectures perform slightly better than the deep feed-forward architectures, and the predicted residue contact number and the error distribution of torsion angles extracted from sequence fragments are useful features for improving prediction accuracy.
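One detail worth making explicit when reproducing MAE figures like those above is that phi and psi are periodic angles: a prediction of -179° for a true angle of 179° is an error of 2°, not 358°. A small sketch of an angle-aware MAE, shown as a plausible way to evaluate such predictions rather than as the paper's own evaluation code:

```python
import numpy as np

def angular_mae(pred_deg, true_deg):
    """MAE between angles in degrees, wrapping differences into [-180, 180)."""
    diff = (np.asarray(pred_deg) - np.asarray(true_deg) + 180.0) % 360.0 - 180.0
    return float(np.mean(np.abs(diff)))

# Wrap-around case: naive subtraction would report a 358-degree error.
print(angular_mae([-179.0], [179.0]))  # 2.0
```

Without the wrap, errors near the ±180° boundary would grossly inflate the reported MAE for both phi and psi.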