1,009 research outputs found
Deep Robust Framework for Protein Function Prediction using Variable-Length Protein Sequences
Amino acid sequence portrays most intrinsic form of a protein and expresses
primary structure of protein. The order of amino acids in a sequence enables a
protein to acquire a particular stable conformation that is responsible for the
functions of the protein. This relationship between a sequence and its function
motivates the need to analyse the sequences for predicting protein functions.
Early generation computational methods using BLAST, FASTA, etc. perform
function transfer based on sequence similarity with existing databases and are
computationally slow. Although machine learning based approaches are fast, they
fail to perform well for long protein sequences (i.e., protein sequences with
more than 300 amino acid residues). In this paper, we introduce a novel method
for construction of two separate feature sets for protein sequences based on
analysis of 1) single fixed-sized segments and 2) multi-sized segments, using
bi-directional long short-term memory network. Further, model based on proposed
feature set is combined with the state of the art Multi-lable Linear
Discriminant Analysis (MLDA) features based model to improve the accuracy.
Extensive evaluations using separate datasets for biological processes and
molecular functions demonstrate promising results for both single-sized and
multi-sized segments based feature sets. While former showed an improvement of
+3.37% and +5.48%, the latter produces an improvement of +5.38% and +8.00%
respectively for two datasets over the state of the art MLDA based classifier.
After combining two models, there is a significant improvement of +7.41% and
+9.21% respectively for two datasets compared to MLDA based classifier.
Specifically, the proposed approach performed well for the long protein
sequences and superior overall performance
Protein Fold Recognition from Sequences using Convolutional and Recurrent Neural Networks
The identification of a protein fold type from its amino acid sequence provides important insights about the protein 3D structure. In this paper, we propose a deep learning architecture that can process protein residue-level features to address the protein fold recognition task. Our neural network model combines 1D-convolutional layers with gated recurrent unit (GRU) layers. The GRU cells, as recurrent layers, cope with the processing issues associated to the highly variable protein sequence lengths and so extract a fold-related embedding of fixed size for each protein domain. These embeddings are then used to perform the pairwise fold recognition task, which is based on transferring the fold type of the most similar template structure. We compare our model with several template-based and deep learning-based methods from the state-of-the-art. The evaluation results over the well-known LINDAHL and SCOP_TEST sets,along with a proposed LINDAHL test set updated to SCOP 1.75, show that our embeddings perform significantly better than these methods, specially at the fold level. Supplementary material, source code and trained models are available at http://sigmat.ugr.es/~amelia/CNN-GRU-RF+/
Deep Learning for Genomics: A Concise Overview
Advancements in genomic research such as high-throughput sequencing
techniques have driven modern genomic studies into "big data" disciplines. This
data explosion is constantly challenging conventional methods used in genomics.
In parallel with the urgent demand for robust algorithms, deep learning has
succeeded in a variety of fields such as vision, speech, and text processing.
Yet genomics entails unique challenges to deep learning since we are expecting
from deep learning a superhuman intelligence that explores beyond our knowledge
to interpret the genome. A powerful deep learning model should rely on
insightful utilization of task-specific knowledge. In this paper, we briefly
discuss the strengths of different deep learning models from a genomic
perspective so as to fit each particular task with a proper deep architecture,
and remark on practical considerations of developing modern deep learning
architectures for genomics. We also provide a concise review of deep learning
applications in various aspects of genomic research, as well as pointing out
potential opportunities and obstacles for future genomics applications.Comment: Invited chapter for Springer Book: Handbook of Deep Learning
Application
Contrastive learning on protein embeddings enlightens midnight zone
Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, has an improved ability to recognize distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that ascertained good performance comparable or better to existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments it is also orders of magnitudes faster. The code is available at https://github.com/Rostlab/EAT
- …