Prediction of 8-state protein secondary structures by a novel deep learning architecture
© 2018 The Author(s). Background: Protein secondary structure can be regarded as an information bridge linking the primary sequence and the tertiary structure. Accurate 8-state secondary structure prediction enables more precise, higher-resolution analysis of structure-based properties. Results: We present a novel deep learning architecture that exploits the synergy of a convolutional neural network, a residual network, and a bidirectional recurrent neural network to improve protein secondary structure prediction. A local block comprising convolutional filters and the original input is designed to capture local sequence features. The subsequent bidirectional recurrent neural network, consisting of gated recurrent units, captures global context features. Furthermore, the residual network improves the information flow between the hidden layers and the cascaded recurrent neural network. Our proposed deep network achieved 71.4% accuracy on the benchmark CB513 dataset for 8-state prediction, and ensemble learning with our model achieved 74% accuracy. The generalization capability of our model is also evaluated on three other independent datasets, CASP10, CASP11 and CASP12, for both 8- and 3-state prediction. These prediction performances are superior to those of state-of-the-art methods. Conclusion: Our experiments demonstrate that this is a valuable method for predicting protein secondary structure, and that capturing local and global features concurrently is very useful in deep learning.
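Evaluating both 8- and 3-state prediction, as this abstract does, usually relies on the standard reduction of the eight DSSP classes to three. A minimal sketch of that common convention (not code from the paper; some works assign the boundary classes differently):

```python
# Common DSSP Q8 -> Q3 reduction: helix classes to H, strand classes to E,
# everything else (turn/bend/coil) to C. Boundary assignments for G, I, B
# vary between papers; this is one widely used convention.
Q8_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",   # alpha-, 3-10-, pi-helix
    "E": "E", "B": "E",             # strand, isolated beta-bridge
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil
}

def q8_to_q3(q8_string: str) -> str:
    """Collapse an 8-state secondary-structure string to 3 states."""
    return "".join(Q8_TO_Q3[s] for s in q8_string)
```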
Protein Fold Recognition from Sequences using Convolutional and Recurrent Neural Networks
The identification of a protein fold type from its amino acid sequence provides important insights about the protein 3D structure. In this paper, we propose a deep learning architecture that can process protein residue-level features to address the protein fold recognition task. Our neural network model combines 1D-convolutional layers with gated recurrent unit (GRU) layers. The GRU cells, as recurrent layers, cope with the processing issues associated with the highly variable protein sequence lengths and so extract a fold-related embedding of fixed size for each protein domain. These embeddings are then used to perform the pairwise fold recognition task, which is based on transferring the fold type of the most similar template structure. We compare our model with several template-based and deep learning-based methods from the state-of-the-art. The evaluation results over the well-known LINDAHL and SCOP_TEST sets, along with a proposed LINDAHL test set updated to SCOP 1.75, show that our embeddings perform significantly better than these methods, especially at the fold level. Supplementary material, source code and trained models are available at http://sigmat.ugr.es/~amelia/CNN-GRU-RF+/
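The pairwise recognition step described here, transferring the fold type of the most similar template, amounts to a nearest-neighbour search over the fixed-size embeddings. A minimal cosine-similarity illustration; the function and argument names are hypothetical, not from the released code:

```python
import numpy as np

def predict_fold(query_emb, template_embs, template_folds):
    """Assign the fold label of the template whose embedding is most
    cosine-similar to the query embedding.

    query_emb:      (d,) embedding of the query domain
    template_embs:  (n, d) embeddings of the template domains
    template_folds: list of n fold labels, one per template
    """
    q = query_emb / np.linalg.norm(query_emb)
    T = template_embs / np.linalg.norm(template_embs, axis=1, keepdims=True)
    sims = T @ q                      # cosine similarities to every template
    return template_folds[int(np.argmax(sims))]
```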
Applying Deep Reinforcement Learning to the HP Model for Protein Structure Prediction
A central problem in computational biophysics is protein structure
prediction, i.e., finding the optimal folding of a given amino acid sequence.
This problem has been studied in a classical abstract model, the HP model,
where the protein is modeled as a sequence of H (hydrophobic) and P (polar)
amino acids on a lattice. The objective is to find conformations maximizing H-H
contacts. It is known that even in this reduced setting, the problem is
intractable (NP-hard). In this work, we apply deep reinforcement learning (DRL)
to the two-dimensional HP model. Our approach obtains conformations with the
best known energies for benchmark HP sequences of lengths from 20 to 50. Our DRL is
based on a deep Q-network (DQN). We find that a DQN based on long short-term
memory (LSTM) architecture greatly enhances the RL learning ability and
significantly improves the search process. DRL can sample the state space
efficiently, without the need of manual heuristics. Experimentally we show that
it can find multiple distinct best-known solutions per trial. This study
demonstrates the effectiveness of deep reinforcement learning in the HP model
for protein folding.

Comment: Published at Physica A: Statistical Mechanics and its Applications,
available online 7 December 2022. Extended abstract accepted by the Machine
Learning and the Physical Sciences workshop, NeurIPS 202
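The HP-model objective above, maximising H-H contacts on the lattice, can be stated as a simple energy function over a conformation's coordinates. A minimal 2D illustration (not part of the paper's DRL code):

```python
def hp_energy(sequence, coords):
    """Energy of a 2D HP conformation: -1 per H-H contact, i.e. per pair of
    H residues that are lattice neighbours but not consecutive in the chain.

    sequence: string over {H, P}
    coords:   list of (x, y) lattice positions, one per residue
    """
    assert len(sequence) == len(coords)
    contacts = 0
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):       # skip chain neighbours
            if sequence[i] == "H" == sequence[j]:
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:   # lattice adjacency
                    contacts += 1
    return -contacts
```

A DRL agent in this setting chooses the next lattice move at each step and receives this energy (or its change) as the reward signal.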
Deep Learning for Genomics: A Concise Overview
Advancements in genomic research such as high-throughput sequencing
techniques have driven modern genomic studies into "big data" disciplines. This
data explosion is constantly challenging conventional methods used in genomics.
In parallel with the urgent demand for robust algorithms, deep learning has
succeeded in a variety of fields such as vision, speech, and text processing.
Yet genomics poses unique challenges for deep learning, since we expect
deep learning to deliver superhuman intelligence that explores beyond our
knowledge to interpret the genome. A powerful deep learning model should rely on
the insightful utilization of task-specific knowledge. In this paper, we briefly
discuss the strengths of different deep learning models from a genomic
perspective so as to fit each particular task with a proper deep architecture,
and remark on practical considerations of developing modern deep learning
architectures for genomics. We also provide a concise review of deep learning
applications in various aspects of genomic research, and point out
potential opportunities and obstacles for future genomics applications.

Comment: Invited chapter for Springer Book: Handbook of Deep Learning Application
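As a concrete example of preparing genomic data for the deep architectures this overview surveys, sequence models typically start from a one-hot encoding of the nucleotides. A minimal sketch (illustrative only, not from the chapter):

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """One-hot encode a DNA sequence into a (len, 4) float array.
    Ambiguous bases (e.g. N) become all-zero rows."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out
```

The resulting (length, 4) matrix is the usual input to convolutional layers scanning for sequence motifs.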
DeepsmirUD: Prediction of Regulatory Effects on microRNA Expression Mediated by Small Molecules Using Deep Learning
Aberrant miRNA expression has been associated with a large number of human diseases. Therefore, targeting miRNAs to regulate their expression levels has become an important therapy against diseases that stem from the dysfunction of pathways regulated by miRNAs. In recent years, small molecules have demonstrated enormous potential as drugs to regulate miRNA expression (i.e., SM-miR). A clear understanding of the mechanism of action of small molecules on the upregulation and downregulation of miRNA expression allows precise diagnosis and treatment of oncogenic pathways. However, outside of a slow and costly process of experimental determination, computational strategies to assist this on an ad hoc basis have yet to be formulated. In this work, we developed, to the best of our knowledge, the first cross-platform prediction tool, DeepsmirUD, to infer small-molecule-mediated regulatory effects on miRNA expression (i.e., upregulation or downregulation). This method is powered by 12 cutting-edge deep-learning frameworks and achieved AUC values of 0.843/0.984 and AUCPR values of 0.866/0.992 on two independent test datasets. With a complementarily constructed network inference approach based on similarity, we report a significantly improved accuracy of 0.813 in determining the regulatory effects of nearly 650 associated SM-miR relations, each formed with either a novel small molecule or a novel miRNA. By further integrating miRNA–cancer relationships, we established a database of potential pharmaceutical drugs from 1343 small molecules for 107 cancer diseases to understand the drug mechanisms of action and offer novel insight into drug repositioning. Furthermore, we have employed DeepsmirUD to predict the regulatory effects of a large number of high-confidence associated SM-miR relations. Taken together, our method shows promise to accelerate the development of potential miRNA targets and small-molecule drugs.
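The AUC values reported above measure ranking quality: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. A minimal sketch of that rank statistic, for illustration only (production code would use an O(n log n) library routine):

```python
def auc(scores_pos, scores_neg):
    """ROC AUC as the probability that a random positive scores higher
    than a random negative; ties count one half. O(n*m) pairwise form,
    fine for small illustrative inputs."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```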
PEvoLM: Protein Sequence Evolutionary Information Language Model
With the exponential increase of the protein sequence databases over time,
multiple-sequence alignment (MSA) methods, like PSI-BLAST, perform exhaustive
and time-consuming database search to retrieve evolutionary information. The
resulting position-specific scoring matrices (PSSMs) of such search engines
represent a crucial input to many machine learning (ML) models in the field of
bioinformatics and computational biology. A protein sequence is a collection of
contiguous tokens or characters called amino acids (AAs). The analogy to
natural language allowed us to exploit the recent advancements in the field of
Natural Language Processing (NLP) and therefore transfer NLP state-of-the-art
algorithms to bioinformatics. This research presents an Embedding Language
Model (ELMo), converting a protein sequence to a numerical vector
representation. The original ELMo trained a 2-layer bidirectional Long
Short-Term Memory (LSTM) network following a two-path architecture, one path for
the forward pass and a second for the backward pass. By merging the idea of
PSSMs with the concept of transfer learning, this work instead introduces a
novel bidirectional language model (bi-LM) with four times fewer free parameters
that uses a single path for both passes. The model was trained simultaneously,
in a multi-task setting, to predict not only the next AA but also the
probability distribution of the next AA derived from similar yet different
sequences, as summarized in a PSSM, hence also learning the evolutionary
information of protein sequences. The network architecture and the pre-trained
model are made available as open source under the permissive MIT license on
GitHub at https://github.com/issararab/PEvoLM.
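The multi-task objective described above, predicting the next AA jointly with the PSSM-derived distribution, can be sketched as a combined loss at each position. A NumPy illustration under assumed details (single position, unit weighting, KL divergence as the distribution-matching term; the paper may formulate or weight the terms differently):

```python
import numpy as np

def multitask_loss(logits, next_aa, pssm_probs):
    """Sketch of a two-headed LM loss at one sequence position:
    cross-entropy against the observed next amino acid, plus KL divergence
    from the model's distribution to the PSSM-derived distribution.

    logits:     (20,) unnormalized scores over amino acids
    next_aa:    integer index of the observed next amino acid
    pssm_probs: (20,) target probability distribution from the PSSM
    """
    z = logits - logits.max()            # numerically stable softmax
    p = np.exp(z)
    p /= p.sum()
    ce = -np.log(p[next_aa])             # next-token cross-entropy
    eps = 1e-12
    kl = float(np.sum(pssm_probs * np.log((pssm_probs + eps) / (p + eps))))
    return float(ce + kl)
```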
Development of a deep learning-based computational framework for the classification of protein sequences
Master's dissertation in Bioinformatics.

Proteins are among the most important biological structures in living organisms, since they
perform multiple biological functions. Each protein has different characteristics and properties,
which can be employed in many industries, such as industrial biotechnology and clinical applications,
among others, with a positive impact.
Modern high-throughput methods allow protein sequencing, which provides the protein
sequence data. Machine learning methodologies are applied to characterize proteins using
information of the protein sequence. However, a major problem associated with this method
is how to properly encode the protein sequences without losing the biological relationship
between the amino acid residues. The transformation of the protein sequence into a numeric
representation is done by encoder methods. In this sense, the main objective of this project is to
study different encoders and identify the methods which yield the best biological representation
of the protein sequences, when used in machine learning (ML) models to predict different labels
related to their function.
The methods were analyzed in two study cases. The first is related to enzymes, since
they are a well-established case in the literature. The second used transporter sequences, a
lesser studied case in the literature. In both cases, the data was collected from the curated
database Swiss-Prot. The encoders that were tested include: calculated protein descriptors;
matrix substitution methods; position-specific scoring matrices; and encoding by pre-trained
transformer methods. The use of state-of-the-art pretrained transformers to encode protein
sequences proved to be a good biological representation for subsequent application in state-of-the-art ML methods. Namely, the ESM-1b transformer achieved a Matthews correlation coefficient
above 0.9 for every multi-class classification task of the transporter classification system.
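The Matthews correlation coefficient used to evaluate the transporter classifiers is computed from the confusion counts. A minimal binary-case sketch (the multi-class tasks mentioned above use the same idea generalised to a full confusion matrix):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts.
    Ranges from -1 (total disagreement) through 0 (chance) to +1 (perfect);
    returns 0.0 when any marginal is empty, a common convention."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```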
Quantitative approaches for decoding the specificity of the human T cell repertoire
T cell receptor (TCR)-peptide-major histocompatibility complex (pMHC) interactions play a vital role in initiating immune responses against pathogens, and the specificity of TCR-pMHC interactions is crucial for developing optimized therapeutic strategies. The advent of high-throughput immunological and structural evaluation of TCR and pMHC has provided an abundance of data for computational approaches that aim to predict favorable TCR-pMHC interactions. Current models are constructed using information on protein sequence, structures, or a combination of both, and utilize a variety of statistical learning-based approaches for identifying the rules governing specificity. This review examines the current theoretical, computational, and deep learning approaches for identifying TCR-pMHC recognition pairs, placing emphasis on each method's mathematical approach, predictive performance, and limitations.
Protein language representation learning to predict SARS-CoV-2 mutational landscape
With the global proliferation of the SARS-CoV-2 pandemic, numerous variants have been emerging daily, with distinct transmission and infection rates, risks, and impact on evasion of antibody neutralisation. Early discovery of high-risk mutations is critical for data-informed therapeutic design decisions and effective pandemic management. This dissertation explores the application of language models, commonly used for textual processing, to decipher SARS-CoV-2 spike protein sequences, which are an amalgamation of amino acids represented as alphabets. Deep protein language models are revolutionising protein biology, and this work introduces two novel models: the transformer encoder-based, sequence-only CoVBERT for predicting point mutations, and MuFormer, which leverages the sequence and structural space to design mutational protein sequences iteratively. CoVBERT has been able to predict highly transmissible mutations, including D614G, with a masked marginal log likelihood of 0.95, surpassing state-of-the-art large protein language models. This reflects the ability of large language models to capture in vitro mutagenesis by learning the language of evolution.
MuFormer is capable of generating de novo protein sequences, using AlphaFold2 for fixed-backbone design, and curates evolutionarily novel mutational sequences by injecting representations derived from state-of-the-art protein language models. The generated mutational sequences have been validated against historical data, which exemplified the ability of MuFormer to capture phylogenetic properties, generating mutations such as the Omicron and Delta variants given the Alpha variant as input. MuFormer conditions not only on the sequence but also on the structure, generating protein sequences and structures end to end and optimising with two strategies: fixed-backbone design (MuFormer-fixbb) and backbone-atom optimisation (MuFormer-bba). Both variants of MuFormer outperformed AlphaFold2 on the mutational sequence generation task across several structure and sequence likelihood metrics. These models attest to the potential of large language models, termed foundation models, for learning the representational language of biology, which can assist in controlling pandemics by predicting mutations with higher infectivity in advance.
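Masked-marginal scoring of a point mutation, as mentioned for CoVBERT, is commonly implemented by masking the mutated position and comparing the model's log-probabilities of the mutant and wild-type residues there. A sketch under that assumption (names hypothetical; the dissertation's exact scoring may differ):

```python
import numpy as np

def mutation_score(logits_at_masked_pos, wt_idx, mut_idx):
    """Masked-marginal score of a point mutation:
    log P(mutant) - log P(wild type) under the model's softmax distribution
    at the masked position. Positive values mean the model prefers the
    mutant residue; logits_at_masked_pos would come from a masked LM."""
    z = logits_at_masked_pos - logits_at_masked_pos.max()   # stable log-softmax
    log_p = z - np.log(np.exp(z).sum())
    return float(log_p[mut_idx] - log_p[wt_idx])
```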