69 research outputs found

    Prediction of 8-state protein secondary structures by a novel deep learning architecture

    Full text link
    Background: Protein secondary structure can be regarded as an information bridge that links the primary sequence and the tertiary structure. Accurate 8-state secondary structure prediction enables more precise and higher-resolution analysis of structure-based properties. Results: We present a novel deep learning architecture that integrates a convolutional neural network, a residual network, and a bidirectional recurrent neural network to improve the performance of protein secondary structure prediction. A local block composed of convolutional filters and the original input is designed to capture local sequence features. The subsequent bidirectional recurrent neural network, consisting of gated recurrent units, captures global context features. Furthermore, the residual network improves the information flow between the hidden layers and the cascaded recurrent neural network. Our proposed deep network achieved 71.4% accuracy on the benchmark CB513 dataset for 8-state prediction, and ensemble learning with our model achieved 74% accuracy. The generalization capability of our model is also evaluated on three other independent datasets, CASP10, CASP11 and CASP12, for both 8- and 3-state prediction. These prediction performances are superior to those of state-of-the-art methods. Conclusion: Our experiments demonstrate that this is a valuable method for predicting protein secondary structure, and that capturing local and global features concurrently is very useful in deep learning
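
    As an illustration of the kind of architecture described above, here is a minimal PyTorch sketch (not the authors' code; the 42-dimensional input features, e.g. one-hot encoding plus PSSM, and all layer widths are assumptions): a convolutional local block with a residual shortcut feeding a bidirectional GRU, followed by a per-residue 8-state classifier.

    import torch
    import torch.nn as nn

    class SS8Net(nn.Module):
        def __init__(self, in_dim=42, conv_dim=64, rnn_dim=128, n_states=8):
            super().__init__()
            # Local block: stacked 1D convolutions capture local sequence features.
            self.local = nn.Sequential(
                nn.Conv1d(in_dim, conv_dim, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(conv_dim, conv_dim, kernel_size=3, padding=1), nn.ReLU(),
            )
            # A 1x1 convolution lets the residual shortcut match the channel width.
            self.shortcut = nn.Conv1d(in_dim, conv_dim, kernel_size=1)
            # Bidirectional GRU captures global context along the whole sequence.
            self.rnn = nn.GRU(conv_dim, rnn_dim, num_layers=2,
                              batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * rnn_dim, n_states)

        def forward(self, x):                      # x: (batch, length, in_dim)
            h = x.transpose(1, 2)                  # Conv1d expects (batch, channels, length)
            h = self.local(h) + self.shortcut(h)   # residual connection around the local block
            h = h.transpose(1, 2)
            h, _ = self.rnn(h)
            return self.out(h)                     # (batch, length, 8) per-residue logits

    logits = SS8Net()(torch.randn(2, 100, 42))     # toy batch: 2 sequences of length 100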

    Protein Fold Recognition from Sequences using Convolutional and Recurrent Neural Networks

    Get PDF
    The identification of a protein fold type from its amino acid sequence provides important insights about the protein 3D structure. In this paper, we propose a deep learning architecture that can process protein residue-level features to address the protein fold recognition task. Our neural network model combines 1D-convolutional layers with gated recurrent unit (GRU) layers. The GRU cells, as recurrent layers, cope with the processing issues associated with highly variable protein sequence lengths and extract a fixed-size fold-related embedding for each protein domain. These embeddings are then used to perform the pairwise fold recognition task, which is based on transferring the fold type of the most similar template structure. We compare our model with several state-of-the-art template-based and deep learning-based methods. The evaluation results on the well-known LINDAHL and SCOP_TEST sets, along with a proposed LINDAHL test set updated to SCOP 1.75, show that our embeddings perform significantly better than these methods, especially at the fold level. Supplementary material, source code and trained models are available at http://sigmat.ugr.es/~amelia/CNN-GRU-RF+/
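
    The pairwise recognition step described above can be pictured with the following hedged sketch (layer sizes and feature dimensions are assumptions, not the published CNN-GRU-RF+ model): 1D convolutions over residue-level features, a GRU that collapses the variable-length sequence into a fixed-size fold embedding, and a cosine similarity that transfers the fold label of the closest template.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FoldEmbedder(nn.Module):
        def __init__(self, in_dim=45, conv_dim=64, emb_dim=128):
            super().__init__()
            # 1D convolution over residue-level input features.
            self.conv = nn.Sequential(
                nn.Conv1d(in_dim, conv_dim, kernel_size=5, padding=2), nn.ReLU(),
            )
            # Bidirectional GRU handles variable sequence lengths.
            self.gru = nn.GRU(conv_dim, emb_dim, batch_first=True, bidirectional=True)

        def forward(self, x):                          # x: (batch, length, in_dim)
            h = self.conv(x.transpose(1, 2)).transpose(1, 2)
            _, h_n = self.gru(h)                       # final hidden state of each direction
            return torch.cat([h_n[0], h_n[1]], dim=-1) # fixed-size embedding per domain

    def predict_fold(query_emb, template_embs, template_folds):
        """Transfer the fold type of the most similar template embedding."""
        sims = F.cosine_similarity(query_emb.unsqueeze(0), template_embs, dim=-1)
        return template_folds[int(sims.argmax())]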

    Applying Deep Reinforcement Learning to the HP Model for Protein Structure Prediction

    Full text link
    A central problem in computational biophysics is protein structure prediction, i.e., finding the optimal folding of a given amino acid sequence. This problem has been studied in a classical abstract model, the HP model, where the protein is modeled as a sequence of H (hydrophobic) and P (polar) amino acids on a lattice. The objective is to find conformations maximizing H-H contacts. It is known that even in this reduced setting the problem is intractable (NP-hard). In this work, we apply deep reinforcement learning (DRL) to the two-dimensional HP model. We obtain conformations with the best known energies for benchmark HP sequences with lengths from 20 to 50. Our DRL approach is based on a deep Q-network (DQN). We find that a DQN built on a long short-term memory (LSTM) architecture greatly enhances the agent's learning ability and significantly improves the search process. DRL can sample the state space efficiently, without the need for manual heuristics. Experimentally, we show that it can find multiple distinct best-known solutions per trial. This study demonstrates the effectiveness of deep reinforcement learning in the HP model for protein folding. Comment: Published in Physica A: Statistical Mechanics and its Applications, available online 7 December 2022. Extended abstract accepted by the Machine Learning and the Physical Sciences workshop, NeurIPS 202
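
    A minimal sketch of such an LSTM-based Q-network is shown below (the three relative actions left/forward/right and the state encoding are assumptions for illustration, not the paper's exact setup): the LSTM reads the sequence of placements made so far and the Q-head scores the next placement direction, with standard epsilon-greedy exploration.

    import torch
    import torch.nn as nn

    class LSTMQNet(nn.Module):
        def __init__(self, in_dim=4, hidden=128, n_actions=3):
            super().__init__()
            # The LSTM summarizes the partially placed H/P chain.
            self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
            self.q_head = nn.Linear(hidden, n_actions)

        def forward(self, state_seq):                  # (batch, steps, in_dim)
            out, _ = self.lstm(state_seq)
            return self.q_head(out[:, -1])             # Q-values for left/forward/right

    def epsilon_greedy(q_values, eps=0.1):
        """Standard DQN exploration: random action with probability eps."""
        if torch.rand(1).item() < eps:
            return int(torch.randint(q_values.shape[-1], (1,)).item())
        return int(q_values.argmax())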

    Deep Learning for Genomics: A Concise Overview

    Full text link
    Advancements in genomic research, such as high-throughput sequencing techniques, have driven modern genomic studies into "big data" disciplines. This data explosion constantly challenges the conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics poses unique challenges for deep learning, since we expect deep learning to provide a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful use of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep architecture, and remark on practical considerations for developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out potential opportunities and obstacles for future genomics applications. Comment: Invited chapter for the Springer book Handbook of Deep Learning Application

    PEvoLM: Protein Sequence Evolutionary Information Language Model

    Full text link
    With the exponential growth of protein sequence databases over time, multiple-sequence alignment (MSA) methods such as PSI-BLAST perform exhaustive and time-consuming database searches to retrieve evolutionary information. The resulting position-specific scoring matrices (PSSMs) of such search engines represent a crucial input to many machine learning (ML) models in bioinformatics and computational biology. A protein sequence is a collection of contiguous tokens or characters called amino acids (AAs). The analogy to natural language allows us to exploit recent advancements in Natural Language Processing (NLP) and transfer state-of-the-art NLP algorithms to bioinformatics. This research presents an Embedding Language Model (ELMo)-based approach, PEvoLM, which converts a protein sequence into a numerical vector representation. While the original ELMo trained a two-layer bidirectional Long Short-Term Memory (LSTM) network following a two-path architecture, one path for the forward pass and a second for the backward pass, this work merges the idea of PSSMs with the concept of transfer learning and introduces a novel bidirectional language model (bi-LM) with four times fewer free parameters that uses a single path for both passes. The model was trained to predict not only the next AA but also, simultaneously in a multi-task setting, the probability distribution of the next AA derived from similar yet different sequences as summarized in a PSSM, hence also learning evolutionary information of protein sequences. The network architecture and the pre-trained model are made available as open source under the permissive MIT license on GitHub at https://github.com/issararab/PEvoLM.
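
    The multi-task training objective can be illustrated with the following simplified, single-direction sketch (layer sizes are assumptions and this is not the released PEvoLM implementation): a shared LSTM encoder with two output heads, one predicting the next amino acid and one predicting the PSSM-derived probability distribution at the next position.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    N_AA = 20

    class BiLMSketch(nn.Module):
        def __init__(self, emb=64, hidden=256):
            super().__init__()
            self.embed = nn.Embedding(N_AA + 2, emb)     # amino acids plus BOS/EOS tokens
            self.lstm = nn.LSTM(emb, hidden, num_layers=2, batch_first=True)
            self.next_aa = nn.Linear(hidden, N_AA)       # task 1: next amino acid
            self.pssm_dist = nn.Linear(hidden, N_AA)     # task 2: PSSM probability profile

        def forward(self, tokens):                       # tokens: (batch, length)
            h, _ = self.lstm(self.embed(tokens))
            return self.next_aa(h), self.pssm_dist(h)

    def multitask_loss(aa_logits, pssm_logits, next_tokens, pssm_targets):
        """Cross-entropy on the next residue plus KL divergence to the PSSM profile."""
        ce = F.cross_entropy(aa_logits.transpose(1, 2), next_tokens)
        kl = F.kl_div(F.log_softmax(pssm_logits, dim=-1), pssm_targets,
                      reduction="batchmean")
        return ce + kl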

    Development of a deep learning-based computational framework for the classification of protein sequences

    Get PDF
    Master's dissertation in Bioinformatics. Proteins are among the most important biological structures in living organisms, since they perform multiple biological functions. Each protein has different characteristics and properties, which can be employed in many industries, such as industrial biotechnology and clinical applications, with a positive impact. Modern high-throughput methods allow protein sequencing, which provides protein sequence data. Machine learning methodologies are applied to characterize proteins using information from the protein sequence. However, a major problem associated with this approach is how to properly encode protein sequences without losing the biological relationship between the amino acid residues. The transformation of a protein sequence into a numeric representation is done by encoder methods. The main objective of this project is therefore to study different encoders and identify the methods that yield the best biological representation of protein sequences when used in machine learning (ML) models to predict different labels related to their function. The methods were analyzed in two case studies. The first concerns enzymes, since they are a well-established case in the literature. The second uses transporter sequences, a less studied case in the literature. In both cases, the data was collected from the curated database Swiss-Prot. The encoders that were tested include: calculated protein descriptors; substitution matrix methods; position-specific scoring matrices; and encoding by pre-trained transformer methods. The use of state-of-the-art pre-trained transformers to encode protein sequences proved to be a good biological representation for subsequent application in state-of-the-art ML methods. Namely, the ESM-1b transformer achieved a Matthews correlation coefficient above 0.9 for any multiclassification task of the transporter classification system.
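
    A hedged sketch of the evaluation setup described above (hypothetical variable names and random stand-in data, not the dissertation's pipeline): protein sequences are first converted to fixed-length numerical vectors by an encoder such as a pre-trained transformer, then a standard ML classifier is trained and scored with the Matthews correlation coefficient.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import matthews_corrcoef
    from sklearn.model_selection import train_test_split

    # X: (n_proteins, embedding_dim) encoder output; y: transporter class labels.
    # Random data stands in for real Swiss-Prot derived features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 1280))        # 1280 is the ESM-1b embedding width
    y = rng.integers(0, 5, size=500)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print("MCC:", matthews_corrcoef(y_te, clf.predict(X_te)))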

    Quantitative approaches for decoding the specificity of the human T cell repertoire

    Get PDF
    T cell receptor (TCR)-peptide-major histocompatibility complex (pMHC) interactions play a vital role in initiating immune responses against pathogens, and the specificity of TCR-pMHC interactions is crucial for developing optimized therapeutic strategies. The advent of high-throughput immunological and structural evaluation of TCR and pMHC has provided an abundance of data for computational approaches that aim to predict favorable TCR-pMHC interactions. Current models are constructed using information on protein sequence, structure, or a combination of both, and utilize a variety of statistical learning-based approaches for identifying the rules governing specificity. This review examines the current theoretical, computational, and deep learning approaches for identifying TCR-pMHC recognition pairs, placing emphasis on each method’s mathematical approach, predictive performance, and limitations

    Structural Property Prediction

    Full text link
    While many good textbooks are available on Protein Structure, Molecular Simulations, Thermodynamics and Bioinformatics methods in general, there is no good introductory-level book for the field of Structural Bioinformatics. This book aims to give an introduction to Structural Bioinformatics, which is where the previous topics meet to explore three-dimensional protein structures through computational analysis. We provide an overview of existing computational techniques to validate, simulate, predict and analyse protein structures and, more importantly, aim to provide practical knowledge about how and when to use such techniques. We consider proteins from three major vantage points: protein structure quantification, protein structure prediction, and protein simulation & dynamics. Some structural properties of proteins that are closely linked to their function may be easier (or much faster) to predict from sequence than the complete tertiary structure; examples include secondary structure, surface accessibility, flexibility, disorder, interface regions and hydrophobic patches. Serving as building blocks for the native protein fold, these structural properties also contain important structural and functional information not apparent from the amino acid sequence. Here, we first give an introduction to the application of machine learning for structural property prediction and explain the concepts of cross-validation and benchmarking. Next, we review various methods that build on these concepts to predict structural properties such as secondary structure, surface accessibility, disorder and flexibility, and aggregation. Comment: editorial responsibility: Juami H. M. van Gils, K. Anton Feenstra, Sanne Abeln. This chapter is part of the book "Introduction to Protein Structural Bioinformatics". The Preface arXiv:1801.09442 contains links to all the (published) chapter
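
    A small illustrative sketch of the kind of machine-learning setup the chapter introduces (toy data and an assumed window size, not a method from the book): residue-level structural property prediction, here a binary exposed-versus-buried label, from a sliding window of sequence features, evaluated with cross-validation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    window = 15                               # residues per sliding window
    X = rng.normal(size=(1000, window * 20))  # 20-dimensional encoding per window position
    y = rng.integers(0, 2, size=1000)         # toy surface-accessibility labels

    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print("5-fold cross-validation accuracy:", scores.mean())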