12 research outputs found

    DEEPred: Automated Protein Function Prediction with Multi-task Feed-forward Deep Neural Networks

    Automated protein function prediction is critical for the annotation of uncharacterized protein sequences, where accurate prediction methods are still required. Recently, deep learning-based methods have outperformed conventional algorithms in computer vision and natural language processing due to the prevention of overfitting and efficient training. Here, we propose DEEPred, a hierarchical stack of multi-task feed-forward deep neural networks, as a solution to Gene Ontology (GO) based protein function prediction. DEEPred was optimized through rigorous hyper-parameter tests, and benchmarked using three types of protein descriptors, training datasets of varying sizes, and GO terms from different levels. Furthermore, in order to explore how training with larger but potentially noisy data would change the performance, electronically made GO annotations were also included in the training process. The overall predictive performance of DEEPred was assessed using the CAFA2 and CAFA3 challenge datasets, in comparison with state-of-the-art protein function prediction methods. Finally, we evaluated selected novel annotations produced by DEEPred with a literature-based case study considering the 'biofilm formation process' in Pseudomonas aeruginosa. This study reports that deep learning algorithms have significant potential in protein function prediction, particularly when the source data is large. The neural network architecture of DEEPred can also be applied to the prediction of other types of ontological associations. The source code and all datasets used in this study are available at: https://github.com/cansyl/DEEPred

    Functional Performance of Plant Proteins

    Increasingly, consumers are moving towards a more plant-based diet. However, some consumers are avoiding common plant proteins such as soy and gluten due to their potential allergenicity. Therefore, alternative protein sources are being explored as functional ingredients in foods, including pea, chickpea, and other legume proteins. The factors affecting the functional performance of plant proteins are outlined, including cultivars, genotypes, extraction and drying methods, protein level, and preparation methods (commercial versus laboratory). Current methods to characterize protein functionality are highlighted, including water and oil holding capacity, protein solubility, emulsifying, foaming, and gelling properties. We propose a series of analytical tests to better predict plant protein performance in foods. Representative applications are discussed to demonstrate how the functional attributes of plant proteins affect the physicochemical properties of plant-based foods. Increasing the protein content of plant protein ingredients enhances their water and oil holding capacity and foaming stability. Industrially produced plant proteins often have lower solubility and worse functionality than laboratory-produced ones due to protein denaturation and aggregation during commercial isolation processes. To better predict the functional performance of plant proteins, it would be useful to use computer modeling approaches, such as quantitative structure-activity relationships (QSAR).

    Machine Learning for Biosensors

    Biosensors have become increasingly popular as diagnostic tools due to their ability to detect and quantify biological analytes in a wide range of applications. With the growing demand for faster and more reliable biosensing devices, machine learning has become a valuable tool in enhancing biosensor performance. In this report, we review recent progress in the application of machine learning to biosensors. We discuss the potential benefits of using machine learning in biosensors, including improved sensitivity, selectivity, and accuracy. We also discuss the various machine learning techniques that have been applied to biosensors, including data preprocessing, feature extraction, and classification and data analysis models. Further benefits of machine learning in biosensors are discussed, including the ability to analyze large and complex data sets, to detect subtle changes in biomolecular interactions, and to provide real-time monitoring of biological processes. The challenges associated with the integration of machine learning and biosensors are also addressed, including data availability, sensor performance, and computational requirements. We further highlight opportunities for this integration, including the development of portable and low-cost biosensors, and the use of machine learning algorithms for efficient data analysis. Finally, we provide an outlook on future trends and emerging technologies in the field, including the use of artificial intelligence and deep learning algorithms for biosensors, and the potential for creating a fully autonomous biosensing system.

    Deep learning methods for mining genomic sequence patterns

    Nowadays, with the growing availability of large-scale genomic datasets and advanced computational techniques, more and more data-driven computational methods have been developed to analyze genomic data and help to solve incompletely understood biological problems. Among them, deep learning methods have been proposed to automatically learn and recognize the functional activity of DNA sequences from genomics data. Techniques for efficiently mining genomic sequence patterns will help to improve our understanding of gene regulation, and thus accelerate our progress toward using personal genomes in medicine. This dissertation focuses on the development of deep learning methods for mining genomic sequences. First, we compare the performance of deep learning models and traditional machine learning methods in recognizing various genomic sequence patterns. Through extensive experiments on both simulated data and real genomic sequence data, we demonstrate that an appropriate deep learning model can generally be built to recognize various genomic sequence patterns successfully. Next, we develop deep learning methods to help solve two specific biological problems: (1) inference of the polyadenylation code and (2) tRNA gene detection and functional prediction. Polyadenylation is a pervasive mechanism used by eukaryotes for regulating mRNA transcription, localization, and translation efficiency. Polyadenylation signals in plants are particularly noisy and challenging to decipher. A deep convolutional neural network approach, DeepPolyA, is proposed to predict poly(A) sites from genomic sequences of the plant Arabidopsis thaliana. It employs various deep neural network architectures and demonstrates its superiority in comparison with competing methods, including classical machine learning algorithms and several popular deep learning models. Transfer RNAs (tRNAs) represent a highly complex class of genes and play a central role in protein translation. There remains a de facto tool, tRNAscan-SE, for identifying tRNA genes encoded in genomes. Despite its popularity and success, tRNAscan-SE is still not powerful enough to separate tRNAs from pseudo-tRNAs, and a significant number of false positives can be output as a result. To address this issue, tRNA-DL, a hybrid of a convolutional neural network and a recurrent neural network, is proposed. It is shown that the proposed method can help to substantially reduce the false positive rate of the state-of-the-art tRNA prediction tool tRNAscan-SE. Coupled with tRNAscan-SE, tRNA-DL can serve as a useful complementary tool for tRNA annotation. Taken together, the experiments and applications demonstrate the superiority of deep learning in automatic feature generation for characterizing genomic sequence patterns.

    Development of language modelling techniques for protein sequence analysis

    Master's dissertation in Bioinformatics. Nowadays, the ability to predict protein functions directly from amino-acid sequences alone remains a major biological challenge. The understanding of protein properties and functions is extremely important and can have a wide range of biotechnological and medical applications. Technological advances have led to an exponential growth of biological data, challenging conventional analysis strategies. High-level representations from the field of deep learning can provide new alternatives to address these problems; in particular, NLP methods such as word embeddings (WE) have shown particular success when applied to protein sequence analysis. Here, a module that eases the implementation of word embedding models for protein representation and classification is presented. Furthermore, this module was integrated into the ProPythia framework, allowing WE representations to be integrated straightforwardly with the training and testing of ML and DL models. This module was validated using two protein classification problems, namely identification of plant ubiquitylation sites and lysine crotonylation site prediction. This module was further used to explore enzyme functional annotation. Several WE were tested and fed to different ML and DL networks. Overall, WE achieved good results, being competitive even with state-of-the-art models, reinforcing the idea that language-based methods can be applied with success to a wide range of protein classification problems. This work presents a freely available tool to perform word embedding techniques for protein classification. The case studies presented reinforce the usability and importance of using NLP and ML in protein classification problems.

    Leveraging Structural Flexibility to Predict Protein Function

    Proteins are essentially versatile and flexible molecules, and understanding protein function plays a fundamental role in understanding biological systems. Protein structure comparisons are widely used for revealing protein function. However, because they assume rigidity or partial rigidity, most existing comparison methods do not consider conformational flexibility in protein structures. To address this issue, this thesis seeks to develop algorithms for flexible structure comparisons to predict one specific aspect of protein function, binding specificity. Given conformational samples as a representation of flexibility, we focus on two predictive problems related to specificity: aggregate prediction and individual prediction. For aggregate prediction, we have designed FAVA (Flexible Aggregate Volumetric Analysis). FAVA is the first conformationally general method to compare proteins with identical folds but different specificities. FAVA is able to correctly categorize members of protein superfamilies and to identify influential amino acids that cause different specificities. A second method, PEAP (Point-based Ensemble for Aggregate Prediction), employs ensemble clustering techniques over many base clusterings to predict binding specificity. This method incorporates structural motions of functional substructures and is capable of mitigating prediction errors. For individual prediction, the first method is an atomic point representation of flexibility in the binding cavity. This representation is able to predict binding specificity on each protein conformation with high accuracy, and it is the first to analyze maps of binding cavity conformations that describe proteins with different specificities. Our second method introduces a volumetric lattice representation. This representation localizes the solvent-accessible shape of the binding cavity by computing cavity volume in each user-defined space. It proves to be more informative than point-based representations. Last but not least, we discuss a structure-independent representation. This representation builds a lattice model on protein electrostatic isopotentials. This is the first known method to predict binding specificity explicitly from the perspective of electrostatic fields. The methods presented in this thesis incorporate the variety of protein conformations into the analysis of protein ligand binding, and provide more views on flexible structure comparisons and structure-based function annotation of molecular design.