4,102 research outputs found
Convolutional LSTM Networks for Subcellular Localization of Proteins
Machine learning is widely used to analyze biological sequence data.
Non-sequential models such as SVMs or feed-forward neural networks are often
used, although they have no natural way of handling sequences of varying length.
Recurrent neural networks such as the long short-term memory (LSTM) model, on
the other hand, are designed to handle sequences. In this study, we demonstrate
that LSTM networks predict the subcellular location of proteins given only the
protein sequence with high accuracy (0.902), outperforming current
state-of-the-art algorithms. We further improve the performance by introducing convolutional
filters and experiment with an attention mechanism that lets the LSTM focus on
specific parts of the protein. Lastly, we introduce new visualizations of both
the convolutional filters and the attention mechanisms and show how they can be
used to extract biologically relevant knowledge from the LSTM networks.
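As a rough illustration of the attention idea in this abstract, the sketch below pools a variable-length protein sequence into a fixed-size vector by weighting each position. It is not the authors' model: the one-hot encoding, the random `scoring_vector` (a stand-in for learned parameters), the example sequences, and the function names are all assumptions made for the example, and the convolutional and LSTM layers are omitted.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a protein sequence as a (length, 20) one-hot matrix."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for position, residue in enumerate(sequence):
        encoded[position, AA_INDEX[residue]] = 1.0
    return encoded


def attention_pool(features, scoring_vector):
    """Collapse (length, dim) features into a single (dim,) vector.

    Each position receives a scalar score; a softmax over positions turns
    the scores into weights that sum to 1 regardless of sequence length.
    """
    scores = features @ scoring_vector
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ features, weights


rng = np.random.default_rng(0)
# Hypothetical stand-in for parameters that a real model would learn.
scoring_vector = rng.normal(size=len(AMINO_ACIDS))

# Made-up sequences of different lengths map to vectors of the same size.
for seq in ["MKTAYIAK", "MSLLTEVETPIRNEWG"]:
    pooled, weights = attention_pool(one_hot(seq), scoring_vector)
    print(len(seq), pooled.shape, round(float(weights.sum()), 6))
```

The per-position `weights` are the kind of quantity one could plot along the sequence to see which residues the pooling emphasizes.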
De novo structural modeling and computational sequence analysis of a bacteriocin protein isolated from Rhizobium leguminosarum bv. viciae strain LC-31
Bacteriocins produced by different groups of bacteria are ribosomally synthesized peptides or proteins with antimicrobial and specific antagonistic bacterial interaction activity. Rhizobium leguminosarum is a Gram-negative soil bacterium which plays an important role in nitrogen fixation in leguminous plants. Bacteriocins produced by different strains of R. leguminosarum are known to impart antagonistic effects on other closely related strains. Recently, a bacteriocin gene was isolated from R. leguminosarum bv. viciae strain LC-31. Our study was aimed at computational proteomic analysis and 3D structural modeling of the novel bacteriocin protein encoded by this gene. Different bioinformatics tools and machine learning techniques were used for protein structural classification. De novo protein modeling was performed using the I-TASSER server. The final model was assessed with PROCHECK and DFIRE2, which confirmed that it is reliable. Until complete biochemical and structural data for the bacteriocin protein produced by R. leguminosarum bv. viciae strain LC-31 are determined experimentally, this model can serve as a valuable reference for characterizing this multifunctional protein.
Key words: Bacteriocin, rhizobium, protein modelling, nodulation, symbiosis, nitrogen fixation
Machine learning-guided directed evolution for protein engineering
Machine learning (ML)-guided directed evolution is a new paradigm for
biological design that enables optimization of complex functions. ML methods
use data to predict how sequence maps to function without requiring a detailed
model of the underlying physics or biological pathways. To demonstrate
ML-guided directed evolution, we introduce the steps required to build ML
sequence-function models and use them to guide engineering, making
recommendations at each stage. This review covers basic concepts relevant to
using ML for protein engineering as well as the current literature and
applications of this new engineering paradigm. ML methods accelerate directed
evolution by learning from information contained in all measured variants and
using that information to select sequences that are likely to be improved. We
then provide two case studies that demonstrate the ML-guided directed evolution
process. We also look to future opportunities where ML will enable discovery of
new protein functions and uncover the relationship between protein sequence and
function.
Comment: Made significant revisions to focus on aspects most relevant to
applying machine learning to speed up directed evolution
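One round of the loop this abstract describes (fit a sequence-function model on measured variants, then use it to pick promising unmeasured ones) might look like the following minimal sketch. Everything specific here is an assumption for illustration: the four-residue toy sequences and their fitness values are invented, and ridge regression on one-hot features is just one simple model choice, not a method taken from the review.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def featurize(sequence):
    """Flatten a fixed-length sequence into a one-hot feature vector."""
    features = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for position, residue in enumerate(sequence):
        features[position, AMINO_ACIDS.index(residue)] = 1.0
    return features.ravel()


def fit_ridge(X, y, penalty=1.0):
    """Closed-form ridge regression: w = (X^T X + penalty * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)


# Hypothetical measured variants as (sequence, fitness) pairs; not real data.
measured = [("ACDA", 1.0), ("ACDG", 1.4), ("GCDA", 0.6), ("ACEG", 1.9)]
X = np.array([featurize(seq) for seq, _ in measured])
y = np.array([fitness for _, fitness in measured])

# Center the targets so the model only has to explain deviations from the mean.
mean_fitness = y.mean()
weights = fit_ridge(X, y - mean_fitness)

# Rank unmeasured candidates by predicted fitness, best first.
candidates = ["GCEA", "ACEA", "GCDG"]
predicted = {seq: float(featurize(seq) @ weights + mean_fitness) for seq in candidates}
ranked = sorted(candidates, key=predicted.get, reverse=True)
print(ranked)
```

In practice the top-ranked candidates would be synthesized and measured, and their results added to `measured` before the next round.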
Evaluation of secretion prediction highlights differing approaches needed for oomycete and fungal effectors
© 2015 Sperschneider, Williams, Hane, Singh and Taylor. The steadily increasing number of sequenced fungal and oomycete genomes has enabled detailed studies of how these eukaryotic microbes infect plants and cause devastating losses in food crops. During infection, fungal and oomycete pathogens secrete effector molecules which manipulate host plant cell processes to the pathogen's advantage. Proteinaceous effectors are synthesized intracellularly and must be externalized to interact with host cells. Computational prediction of secreted proteins from genomic sequences is an important technique to narrow down the candidate effector repertoire for subsequent experimental validation. In this study, we benchmark secretion prediction tools on experimentally validated fungal and oomycete effectors. We observe that for a set of fungal SwissProt protein sequences, SignalP 4 and the neural network predictors of SignalP 3 (D-score) and SignalP 2 perform best. For effector prediction in particular, the use of a sensitive method can be desirable to obtain the most complete candidate effector set. We show that the neural network predictors of SignalP 2 and 3, as well as TargetP, were the most sensitive tools for fungal effector secretion prediction, whereas the hidden Markov model predictors of SignalP 2 and 3 were the most sensitive tools for oomycete effectors. Thus, previous versions of SignalP retain value for oomycete effector prediction, as the current version, SignalP 4, was unable to reliably predict the signal peptide of the oomycete Crinkler effectors in the test set. Our assessment of subcellular localization predictors shows that cytoplasmic effectors are often predicted as not extracellular. This limits the reliability of secretion predictions that depend on these tools. We present our assessment with a view to informing future pathogenomics studies and suggest revised pipelines for secretion prediction to obtain optimal effector predictions in fungi and oomycetes.
Identification And Functional Characterization Of Plant Small Secreted Proteins During Arbuscular Mycorrhizal Symbiosis
Plant small secreted proteins (SSPs) are sequences of 50–250 amino acids which are transported out of cells to fulfill multiple functions related to plant growth and development and to the response to various stresses. With the development of more accurate and affordable genome sequencing technology, an increasing number of SSPs have been predicted using diverse computational tools based on machine learning. Although experimentally validated plant SSPs are still limited, some studies have reported that plant SSPs can be induced and are involved in mutualistic relationships between plants and microbes. In Chapter I, known SSPs and their functions in various plant species are reviewed. Additionally, current computational tools and experimental methods that have been widely applied to identify plant SSPs are summarized. A new, robust, and integrated pipeline to discover plant SSPs is proposed. Furthermore, strategies for elucidating the biological functions of SSPs in plants are discussed in Chapter I. Chapter II presents predicted SSPs from 60 plant species and elucidates the evolutionary convergence of changes in SSP sequences. Furthermore, the expression of SSPs induced by arbuscular mycorrhizal fungi (AMF), which corresponds to the convergent ability of different plants to form mutualistic associations with AMF, is explored. Overall, this study provides insights into the functions of plant SSPs during symbiosis between plants and fungi.
Deep Learning for Genomics: A Concise Overview
Advancements in genomic research such as high-throughput sequencing
techniques have driven modern genomic studies into "big data" disciplines. This
data explosion is constantly challenging conventional methods used in genomics.
In parallel with the urgent demand for robust algorithms, deep learning has
succeeded in a variety of fields such as vision, speech, and text processing.
Yet genomics poses unique challenges to deep learning, since we expect
from deep learning a superhuman intelligence that explores beyond our knowledge
to interpret the genome. A powerful deep learning model should rely on
insightful utilization of task-specific knowledge. In this paper, we briefly
discuss the strengths of different deep learning models from a genomic
perspective so as to fit each particular task with a proper deep architecture,
and remark on practical considerations of developing modern deep learning
architectures for genomics. We also provide a concise review of deep learning
applications in various aspects of genomic research, as well as pointing out
potential opportunities and obstacles for future genomics applications.
Comment: Invited chapter for Springer Book: Handbook of Deep Learning
Application
PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding
We are now witnessing significant progress of deep learning methods in a
variety of tasks (or datasets) of proteins. However, there is a lack of a
standard benchmark to evaluate the performance of different methods, which
hinders the progress of deep learning in this field. In this paper, we propose
such a benchmark called PEER, a comprehensive and multi-task benchmark for
Protein sEquence undERstanding. PEER provides a set of diverse protein
understanding tasks including protein function prediction, protein localization
prediction, protein structure prediction, protein-protein interaction
prediction, and protein-ligand interaction prediction. We evaluate different
types of sequence-based methods for each task including traditional feature
engineering approaches, different sequence encoding methods as well as
large-scale pre-trained protein language models. In addition, we also
investigate the performance of these methods under the multi-task learning
setting. Experimental results show that large-scale pre-trained protein
language models achieve the best performance for most individual tasks, and
jointly training multiple tasks further boosts the performance. The datasets
and source codes of this benchmark are all available at
https://github.com/DeepGraphLearning/PEER_Benchmark
Comment: Accepted by NeurIPS 2022 Dataset and Benchmark Track. arXiv v2:
source code released; arXiv v1: release all benchmark results