Investigating the biological relevance in trained embedding representations of protein sequences
As genome sequencing is becoming faster and cheaper, an abundance of DNA and protein sequence data is available. However, experimental annotation of structural or functional information develops at a much slower pace. Therefore, machine learning techniques have been widely adopted to make accurate predictions on unseen sequence data. In recent years, deep learning has been gaining popularity, as it allows for effective end-to-end learning. One consideration for its application to sequence data is the choice of a suitable and effective sequence representation strategy. In this paper, we investigate the significance of three common encoding schemes for the multi-label prediction problem of Gene Ontology (GO) term annotation, namely a one-hot encoding, an ad-hoc trainable embedding, and pre-trained protein vectors, using different hyper-parameters. We found that traditional unigram one-hot encodings achieved very good results, only slightly outperformed by unigram ad-hoc trainable embeddings and bigram pre-trained embeddings (by at most 3% for the Fmax score), suggesting that exploring different encoding strategies is potentially beneficial. Most interestingly, when analyzing and visualizing the trained embeddings, we found that biologically relevant (dis)similarities between amino acid n-grams were implicitly learned, which were consistent with their physicochemical properties.
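To make the unigram and bigram one-hot encodings mentioned in this abstract concrete, the following is a minimal sketch. It assumes the 20 standard amino acids and overlapping n-grams; the abstract does not state how n-grams were extracted, and the names `AMINO_ACIDS`, `ngram_vocabulary` and `one_hot_ngrams` are hypothetical, not taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: the 20 standard residues only


def ngram_vocabulary(n):
    """Map every possible amino acid n-gram to a column index."""
    return {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=n))}


def one_hot_ngrams(sequence, n=1):
    """Encode a sequence as one one-hot row per overlapping n-gram.

    Assumption: n-grams overlap (stride 1). Residues outside AMINO_ACIDS
    raise a KeyError in this sketch.
    """
    vocab = ngram_vocabulary(n)
    rows = []
    for start in range(len(sequence) - n + 1):
        row = [0] * len(vocab)
        row[vocab[sequence[start:start + n]]] = 1
        rows.append(row)
    return rows


unigram = one_hot_ngrams("MKV", n=1)  # 3 rows, 20 columns each
bigram = one_hot_ngrams("MKV", n=2)   # 2 rows ("MK", "KV"), 400 columns each
```

A trainable embedding would replace each one-hot row with a learned dense vector looked up by the same index, which is why the vocabulary size grows from 20 to 400 when moving from unigrams to bigrams.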
Web Applicable Computer-aided Diagnosis of Glaucoma Using Deep Learning
Glaucoma is a major eye disease, leading to vision loss in the absence of
proper medical treatment. Current diagnosis of glaucoma is performed by
ophthalmologists who are often analyzing several types of medical images
generated by different types of medical equipment. Capturing and analyzing
these medical images is labor-intensive and expensive. In this paper, we
present a novel computational approach towards glaucoma diagnosis and
localization, only making use of eye fundus images that are analyzed by
state-of-the-art deep learning techniques. Specifically, our approach leverages
Convolutional Neural Networks (CNNs) and Gradient-weighted Class Activation
Mapping (Grad-CAM) for glaucoma diagnosis and localization, respectively.
Quantitative and qualitative results, as obtained for a small-sized dataset
with no segmentation ground truth, demonstrate that the proposed approach is
promising, for instance achieving an accuracy of 0.91 and an ROC-AUC
score of 0.94 for the diagnosis task. Furthermore, we present a publicly
available prototype web application that integrates our predictive model, with
the goal of making effective glaucoma diagnosis available to a wide audience.
Comment: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018
arXiv:cs/010120
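The localization step named in the glaucoma abstract, Grad-CAM, combines the feature maps of a convolutional layer with the gradients of the class score. The sketch below shows only that combination step in plain Python, assuming the activations and gradients have already been extracted from a network; the function name `grad_cam` and the toy numbers are illustrative and are not the authors' implementation.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps into one H x W heatmap.

    activations: list of K feature maps, each an H x W nested list.
    gradients:   gradients of the class score w.r.t. those maps, same shape.
    """
    height = len(activations[0])
    width = len(activations[0][0])
    # Channel weight: global average of that channel's gradient map.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    heatmap = []
    for i in range(height):
        heatmap.append([
            # ReLU keeps only locations with a positive influence on the class.
            max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
            for j in range(width)
        ])
    return heatmap


# Toy input: two 2 x 2 feature maps with made-up values.
activations = [[[1, 0], [0, 2]], [[0, 3], [1, 0]]]
gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
heatmap = grad_cam(activations, gradients)
```

In practice the heatmap is upsampled to the size of the input image and overlaid on it; that step is omitted here.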
Utilizing Mutations to Evaluate Interpretability of Neural Networks on Genomic Data
Even though deep neural networks (DNNs) achieve state-of-the-art results for
a number of problems involving genomic data, getting DNNs to explain their
decision-making process has been a major challenge due to their black-box
nature. One way to get DNNs to explain their reasoning for prediction is via
attribution methods which are assumed to highlight the parts of the input that
contribute to the prediction the most. Given the existence of numerous
attribution methods and a lack of quantitative results on the fidelity of those
methods, selection of an attribution method for sequence-based tasks has been
mostly done qualitatively. In this work, we take a step towards identifying the
most faithful attribution method by proposing a computational approach that
utilizes point mutations. Providing quantitative results on seven popular
attribution methods, we find Layerwise Relevance Propagation (LRP) to be the
most appropriate one for translation initiation, with LRP identifying two
important biological features for translation: the integrity of Kozak sequence
as well as the detrimental effects of premature stop codons.
Comment: Accepted for publication at the 36th Conference on Neural Information
Processing Systems (NeurIPS 2022), Workshop on Learning Meaningful
Representations of Life (LMRL)
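The abstract above does not spell out how point mutations are used to score attribution methods. One plausible reading, sketched below under that assumption, is to measure how much each single-nucleotide substitution changes the model output and to check whether the positions an attribution method ranks highest are also the most mutation-sensitive ones. The `toy_model`, the overlap metric and the example attribution vectors are all hypothetical stand-ins.

```python
NUCLEOTIDES = "ACGT"


def toy_model(seq):
    """Hypothetical stand-in for a trained network: 1.0 if ATG starts at index 3."""
    return 1.0 if seq[3:6] == "ATG" else 0.0


def mutation_effects(seq, model):
    """Largest absolute score change caused by any single substitution per position."""
    base = model(seq)
    effects = []
    for pos, original in enumerate(seq):
        changes = [abs(base - model(seq[:pos] + alt + seq[pos + 1:]))
                   for alt in NUCLEOTIDES if alt != original]
        effects.append(max(changes))
    return effects


def top_k_overlap(attributions, effects, k):
    """Fraction of the k highest-attributed positions that are also among
    the k most mutation-sensitive positions (an assumed, simple metric)."""
    def top(values):
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        return set(order[:k])
    return len(top(attributions) & top(effects)) / k


seq = "GCCATGGCA"
effects = mutation_effects(seq, toy_model)
faithful = [0, 0, 0, 0.9, 0.8, 0.7, 0, 0, 0]    # made-up scores focused on ATG
unfaithful = [0.9, 0.8, 0.7, 0, 0, 0, 0, 0, 0]  # made-up scores focused elsewhere
```

Under this toy setup, an attribution vector concentrated on the start codon overlaps fully with the mutation-sensitive positions, while one concentrated on the first three nucleotides does not overlap at all.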
Interpretable convolutional neural networks for effective translation initiation site prediction
Thanks to rapidly evolving sequencing techniques, the amount of genomic data at our disposal is growing increasingly large. Determining the gene structure is a fundamental requirement to effectively interpret gene function and regulation. An important part of that determination process is the identification of translation initiation sites. In this paper, we propose a novel approach for automatic prediction of translation initiation sites, leveraging convolutional neural networks that allow for automatic feature extraction. Our experimental results demonstrate that we are able to improve on state-of-the-art approaches with a decrease of 75.2% in false positive rate and a decrease of 24.5% in error rate on chosen datasets. Furthermore, an in-depth analysis of the decision-making process used by our predictive model shows that our neural network implicitly learns biologically relevant features from scratch, without any prior knowledge about the problem at hand, such as the Kozak consensus sequence, the influence of stop and start codons in the sequence and the presence of donor splice site patterns. In summary, our findings yield a better understanding of the internal reasoning of a convolutional neural network when applying such a neural network to genomic data.
ToxDL: deep learning using primary structure and domain embeddings for assessing protein toxicity
Motivation: Genetically engineering food crops involves introducing proteins from other species into crop plant species or modifying already existing proteins with gene editing techniques. In addition, newly synthesized proteins can be used as therapeutic protein drugs against diseases. For both research and safety regulation purposes, being able to assess the potential toxicity of newly introduced/synthesized proteins is of high importance. Results: In this study, we present ToxDL, a deep learning-based approach for in silico prediction of protein toxicity from sequence alone. ToxDL consists of (i) a module encompassing a convolutional neural network that has been designed to handle variable-length input sequences, (ii) a domain2vec module for generating protein domain embeddings and (iii) an output module that classifies proteins as toxic or non-toxic, using the outputs of the two aforementioned modules. Independent test results obtained for animal proteins and cross-species transferability results obtained for bacterial proteins indicate that ToxDL outperforms traditional homology-based approaches and state-of-the-art machine-learning techniques. Furthermore, through visualizations based on saliency maps, we are able to verify that the proposed network learns known toxic motifs. Moreover, the saliency maps allow for directed in silico modification of a sequence, thus making it possible to alter its predicted protein toxicity.
PhosphoLingo: protein language models for phosphorylation site prediction
Motivation: With a regulatory impact on numerous biological processes, protein phosphorylation is one of the most studied post-translational modifications. Effective computational methods that provide a sequence-based prediction of probable phosphorylation sites are desirable to guide functional experiments or constrain search spaces for proteomics-based experimental pipelines. Currently, the most successful methods train deep learning models on amino acid composition representations. However, recently proposed protein language models provide enriched sequence representations that may contain higher-level pattern information on which more performant phosphorylation site predictions may be based. Results: We explored the applicability of protein language models to general phosphorylation site prediction. We found that training prediction models on top of protein language models yields a relative improvement of between 13.4% and 63.3% in terms of area under the precision-recall curve over the state-of-the-art predictors. Advanced model interpretation and model transferability experiments reveal that across models, protease-specific cleavage patterns give rise to a protease-specific training bias that results in an overly optimistic estimation of phosphorylation site prediction performance, an important caveat in the application of advanced machine learning approaches to protein modification prediction based on proteomics data. Availability and implementation: PhosphoLingo code, datasets, and predictions are available at https://github.com/jasperzuallaert/[email protected],[email protected] Supplementary information: Supplementary materials are available at bioRxiv.