20 research outputs found

Investigating the biological relevance in trained embedding representations of protein sequences

Author: De Neve Wesley
Pan Xiaoyong
Saeys Yvan
Wang Xi
Zuallaert Jasper
Publication date: 01/01/2019

As genome sequencing is becoming faster and cheaper, an abundance of DNA and protein sequence data is available. However, experimental annotation of structural or functional information develops at a much slower pace. Therefore, machine learning techniques have been widely adopted to make accurate predictions on unseen sequence data. In recent years, deep learning has been gaining popularity, as it allows for effective end-to-end learning. One consideration for its application on sequence data is the choice of a suitable and effective sequence representation strategy. In this paper, we investigate the significance of three common encoding schemes on the multi-label prediction problem of Gene Ontology (GO) term annotation, namely a one-hot encoding, an ad-hoc trainable embedding, and pre-trained protein vectors, using different hyper-parameters. We found that traditional unigram one-hot encodings achieved very good results, only slightly outperformed by unigram ad-hoc trainable embeddings and bigram pre-trained embeddings (by at most 3% for the F_max score), suggesting the exploration of different encoding strategies to be potentially beneficial. Most interestingly, when analyzing and visualizing the trained embeddings, we found that biologically relevant (dis)similarities between amino acid n-grams were implicitly learned, which were consistent with their physicochemical properties.
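To make the unigram and bigram one-hot encodings mentioned in the abstract above concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the function names, the use of overlapping n-grams, and the restriction to the 20 standard amino acids are all assumptions made for illustration.

```python
from itertools import product

# Assumption: only the 20 standard amino acids are considered.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_vocab(n):
    """Map every possible amino-acid n-gram to an integer index."""
    return {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=n))}


def one_hot_ngrams(sequence, n=1):
    """Encode a sequence as one one-hot vector per overlapping n-gram.

    Assumption: n-grams overlap (stride 1). A residue outside
    AMINO_ACIDS raises KeyError.
    """
    vocab = build_vocab(n)
    vectors = []
    for start in range(len(sequence) - n + 1):
        vec = [0] * len(vocab)
        vec[vocab[sequence[start:start + n]]] = 1
        vectors.append(vec)
    return vectors


unigram = one_hot_ngrams("MKV", n=1)  # 3 vectors of length 20
bigram = one_hot_ngrams("MKV", n=2)   # 2 vectors of length 400
```

A trainable embedding would typically replace each one-hot vector with a dense, learned vector looked up by the same vocabulary index.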

Ghent University Academic Bibliography

Few-shot learning using a small-sized dataset of high-resolution FUNDUS images for glaucoma diagnosis

Author: De Neve Wesley
Kim Mi Jung
Zuallaert Jasper
Publication date: 01/01/2017

Crossref

Ghent University Academic Bibliography

Web Applicable Computer-aided Diagnosis of Glaucoma Using Deep Learning

Author: De Neve Wesley
Janssens Olivier
Kim Mijung
Park Ho-min
Van Hoecke Sofie
Zuallaert Jasper
Publication date: 01/01/2018

Glaucoma is a major eye disease, leading to vision loss in the absence of proper medical treatment. Current diagnosis of glaucoma is performed by ophthalmologists who are often analyzing several types of medical images generated by different types of medical equipment. Capturing and analyzing these medical images is labor-intensive and expensive. In this paper, we present a novel computational approach towards glaucoma diagnosis and localization, only making use of eye fundus images that are analyzed by state-of-the-art deep learning techniques. Specifically, our approach leverages Convolutional Neural Networks (CNNs) and Gradient-weighted Class Activation Mapping (Grad-CAM) for glaucoma diagnosis and localization, respectively. Quantitative and qualitative results, as obtained for a small-sized dataset with no segmentation ground truth, demonstrate that the proposed approach is promising, for instance achieving an accuracy of 0.91 ± 0.02 and an ROC-AUC score of 0.94 for the diagnosis task. Furthermore, we present a publicly available prototype web application that integrates our predictive model, with the goal of making effective glaucoma diagnosis available to a wide audience.

Comment: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:cs/010120
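For readers unfamiliar with the two metrics quoted in the abstract above, the following sketch shows how an ROC-AUC score and a "mean ± spread" accuracy summary can be computed. The labels, scores, and per-fold accuracies are made-up illustrative values, not data from the paper, and treating the ± as a standard deviation over folds is an assumption the abstract does not confirm.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which
    the positive example gets the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier outputs (1 = glaucoma, 0 = healthy).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)  # 8 of 9 pairs ranked correctly

# Hypothetical per-fold accuracies, summarized as mean and sample std.
fold_accuracies = [0.90, 0.92, 0.91]
acc_mean, acc_std = mean(fold_accuracies), stdev(fold_accuracies)
```

The pairwise formulation is quadratic in the number of examples, which is fine for a small dataset; library implementations use a sort-based equivalent.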

arXiv.org e-Print Archive

Ghent University Academic Bibliography

Utilizing Mutations to Evaluate Interpretability of Neural Networks on Genomic Data

Author: Depuydt Stephen
Kang Solha
Ozbulak Utku
Vankerschaver Joris
Zuallaert Jasper
Publication date: 01/01/2022

Even though deep neural networks (DNNs) achieve state-of-the-art results for a number of problems involving genomic data, getting DNNs to explain their decision-making process has been a major challenge due to their black-box nature. One way to get DNNs to explain their reasoning for prediction is via attribution methods which are assumed to highlight the parts of the input that contribute to the prediction the most. Given the existence of numerous attribution methods and a lack of quantitative results on the fidelity of those methods, selection of an attribution method for sequence-based tasks has been mostly done qualitatively. In this work, we take a step towards identifying the most faithful attribution method by proposing a computational approach that utilizes point mutations. Providing quantitative results on seven popular attribution methods, we find Layerwise Relevance Propagation (LRP) to be the most appropriate one for translation initiation, with LRP identifying two important biological features for translation: the integrity of Kozak sequence as well as the detrimental effects of premature stop codons.

Comment: Accepted for publication at the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), Workshop on Learning Meaningful Representations of Life (LMRL)
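The abstract above says only that point mutations are used to compare attribution methods; it does not give the procedure. One plausible reading, sketched below under that assumption, is to mutate the positions an attribution method ranks highest and measure how much the model output drops. The toy model, the substitution table, and both attribution vectors are hypothetical stand-ins, not the authors' setup.

```python
def toy_model(seq):
    """Hypothetical stand-in for a trained DNN: returns 1.0 if 'ATG'
    occupies positions 3-5 of the sequence, else 0.0."""
    return 1.0 if seq[3:6] == "ATG" else 0.0


def mutate(seq, positions):
    """Apply point mutations: swap each chosen base for a different one."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for i in positions:
        chars[i] = swap[chars[i]]
    return "".join(chars)


def score_drop(model, seq, attributions, k=1):
    """Mutate the k highest-attributed positions and return the
    resulting decrease in model output."""
    top = sorted(range(len(seq)), key=lambda i: attributions[i],
                 reverse=True)[:k]
    return model(seq) - model(mutate(seq, top))


seq = "GCCATGGCA"
# Hypothetical attribution maps from two methods.
attr_a = [0.0, 0.0, 0.1, 0.9, 0.8, 0.7, 0.1, 0.0, 0.0]  # highlights ATG
attr_b = [0.9, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.8, 0.2]  # highlights flanks

drop_a = score_drop(toy_model, seq, attr_a)  # mutating position 3 breaks ATG
drop_b = score_drop(toy_model, seq, attr_b)  # mutating position 0 changes nothing
```

Under this reading, the method whose top-ranked positions cause the larger drop would be judged the more faithful one for this input.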

arXiv.org e-Print Archive

Ghent University Academic Bibliography

Interpretable convolutional neural networks for effective translation initiation site prediction

Author: De Neve Wesley
Kim Mi Jung
Saeys Yvan
Zuallaert Jasper
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2017

Thanks to rapidly evolving sequencing techniques, the amount of genomic data at our disposal is growing increasingly large. Determining the gene structure is a fundamental requirement to effectively interpret gene function and regulation. An important part in that determination process is the identification of translation initiation sites. In this paper, we propose a novel approach for automatic prediction of translation initiation sites, leveraging convolutional neural networks that allow for automatic feature extraction. Our experimental results demonstrate that we are able to improve on state-of-the-art approaches, with a decrease of 75.2% in false positive rate and a decrease of 24.5% in error rate on chosen datasets. Furthermore, an in-depth analysis of the decision-making process used by our predictive model shows that our neural network implicitly learns biologically relevant features from scratch, without any prior knowledge about the problem at hand, such as the Kozak consensus sequence, the influence of stop and start codons in the sequence and the presence of donor splice site patterns. In summary, our findings yield a better understanding of the internal reasoning of a convolutional neural network when applying such a neural network to genomic data.
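As a rough illustration of how a convolutional filter can act as a motif detector on genomic sequence, the sketch below slides one hand-set filter over a one-hot encoded DNA string. It is not the architecture from the paper: the filter weights are fixed by hand to match the start codon "ATG", whereas a trained network would learn its weights from data, and all names are illustrative.

```python
NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-dimensional one-hot vectors."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq]


def conv1d(encoded, kernel):
    """Slide one filter over the sequence without padding and return
    the activation at each window start."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            total += sum(x * w for x, w in
                         zip(encoded[start + offset], kernel[offset]))
        activations.append(total)
    return activations


# Assumption: a filter whose weights equal the one-hot code of "ATG".
atg_filter = one_hot("ATG")
activations = conv1d(one_hot("GCCATGGCA"), atg_filter)

# Index of the window with the strongest response.
best = max(range(len(activations)), key=activations.__getitem__)
```

With this filter, each activation simply counts how many of the three bases in the window match "ATG", so the response peaks where the full codon occurs.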

Ghent University Academic Bibliography

Web applicable computer-aided diagnosis of glaucoma using deep learning

Author: De Neve Wesley
Janssens Olivier
Kim Mi Jung
Park H.
Van Hoecke Sofie
Zuallaert Jasper
Publication date: 01/01/2018

Ghent University Academic Bibliography

ToxDL : deep learning using primary structure and domain embeddings for assessing protein toxicity

Author: Campos Elda Posada
De Neve Wesley
Marushchak Denys O.
Pan Xiaoyong
Shen Hong-Bin
Wang Xi
Zuallaert Jasper
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2020

Motivation: Genetically engineering food crops involves introducing proteins from other species into crop plant species or modifying already existing proteins with gene editing techniques. In addition, newly synthesized proteins can be used as therapeutic protein drugs against diseases. For both research and safety regulation purposes, being able to assess the potential toxicity of newly introduced/synthesized proteins is of high importance. Results: In this study, we present ToxDL, a deep learning-based approach for in silico prediction of protein toxicity from sequence alone. ToxDL consists of (i) a module encompassing a convolutional neural network that has been designed to handle variable-length input sequences, (ii) a domain2vec module for generating protein domain embeddings and (iii) an output module that classifies proteins as toxic or non-toxic, using the outputs of the two aforementioned modules. Independent test results obtained for animal proteins and cross-species transferability results obtained for bacterial proteins indicate that ToxDL outperforms traditional homology-based approaches and state-of-the-art machine-learning techniques. Furthermore, through visualizations based on saliency maps, we are able to verify that the proposed network learns known toxic motifs. Moreover, the saliency maps allow for directed in silico modification of a sequence, thus making it possible to alter its predicted protein toxicity.

Crossref

Ghent University Academic Bibliography

Translation Initiation Site Prediction Using Deep Learning and Synthetic Datasets

Author: De Neve Wesley
Kabanga Espoir
Park Yunseol
Shim Hyunjin
Van Messem Arnout
Zuallaert Jasper
Publication date: 01/07/2021

Ghent University Academic Bibliography

Open Repository and Bibliography - Liège

Automated learning of biologically relevant features from primary sequence using convolutional neural networks

Author: Zuallaert Jasper
Publication venue: Universiteit Gent. Faculteit Ingenieurswetenschappen en Architectuur
Publication date: 28/05/2023

Ghent University Academic Bibliography

PhosphoLingo: protein language models for phosphorylation site prediction

Author: Bouwmeester Robbin
Callewaert Nico
Degroeve Sven
Ramasamy Pathmanaban
Zuallaert Jasper
Publication venue: Cold Spring Harbor Laboratory
Publication date: 01/01/2022

Motivation: With a regulatory impact on numerous biological processes, protein phosphorylation is one of the most studied post-translational modifications. Effective computational methods that provide a sequence-based prediction of probable phosphorylation sites are desirable to guide functional experiments or constrain search spaces for proteomics-based experimental pipelines. Currently, the most successful methods train deep learning models on amino acid composition representations. However, recently proposed protein language models provide enriched sequence representations that may contain higher-level pattern information on which more performant phosphorylation site predictions may be based. Results: We explored the applicability of protein language models to general phosphorylation site prediction. We found that training prediction models on top of protein language models yields a relative improvement of between 13.4% and 63.3% in terms of area under the precision-recall curve over the state-of-the-art predictors. Advanced model interpretation and model transferability experiments reveal that across models, protease-specific cleavage patterns give rise to a protease-specific training bias that results in an overly optimistic estimation of phosphorylation site prediction performance, an important caveat in the application of advanced machine learning approaches to protein modification prediction based on proteomics data. Availability and implementation: PhosphoLingo code, datasets, and predictions are available at https://github.com/jasperzuallaert/[email protected],[email protected] Supplementary information: Supplementary materials are available at bioRxiv.
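The relative improvement in area under the precision-recall curve quoted in the abstract above can be illustrated with a small calculation. The sketch below estimates the area with average precision, which is one common estimator and an assumption here, since the abstract does not say which estimator was used. All labels, scores, and the two example areas are hypothetical.

```python
def average_precision(labels, scores):
    """Estimate the area under the precision-recall curve as the mean
    of the precision values at the rank of each positive example.
    Ties in score are not treated specially."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0],
                    reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_pos


def relative_improvement(new, baseline):
    """Improvement of `new` over `baseline`, as a fraction of `baseline`."""
    return (new - baseline) / baseline


# Hypothetical predictions (1 = phosphorylation site, 0 = not a site).
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

# Hypothetical areas for a new model and a baseline predictor.
gain = relative_improvement(0.6, 0.5)  # a 20% relative improvement
```

Note that a relative improvement depends on the baseline: the same absolute gain reads as a larger percentage when the baseline area is low.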

Ghent University Academic Bibliography