53 research outputs found

    Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

    Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data, which are not available for this task. However, a very large number of protein sequences without functional labels is available. Results: We applied an existing deep sequence model that had been pretrained in an unsupervised setting to the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. Also, it partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining.

    Machine learning-based somatic variant calling in cell-free DNA of metastatic breast cancer patients using large NGS panels

    Next generation sequencing of cell-free DNA (cfDNA) is a promising method for treatment monitoring and therapy selection in metastatic breast cancer (MBC). However, distinguishing tumor-specific variants from sequencing artefacts and germline variation with low false discovery rate is challenging when using large targeted sequencing panels covering many tumor suppressor genes. To address this, we built a machine learning model to remove false positive variant calls and augmented it with additional filters to ensure selection of tumor-derived variants. We used cfDNA of 70 MBC patients profiled with both the small targeted Oncomine breast panel (Thermofisher) and the much larger Qiaseq Human Breast Cancer Panel (Qiagen). The model was trained on the panels’ common regions using Oncomine hotspot mutations as ground truth. Applied to Qiaseq data, it achieved 35% sensitivity and 36% precision, outperforming basic filtering. For 20 patients we used germline DNA to filter for somatic variants and obtained 245 variants in total, while our model found seven variants, of which six were also detected using the germline strategy. In ten tumor-free individuals, our method detected in total one (potentially germline) variant, in contrast to 521 variants detected without our model. These results indicate that our model largely detects somatic variants.
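
    The overlap reported for the 20 patients with germline DNA can be turned into two simple fractions. The sketch below is only illustrative arithmetic on the counts quoted in the abstract (7 model calls, 6 shared, 245 germline-filtered calls). Treating the germline-filtered set as a reference is an assumption made here, and these fractions are not the 35% sensitivity and 36% precision quoted above, which were measured against Oncomine hotspot mutations. The function name `overlap_summary` is hypothetical.

```python
def overlap_summary(n_model_calls, n_shared, n_reference_calls):
    """Summarise the overlap between two variant call sets.

    Returns a pair:
      - fraction of model calls that also appear in the reference set
      - fraction of reference calls that the model recovers
    """
    if n_model_calls <= 0 or n_reference_calls <= 0:
        raise ValueError("both call sets must be non-empty")
    if n_shared < 0 or n_shared > min(n_model_calls, n_reference_calls):
        raise ValueError("shared count is inconsistent with the set sizes")
    confirmed = n_shared / n_model_calls
    recovered = n_shared / n_reference_calls
    return confirmed, recovered


# Counts taken from the abstract; the germline-filtered set is used as
# the reference here purely for illustration (an assumption).
confirmed, recovered = overlap_summary(
    n_model_calls=7, n_shared=6, n_reference_calls=245
)
print(f"model calls confirmed by germline filtering: {confirmed:.1%}")
print(f"germline-filtered calls recovered by model:  {recovered:.1%}")
```

    Under this reading, most of the model's calls are confirmed by the germline strategy, while the model keeps only a small part of the germline-filtered set.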

    Hematopoiesis of a Healthy Supercentenarian is Dominated by One Myeloid-Biased Stem Cell Clone for at least 9 Years

    Electrical Engineering, Mathematics and Computer Science; Intelligent Systems; Pattern Recognition and Bioinformatics

    Protein function prediction using pre-trained ELMO embeddings

    This dataset includes the data for training the protein function prediction models at github.com/stamakro/GCN-for-Structure-and-Function. For each protein, a pickle file is provided, containing its sequence, ELMo embedding and labels. It also includes the weights of the trained models that can be applied directly. README file at github.com/stamakro/GCN-for-Structure-and-Functio

    What does that gene do?: Gene function prediction by machine learning with applications to plants

    Billions of people worldwide rely on plant-based food for their daily energy intake. As global warming and the spread of diseases (such as the banana Panama disease) are substantially hindering the cultivation of plants, the need to develop temperature- and/or disease-resistant varieties is becoming more and more pressing. The field of plant breeding has been revolutionized by the use of molecular biology methods, such as DNA and RNA sequencing, which have substantially accelerated the finding of genes that are likely to influence a trait of interest. The outcome of such experiments is typically a long list of candidate genes whose involvement in the trait needs to be experimentally validated. Prioritizing these experiments, i.e. testing the most promising genes first, can save a lot of time, effort and money, but is often hindered by the fact that the cellular roles (functions) of plant genes and the corresponding proteins are often unknown. Experimentally discovering the functions of genes is equally time-consuming and costly, so it is crucial to have computer algorithms that can automatically predict gene or protein functions with high accuracy. After decades of research in this field, considerable progress has been made, but we are still far from a widely acceptable and accurate solution to the problem. This thesis explores different research directions to improve protein function prediction by developing new machine learning algorithms. These directions include new ways to represent proteins, exploiting semantic relationships among functions, and function-specific feature selection. The thesis also deals with the problem of missing protein interaction data for non-model species and quantifies its effect on protein function prediction. All in all, it provides novel insights into the problem that future work can build upon to lead to new breakthroughs.

    A thorough analysis of the contribution of experimental, derived and sequence-based predicted protein-protein interactions for functional annotation of proteins

    Physical interaction between two proteins is strong evidence that the proteins are involved in the same biological process, making Protein-Protein Interaction (PPI) networks a valuable data resource for predicting the cellular functions of proteins. However, PPI networks are largely incomplete for non-model species. Here, we tested to what extent these incomplete networks are still useful for genome-wide function prediction. We used two network-based classifiers to predict Biological Process Gene Ontology terms from protein interaction data in four species: Saccharomyces cerevisiae, Escherichia coli, Arabidopsis thaliana and Solanum lycopersicum (tomato). The classifiers had reasonable performance in the well-studied yeast, but performed poorly in the other species. We showed that this poor performance can be considerably improved by adding edges predicted from various data sources, such as text mining, and that associations from the STRING database are more useful than interactions predicted by a neural network from sequence-based features.

    Metric learning on expression data for gene function prediction

    Motivation: Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, when using RNA-Seq datasets with many experimental conditions from diverse sources, only a subset of the experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest. Results: To address this, we developed Metric Learning for Co-expression (MLC), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Moreover, our method is particularly good at more specific terms, which are the most interesting. Finally, by observing the sample weights for a particular GO term, one can identify which experiments are important for learning that term and potentially identify novel conditions that are relevant, as demonstrated by experiments in both A. thaliana and Pseudomonas aeruginosa.
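
    To make the idea of a weighted co-expression measure concrete, here is a minimal sketch of a sample-weighted Pearson correlation in plain Python. It shows only the weighted measure itself, not how MLC learns the weights; the function name `weighted_pearson`, the toy expression values and the hand-picked weights are all assumptions for illustration.

```python
import math


def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative per-sample weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted means of the two expression profiles.
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted covariance and variances (the common 1/total factor cancels).
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    if vx == 0 or vy == 0:
        raise ValueError("a profile is constant under these weights")
    return cov / math.sqrt(vx * vy)


# Hypothetical expression of two genes over five samples.
gene_a = [1.0, 2.0, 3.0, 9.0, 1.0]
gene_b = [2.0, 4.0, 6.0, 1.0, 8.0]

uniform = [1.0, 1.0, 1.0, 1.0, 1.0]      # ordinary Pearson correlation
informative = [1.0, 1.0, 1.0, 0.0, 0.0]  # hand-picked: keep samples 1 to 3 only

r_uniform = weighted_pearson(gene_a, gene_b, uniform)
r_weighted = weighted_pearson(gene_a, gene_b, informative)
print(r_uniform, r_weighted)
```

    In this toy case the two genes look anti-correlated over all samples but perfectly co-expressed on the first three, which is the kind of term-specific signal that sample weights are meant to expose.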

    Improving protein function prediction using protein sequence and GO-term similarities

    Motivation: Most automatic functional annotation methods assign Gene Ontology (GO) terms to proteins based on annotations of highly similar proteins. We advocate that proteins that are less similar are still informative. Also, despite their simplicity and structure, GO terms seem to be hard for computers to learn, in particular the Biological Process ontology, which has the most terms (>29 000). We propose to use Label-Space Dimensionality Reduction (LSDR) techniques to exploit the redundancy of GO terms and transform them into a more compact latent representation that is easier to predict. Results: We compare proteins using a sequence similarity profile (SSP) to a set of annotated training proteins. We introduce two new LSDR methods, one based on the structure of the GO, and one based on semantic similarity of terms. We show that these LSDR methods, as well as three existing ones, improve the Critical Assessment of Functional Annotation performance of several function prediction algorithms. Cross-validation experiments on Arabidopsis thaliana proteins pinpoint the superiority of our GO-aware LSDR over generic LSDR. Our experiments on A. thaliana proteins show that the SSP representation in combination with a kNN classifier outperforms state-of-the-art and baseline methods in terms of cross-validated F-measure. Availability and implementation: Source code for the experiments is available at https://github.com/stamakro/SSP-LSDR.

    Automatic gene function prediction in the 2020’s

    The current rate at which new DNA and protein sequences are being generated is too fast to experimentally discover the functions of those sequences, emphasizing the need for accurate Automatic Function Prediction (AFP) methods. AFP has been an active and growing research field for decades and has made considerable progress in that time. However, it is certainly not solved. In this paper, we describe challenges that the AFP field still has to overcome to increase its applicability. The challenges we consider are how to: (1) include condition-specific functional annotation, (2) predict functions for non-model species, (3) include new informative data sources, (4) deal with the biases of Gene Ontology (GO) annotations, and (5) maximally exploit the GO to obtain performance gains. We also provide recommendations for addressing those challenges, by adapting (1) the way we represent proteins and genes, (2) the way we represent gene functions, and (3) the algorithms that perform the prediction from gene to function. Together, we show that AFP is still a vibrant research area that can benefit from continuing advances in machine learning, with which AFP in the 2020s can again take a large step forward, reinforcing the power of computational biology.

    A novel computerized tool to stratify risk in carotid atherosclerosis using kinematic features of the arterial wall

    Valid characterization of carotid atherosclerosis (CA) is a crucial public health issue, which would limit the major risks posed by CA for both patient safety and state economies. This paper investigated the unexplored potential of kinematic features in assisting the diagnostic decision for CA in the framework of a computer-aided diagnosis (CAD) tool. To this end, 15 CAD schemes were designed and were fed with a wide variety of kinematic features of the atherosclerotic plaque and the arterial wall adjacent to the plaque for 56 patients from two different hospitals. The CAD schemes were benchmarked in terms of their ability to discriminate between symptomatic and asymptomatic patients, and the combination of the Fisher discriminant ratio, as a feature-selection strategy, and support vector machines, in the classification module, was revealed as the optimal motion-based CAD tool. The particular CAD tool was evaluated with several cross-validation strategies and yielded higher than 88% classification accuracy; the texture-based CAD performance in the same dataset was 80%. The incorporation of kinematic features of the arterial wall in CAD seems to have a particularly favorable impact on the performance of image-data-driven diagnosis for CA, which remains to be further elucidated in future prospective studies on large datasets. 2168-2194 © 2014 IEEE.
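
    The feature-selection step named in this abstract can be illustrated with a minimal sketch of a two-class Fisher discriminant ratio, taken here as the squared difference of the class means divided by the sum of the class variances. The abstract does not give the exact formula or variance estimator used, so the population variance below is an assumption, and the function names and toy feature values are hypothetical. The support vector machine stage is not shown.

```python
from statistics import mean, pvariance


def fisher_discriminant_ratio(class_a, class_b):
    """Two-class FDR of one feature: (mean_a - mean_b)^2 / (var_a + var_b)."""
    denom = pvariance(class_a) + pvariance(class_b)
    if denom == 0:
        raise ValueError("feature is constant within both classes")
    return (mean(class_a) - mean(class_b)) ** 2 / denom


def rank_features(samples_a, samples_b):
    """Score every feature column and return (indices by descending FDR, scores)."""
    n_features = len(samples_a[0])
    scores = [
        fisher_discriminant_ratio(
            [row[j] for row in samples_a],
            [row[j] for row in samples_b],
        )
        for j in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical data: rows are patients, columns are two kinematic features.
symptomatic = [[1.0, 5.0], [2.0, 3.0], [3.0, 4.0]]
asymptomatic = [[7.0, 4.0], [8.0, 5.0], [9.0, 3.0]]

order, scores = rank_features(symptomatic, asymptomatic)
print(order, scores)
```

    Features with a high ratio separate the two patient groups well relative to their spread, so a pipeline of this kind would keep the top-ranked columns and pass them on to the classifier.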