5 research outputs found

    Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT

    Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTMs). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, and this annotation is mainly performed through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by extracting PPIs along with their pairwise PTM from the literature, using deep learning trained on distantly supervised data to aid human curation. Method: We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and extract high-confidence predictions. Results and conclusion: The PPI-BioBERT-x10 model, evaluated on the test set, achieved a modest micro-averaged F1 of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high-quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million PTM-PPI predictions (546507 unique PTM-PPI triplets), and filtered [Formula: see text] (4584 unique) high-confidence predictions. Of the 5700, human evaluation on a small randomly sampled subset shows that precision drops to 33.7% despite confidence calibration, highlighting the challenges of generalisability beyond the test set. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%.
    In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
    Aparna Elangovan, Yuan Li, Douglas E. V. Pires, Melissa J. Davis, and Karin Verspoo
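The filtering idea described in the abstract above (keep a prediction only when the ensemble's average confidence is high and the variation across ensemble members is low) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_high_quality`, the input layout, and the thresholds `min_confidence` and `max_std` are assumptions chosen for illustration, and the abstract does not state which thresholds or variation measure were used.

```python
from statistics import mean, pstdev


def select_high_quality(ensemble_probs, min_confidence=0.9, max_std=0.05):
    """Keep predictions with high mean confidence and low spread across models.

    ensemble_probs[m][i][c] is the probability that model m assigns to class c
    for example i. The layout and both thresholds are hypothetical.
    Returns a list of (example_index, label, mean_confidence, std) tuples.
    """
    n_models = len(ensemble_probs)
    n_examples = len(ensemble_probs[0])
    selected = []
    for i in range(n_examples):
        n_classes = len(ensemble_probs[0][i])
        # Average each class probability over the ensemble members.
        avg = [
            mean([ensemble_probs[m][i][c] for m in range(n_models)])
            for c in range(n_classes)
        ]
        # The ensemble prediction is the class with the highest mean probability.
        label = max(range(n_classes), key=avg.__getitem__)
        # Spread of the members' confidence in that predicted class.
        spread = pstdev([ensemble_probs[m][i][label] for m in range(n_models)])
        if avg[label] >= min_confidence and spread <= max_std:
            selected.append((i, label, avg[label], spread))
    return selected
```

Raising `min_confidence` or lowering `max_std` trades recall for precision, which matches the abstract's description of retaining only a fraction of the test predictions in exchange for higher precision.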

    Variant type is associated with disease characteristics in SDHB, SDHC and SDHD-linked phaeochromocytoma-paraganglioma

    Background: Pathogenic germline variants in subunits of succinate dehydrogenase (SDHB, SDHC and SDHD) are broadly associated with disease subtypes of phaeochromocytoma-paraganglioma (PPGL) syndrome. Our objective was to investigate the role of variant type (ie, missense vs truncating) in determining tumour phenotype. Methods: Three independent datasets comprising 950 PPGL and head and neck paraganglioma (HNPGL) patients were analysed for associations of variant type with tumour type and age-related tumour risk. All patients were carriers of pathogenic germline variants in the SDHB, SDHC or SDHD genes. Results: Truncating SDH variants were significantly over-represented in clinical cases compared with missense variants, and carriers of SDHD truncating variants had a significantly higher risk for PPGL (p<0.001), an earlier age of diagnosis (p<0.0001) and a greater risk for PPGL/HNPGL comorbidity compared with carriers of missense variants. Carriers of SDHB truncating variants displayed a trend towards increased risk of PPGL, and all three SDH genes showed a trend towards over-representation of missense variants in HNPGL cases. Overall, variant types conferred PPGL risk in the (highest-to-lowest) sequence SDHB truncating, SDHB missense, SDHD truncating and SDHD missense, with the opposite pattern apparent for HNPGL (p<0.001). Conclusions: SDHD truncating variants represent a distinct group, with a clinical phenotype reminiscent of but not identical to SDHB. We propose that surveillance and counselling of carriers of SDHD should be tailored by variant type. The clinical impact of truncating SDHx variants is distinct from missense variants and suggests that residual SDH protein subunit function determines risk and site of disease.
    Diabetes mellitus: pathophysiological changes and therap