
    Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT

    Protein-protein interactions (PPIs) are critical to normal cellular function and are implicated in many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTMs). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, mainly through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by using deep learning trained on distantly supervised data to extract PPIs, along with their pairwise PTM types, from the literature, thereby aiding human curation. Methods: We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average-confidence approach with confidence variation to counteract the effects of class imbalance and extract high-confidence predictions. Results and conclusion: Evaluated on the test set, PPI-BioBERT-x10 achieved a modest micro-F1 of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence with low variation to identify high-quality predictions, and tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We then ran PPI-BioBERT-x10 over 18 million PubMed abstracts, extracted 1.6 million PTM-PPI predictions (546,507 unique PTM-PPI triplets), and filtered these down to 5,700 (4,584 unique) high-confidence predictions. Human evaluation of a small subset randomly sampled from the 5,700 shows that precision drops to 33.7% despite confidence calibration, highlighting the challenges of generalisability beyond the test set even with confidence calibration. We mitigate the problem by retaining only predictions associated with multiple papers, improving precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts. (Aparna Elangovan, Yuan Li, Douglas E. V. Pires, Melissa J. Davis, and Karin Verspoor)
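    The ensemble filtering idea described here is simple to state: average each prediction's confidence across ensemble members and reject predictions whose confidence varies too much between members. Below is a minimal sketch of that selection rule; the thresholds, array shapes, and toy values are assumptions for illustration, since the abstract does not give the paper's actual cut-offs.

```python
import numpy as np

def select_high_quality(probs, min_mean=0.95, max_std=0.05):
    """Select predictions by ensemble-mean confidence and cross-member variation.

    probs: array of shape (n_models, n_predictions) holding each ensemble
           member's confidence for its predicted class on each example.
    Returns a boolean mask over predictions.
    """
    mean_conf = probs.mean(axis=0)   # ensemble average confidence
    conf_std = probs.std(axis=0)     # variation across ensemble members
    # Keep only predictions the ensemble is both confident and consistent about.
    return (mean_conf >= min_mean) & (conf_std <= max_std)

# Toy example: 3 models, 4 predictions.
probs = np.array([[0.99, 0.80, 0.97, 0.60],
                  [0.98, 0.95, 0.70, 0.55],
                  [0.99, 0.85, 0.96, 0.65]])
print(select_high_quality(probs))  # only the first prediction passes
```

    Tightening min_mean and max_std trades recall for precision, which is the tuning step the abstract describes when it retains 19% of test predictions at 100% precision.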

    Protein Ontology: Enhancing and scaling up the representation of protein entities

    The Protein Ontology (PRO; http://purl.obolibrary.org/obo/pr) formally defines and describes taxon-specific and taxon-neutral protein-related entities in three major areas: proteins related by evolution; proteins produced from a given gene; and protein-containing complexes. PRO thus serves as a tool for referencing protein entities at any level of specificity. To enhance this ability, and to facilitate the comparison of such entities described in different resources, we developed a standardized representation of proteoforms using UniProtKB as a sequence reference and PSI-MOD as a post-translational modification reference. We illustrate its use in facilitating an alignment between PRO and Reactome protein entities. We also address issues of scalability, describing our first steps in the use of text mining to identify protein-related entities, the large-scale import of proteoform information from expert-curated resources, and our ability to dynamically generate PRO terms. Web views for individual terms are now more informative about closely related terms, including, for example, an interactive multiple sequence alignment. Finally, we describe recent improvements in semantic utility, with PRO now represented in OWL and as a SPARQL endpoint. These developments will further support the anticipated growth of PRO and facilitate the discoverability of, and allow aggregation of, data relating to protein entities.
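    With PRO exposed in OWL and behind a SPARQL endpoint, its terms can be queried programmatically. The sketch below uses the Python SPARQLWrapper library; the endpoint URL and the assumption that PRO classes carry standard rdfs:label annotations are illustrative guesses, not details given in the abstract.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint URL; consult the PRO site for the real one.
sparql = SPARQLWrapper("https://sparql.proconsortium.org/sparql")
sparql.setReturnFormat(JSON)

# Fetch labels of PRO terms, assuming rdfs:label annotations on
# classes in the PRO OBO namespace.
sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?term ?label
WHERE {
  ?term rdfs:label ?label .
  FILTER(STRSTARTS(STR(?term), "http://purl.obolibrary.org/obo/PR_"))
}
LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["term"]["value"], row["label"]["value"])
```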

    Text Mining for Protein-Protein Docking

    Scientific publications are a rich but underutilized source of structural and functional information on proteins and protein interactions. Although scientific literature is intended for a human audience, text mining makes it amenable to algorithmic processing. Text mining can focus on extracting information relevant to protein binding modes, providing specific residues that are likely to be at the binding site for a given pair of proteins. Knowledge of such residues is a powerful guide for the structural modeling of protein-protein complexes. This work combines and extends two well-established areas of research: the non-structural identification of protein-protein interactors, and the structure-based detection of functional (small-ligand) sites on proteins. Text-mining-based constraints for protein-protein docking are a unique research direction that had not been explored prior to this study. Although text mining by itself is unlikely to produce docked models, it is useful for scoring docking predictions. Our results show that, despite the presence of false positives, text mining significantly improves docking quality. To purge false positives from the mined residues, this work supplements basic text mining with enhanced techniques built on various language-processing tools, from simple dictionaries to WordNet (a generic word ontology), parse trees, word vectors, and deep recursive neural networks. The results significantly increase confidence in the generated docking constraints and provide guidelines for the future development of this modeling approach. With the rapid growth of the body of publicly available biomedical literature and newly evolving text-mining methodologies, the approach will become more powerful and better suited to the needs of the biomedical community.
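    The scoring step can be illustrated simply: given binding-site residues mined from the literature, re-rank docking models by how well each model's interface overlaps those residues. This is a toy sketch under assumed data structures; the study's actual scoring function is not specified in this abstract, and mined residue sets may contain false positives, as noted above.

```python
def rescore(models, mined_residues, weight=1.0):
    """Re-rank docking models by overlap between each model's interface
    residues and text-mined binding-site residues.

    models: list of (name, base_score, interface_residues) tuples, where
            interface_residues is a set of residue identifiers.
    mined_residues: set of residue identifiers extracted from the literature.
    """
    rescored = []
    for name, base_score, interface in models:
        overlap = len(interface & mined_residues)
        # Bonus proportional to how many mined residues the interface covers.
        rescored.append((base_score + weight * overlap, name))
    return sorted(rescored, reverse=True)

models = [
    ("model_1", 10.0, {"A:45", "A:46", "B:12"}),
    ("model_2", 11.0, {"A:90", "B:33", "B:34"}),
]
mined = {"A:45", "A:46", "B:35"}  # two true hits, one false positive
print(rescore(models, mined))     # model_1 overtakes model_2
```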

    Using R to develop a corpus of full-text journal articles

    https://doi.org/10.1177/01655515231171362
    Over the past two decades, databases and the tools to access them in a simple manner have become increasingly available, allowing historical and modern-day topics to be merged and studied. Throughout the recent COVID-19 pandemic, for example, many researchers have reflected on whether any lessons learned from the Spanish flu pandemic of 1918 could have been helpful in the present pandemic. This study developed the methodology needed to create a full-text corpus to answer such questions. Studies using text-mining applications rarely use full-text journal articles. This article presents a methodology for developing a full-text journal article corpus using the R fulltext package; using the proposed methodology, 2,743 full-text journal articles were obtained. The aim of this article is to provide a methodology and supplementary code for researchers to use the R fulltext package to curate a full-text journal corpus.
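    The article's workflow is built on the R fulltext package; as a rough analogue only (not the authors' R workflow), the sketch below retrieves open-access full texts through the Europe PMC REST API in Python. The query string, page size, and result handling are assumptions for illustration.

```python
import requests

BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest"

# Search for open-access articles matching an illustrative query.
resp = requests.get(f"{BASE}/search", params={
    "query": '"spanish flu" AND OPEN_ACCESS:y',
    "format": "json",
    "pageSize": 25,
})
resp.raise_for_status()

corpus = {}
for hit in resp.json()["resultList"]["result"]:
    pmcid = hit.get("pmcid")
    if not pmcid:
        continue
    # Fetch the full-text XML for each open-access article.
    xml = requests.get(f"{BASE}/{pmcid}/fullTextXML")
    if xml.ok:
        corpus[pmcid] = xml.text

print(f"Retrieved {len(corpus)} full-text articles")
```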

    A human kinase yeast array for the identification of kinases modulating phosphorylation-dependent protein-protein interactions

    Protein kinases play an important role in cellular signaling pathways, and their dysregulation leads to multiple diseases, making kinases prime drug targets. While more than 500 human protein kinases are known to collectively mediate phosphorylation of over 290,000 S/T/Y sites, the activities of only a minor, intensively studied subset have been characterized. To systematically address this discrepancy, we developed a human kinase array in Saccharomyces cerevisiae as a simple readout tool for systematically assessing kinase activities. For this array, we expressed 266 human kinases in four different S. cerevisiae strains and profiled ectopic growth as a proxy for kinase activity across 33 conditions. More than half of the kinases showed an activity-dependent phenotype across many conditions and in more than one strain. We then employed the kinase array to identify the kinase(s) that can modulate protein-protein interactions (PPIs). Two characterized, phosphorylation-dependent PPIs with unknown kinase-substrate relationships were analyzed in a phospho-yeast two-hybrid assay. The CK2α1 and SGK2 kinases can abrogate the interaction between the spliceosomal proteins AAR2 and PRPF8, and the NEK6 kinase was found to mediate the estrogen receptor (ERα) interaction with 14-3-3 proteins. The human kinase yeast array can thus be used for a variety of kinase-activity-dependent readouts.

    Unmasking The Language Of Science Through Textual Analyses On Biomedical Preprints And Published Papers

    Scientific communication is essential for science, as it enables the field to grow. This communication often takes written form, such as preprints and published papers. By analyzing these resources, we can obtain a high-level understanding of science and of how scientific trends shift over time. This thesis conducts multiple analyses using biomedical preprints and published papers. In Chapter 2, we explore the language contained within preprints and examine how this language changes through the peer-review process. We find that token differences between published papers and preprints are stylistic, suggesting that peer review results in modest textual changes. We also find that preprints are adopted quickly within the life science community and are eventually published. Chapter 3 investigates how biomedical terms and tokens change their meaning and usage through time. We show that multiple machine learning models can correct for the latent variation contained within biomedical text, and we provide the scientific community with a listing of over 43,000 potential change points. Tokens with notable change points, such as "sars" and "cas9", appear within our listing, providing some validation for our approach. In Chapter 4, we use the weak supervision paradigm to examine the possibility of speeding up labeling function generation for multiple biomedical relationship types. We found that the language used to describe a biomedical relationship is often distinct, leading to modest transferability; an exception to this trend is the Compound-binds-Gene and Gene-interacts-Gene relationship types.
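    Chapter 3's change-point analysis rests on a standard idea: train word embeddings per time slice, align consecutive slices into a common space, and flag tokens whose vectors drift sharply. A minimal sketch of that idea using gensim and an orthogonal Procrustes alignment follows; the corpus slices, hyperparameters, and the thesis's actual models are all assumptions here.

```python
import numpy as np
from gensim.models import Word2Vec
from scipy.linalg import orthogonal_procrustes

def drift_scores(slice_a, slice_b, min_count=5):
    """Train one Word2Vec model per time slice (lists of tokenized
    sentences), align slice_b's space onto slice_a's with orthogonal
    Procrustes, and return per-token cosine distances
    (higher = stronger semantic drift)."""
    m_a = Word2Vec(slice_a, vector_size=50, min_count=min_count, seed=0)
    m_b = Word2Vec(slice_b, vector_size=50, min_count=min_count, seed=0)
    shared = sorted(set(m_a.wv.index_to_key) & set(m_b.wv.index_to_key))
    A = np.array([m_a.wv[w] for w in shared])
    B = np.array([m_b.wv[w] for w in shared])
    R, _ = orthogonal_procrustes(B, A)   # rotate B's space onto A's
    B = B @ R
    cos = (A * B).sum(1) / (np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
    return dict(zip(shared, 1.0 - cos))

# Tokens whose score spikes between consecutive slices are change-point candidates.
```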

    Knowledge-driven entity recognition and disambiguation in biomedical text

    Entity recognition and disambiguation (ERD) for the biomedical domain are notoriously difficult problems due to the variety of entities and their often long names with many variations. Existing work focuses heavily on the molecular level in two ways. First, it targets scientific literature as the input text genre. Second, it targets single, highly specialized entity types such as chemicals, genes, and proteins. However, a wealth of biomedical information is also buried in the vast universe of Web content. In order to fully utilize all the information available, there is a need to tap into Web content as an additional input. Moreover, there is a need to cater to other entity types such as symptoms and risk factors, since Web content focuses on consumer health. The goal of this thesis is to investigate ERD methods that are applicable to all entity types in scientific literature as well as Web content. In addition, we focus on under-explored aspects of biomedical ERD: scalability, long noun phrases, and out-of-knowledge-base (OOKB) entities. This thesis makes four main contributions, all of which leverage knowledge in UMLS (Unified Medical Language System), the largest and most authoritative knowledge base (KB) of the biomedical domain. The first contribution is a fast dictionary-lookup method for entity recognition that maximizes throughput while balancing the loss of precision and recall. The second contribution is a semantic type classification method targeting common words in long noun phrases. We develop a custom set of semantic types to capture word usages; besides biomedical usage, these types also cover non-biomedical usage and the case of generic, non-informative usage. The third contribution is a fast heuristics method for entity disambiguation in MEDLINE abstracts, again maximizing throughput while maintaining accuracy. The fourth contribution is a corpus-driven entity disambiguation method that addresses OOKB entities. The method first captures the entities expressed in a corpus, in-KB and OOKB entities alike, as latent representations before performing entity disambiguation.
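    The first contribution, fast dictionary lookup, is easy to illustrate: scan the text for the longest spans that match a term dictionary. Below is a simplified greedy longest-match sketch in plain Python; the thesis's actual data structures and UMLS handling are certainly more elaborate, and the mini-dictionary here is invented for the example.

```python
def dictionary_lookup(tokens, dictionary, max_len=5):
    """Greedy longest-match entity recognition against a term dictionary.

    tokens: list of lower-cased word tokens.
    dictionary: maps multi-word term strings to concept identifiers.
    Returns non-overlapping (start, end, term, concept_id) spans.
    """
    spans, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first so longer terms win.
        for j in range(min(i + max_len, len(tokens)), i, -1):
            term = " ".join(tokens[i:j])
            if term in dictionary:
                match = (i, j, term, dictionary[term])
                break
        if match:
            spans.append(match)
            i = match[1]          # skip past the matched span
        else:
            i += 1
    return spans

# Invented mini-dictionary standing in for a UMLS-derived lexicon.
lexicon = {"heart attack": "C0027051", "aspirin": "C0004057"}
text = "patient on aspirin after heart attack".split()
print(dictionary_lookup(text, lexicon))
# [(2, 3, 'aspirin', 'C0004057'), (4, 6, 'heart attack', 'C0027051')]
```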