    An overview of the BioCreative 2012 Workshop Track III: interactive text mining task

    In many databases, biocuration centers on literature curation, which typically involves retrieving relevant articles, extracting information that will translate into annotations, and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed text mining tools from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have been few broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems addressing diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve the efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy compared with performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation, and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems.
The analysis of this survey highlights how important task completion is to the biocurators’ overall experience of a system, regardless of the system’s high score on design, learnability and usability. In addition, strategies to refine the annotation guidelines and system documentation, to adapt the tools to the needs and query types of the end user, and to evaluate performance in terms of efficiency, user interface, result export and traditional evaluation metrics were analyzed during this task. This analysis will help to plan for a more intensive study in BioCreative IV.

    Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT

    Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTMs). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, mainly through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by extracting PPIs along with their pairwise PTM from the literature, using distantly supervised training data and deep learning to aid human curation. Method: We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble average confidence approach with confidence variation to counteract the effects of class imbalance and extract high-confidence predictions. Results and conclusion: The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence and low variation to identify high-quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts, extracted 1.6 million (546,507 unique PTM-PPI triplets) PTM-PPI predictions, and filtered 5700 (4584 unique) high-confidence predictions. Of the 5700, human evaluation on a small randomly sampled subset shows that the precision drops to 33.7% despite confidence calibration, highlighting the challenges of generalisability beyond the test set even with confidence calibration. We circumvent the problem by including only predictions associated with multiple papers, improving the precision to 58.8%.
In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
Aparna Elangovan, Yuan Li, Douglas E. V. Pires, Melissa J. Davis, and Karin Verspoor

    Text Mining for Protein Docking

    The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, which can still be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small-ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied text mining to structural modeling of protein-protein complexes (protein docking). Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from Dockground (http://dockground.compbio.ku.edu). The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features) approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for ~25% of the complexes in the dataset.
The extracted constraints were incorporated into the docking protocol and tested on the Dockground unbound benchmark set, significantly increasing the docking success rate.
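The bag-of-words (features) representation mentioned above can be sketched as follows. The vocabulary and sample sentence are hypothetical, chosen only to illustrate the featurization step; in the study, vectors like these fed Support Vector Machine models that filtered irrelevant abstracts.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a text as a vector of term counts over a fixed vocabulary,
    the 'bag-of-words' featurization used to train relevance classifiers."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary of docking-relevant terms (illustrative only).
vocab = ["residue", "binding", "interface", "mutation"]
vector = bag_of_words("Binding of the mutant residue occurs at the binding interface", vocab)
# → [1, 2, 1, 0]: 'binding' appears twice, 'mutation' never
```

A real pipeline would also strip punctuation and stopwords before counting; the sketch keeps only the core counting step.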

    Protein Ontology: Enhancing and scaling up the representation of protein entities

    The Protein Ontology (PRO; http://purl.obolibrary.org/obo/pr) formally defines and describes taxon-specific and taxon-neutral protein-related entities in three major areas: proteins related by evolution; proteins produced from a given gene; and protein-containing complexes. PRO thus serves as a tool for referencing protein entities at any level of specificity. To enhance this ability, and to facilitate the comparison of such entities described in different resources, we developed a standardized representation of proteoforms using UniProtKB as a sequence reference and PSI-MOD as a post-translational modification reference. We illustrate its use in facilitating an alignment between PRO and Reactome protein entities. We also address issues of scalability, describing our first steps into the use of text mining to identify protein-related entities, the large-scale import of proteoform information from expert-curated resources, and our ability to dynamically generate PRO terms. Web views for individual terms are now more informative about closely related terms, including, for example, an interactive multiple sequence alignment. Finally, we describe recent improvements in semantic utility, with PRO now represented in OWL and as a SPARQL endpoint. These developments will further support the anticipated growth of PRO, facilitate discoverability of protein entities, and allow aggregation of data relating to them.

    Text Mining for Protein-Protein Docking

    Scientific publications are a rich but underutilized source of structural and functional information on proteins and protein interactions. Although scientific literature is intended for a human audience, text mining makes it amenable to algorithmic processing. It can focus on extracting information relevant to protein binding modes, providing specific residues that are likely to be at the binding site for a given pair of proteins. The knowledge of such residues is a powerful guide for the structural modeling of protein-protein complexes. This work combines and extends two well-established areas of research: the non-structural identification of protein-protein interactors, and structure-based detection of functional (small-ligand) sites on proteins. Text-mining-based constraints for protein-protein docking are a unique research direction, which had not been explored prior to this study. Although text mining by itself is unlikely to produce docked models, it is useful in scoring of the docking predictions. Our results show that despite the presence of false positives, text mining significantly improves the docking quality. To purge false positives from the mined residues, along with the basic text mining, this work explores enhanced text-mining techniques using various language processing tools, from simple dictionaries to WordNet (a generic word ontology), parse trees, word vectors and deep recursive neural networks. The results significantly increase confidence in the generated docking constraints and provide guidelines for the future development of this modeling approach. With the rapid growth of the body of publicly available biomedical literature, and new evolving text-mining methodologies, the approach will become more powerful and better suited to the needs of the biomedical community.

    A human kinase yeast array for the identification of kinases modulating phosphorylation-dependent protein-protein interactions

    Protein kinases play an important role in cellular signaling pathways, and their dysregulation leads to multiple diseases, making kinases prime drug targets. While more than 500 human protein kinases are known to collectively mediate phosphorylation of over 290,000 S/T/Y sites, the activities have been characterized only for a minor, intensively studied subset. To systematically address this discrepancy, we developed a human kinase array in Saccharomyces cerevisiae as a simple readout tool to systematically assess kinase activities. For this array, we expressed 266 human kinases in four different S. cerevisiae strains and profiled ectopic growth as a proxy for kinase activity across 33 conditions. More than half of the kinases showed an activity-dependent phenotype across many conditions and in more than one strain. We then employed the kinase array to identify the kinase(s) that can modulate protein-protein interactions (PPIs). Two characterized, phosphorylation-dependent PPIs with unknown kinase-substrate relationships were analyzed in a phospho-yeast two-hybrid assay. The CK2α1 and SGK2 kinases can abrogate the interaction between the spliceosomal proteins AAR2 and PRPF8, and the NEK6 kinase was found to mediate the estrogen receptor ERα interaction with 14-3-3 proteins. The human kinase yeast array can thus be used for a variety of kinase-activity-dependent readouts.

    Annotating Adverse Outcome Pathways to Organize Toxicological Information for Risk Assessment

    The Adverse Outcome Pathway (AOP) framework connects molecular perturbations with organism- and population-level endpoints used for regulatory decision-making by providing a conceptual construct of the mechanistic basis for toxicity. Development of an AOP typically begins with the adverse outcome, and intermediate effects connect the outcome with a molecular initiating event amenable to high-throughput toxicity testing (HTT). Publicly available controlled vocabularies were used to provide terminology supporting AOPs at all levels of biological organization. The resulting data model contains terms from 22 ontologies and controlled vocabularies annotating currently existing AOPs. The model provides the ability to attach evidence in support of the AOP, supports data aggregation, and promotes the development of AOP networks. In the long term, this structured description of the AOP will enable logical reasoning for hazard identification and for dose-response assessment. Case studies showcase how the model informs AOP development in the context of chemical risk assessment.