Large-scale event extraction from literature with multi-level gene normalization

Author: Ananiadou Sophia
Björne Jari
Ginter Filip
Hakala Kai
Kao Hung-Yu
Lu Zhiyong
Pyysalo Sampo
Salakoski Tapio
Van de Peer Yves
Van Landeghem Sofie
Wei Chih-Hsuan
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2013

Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.

Directory of Open Access Journals

FigShare

Adapting a relation extraction pipeline for the BioCreAtIvE II task

Author: Grover Claire
Haddow Barry
Klein Ewan
Matthews Michael
Nielsen Leif Arda
Tobin Richard
Wang Xinglong
Publication date: 01/01/2007

Overview of BioCreative II gene normalization

Author: Cohen Aaron M
Cohen K Bretonnel
Divoli Anna
Fluck Juliane
Fundel Katrin
Hakenberg Jörg
Hirschman Lynette
Hsu Chun-Nan
Krauthammer Michael
Lau William W
Leaman Robert
Liu Heng-hui
Liu Hongfang
Lu Zhiyong
Morgan Alexander A
Ruch Patrick
Schuemie Martijn
Sun Chengjie
Torres Rafael
Wang Xinglong
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%. Results: Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers. Conclusion: Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases.
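The evaluation quantities in this abstract can be reproduced with a few lines of Python. The sketch below is illustrative only and is not the BioCreative II evaluation code: the function names, the `min_votes` parameter, and the example identifiers are assumptions made for the example. It computes the balanced F-measure as the harmonic mean of precision and recall, checks the pooled recall from the reported counts (763 of 785), and shows one possible form of a "simple voting scheme" over system outputs.

```python
from collections import Counter


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def vote(system_outputs, min_votes):
    """Keep identifiers proposed by at least `min_votes` systems.

    `system_outputs` is a list of identifier collections, one per system.
    This is one assumed form of a simple voting scheme, not the one used
    by the task organizers.
    """
    counts = Counter(
        gene_id for output in system_outputs for gene_id in set(output)
    )
    return {gene_id for gene_id, n in counts.items() if n >= min_votes}


# Pooled "maximum recall" system, using the counts stated in the abstract.
pooled_recall = 763 / 785
pooled_f = f_measure(0.23, pooled_recall)

# Hypothetical outputs of three systems for a single abstract.
runs = [{"1017", "7157"}, {"7157", "672"}, {"7157", "1017", "5290"}]
consensus = vote(runs, min_votes=2)

print(round(pooled_recall, 2), round(pooled_f, 3), sorted(consensus))
```

With a precision of 0.23, the pooled system's F-measure stays low despite its high recall, which is why pooling is reported as a recall ceiling rather than as a competitive system.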

Springer - Publisher Connector

EUR Research Repository

Erasmus University Digital Repository

BioRED: A Comprehensive Biomedical Relation Extraction Dataset

Author: Arighi Cecilia N
Lai Po-Ting
Lu Zhiyong
Luo Ling
Wei Chih-Hsuan
Publication date: 08/04/2022

Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g., protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then we present BioRED, a first-of-its-kind biomedical RE corpus with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene-disease; chemical-chemical), on a set of 600 PubMed articles. Further, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including BERT-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a comprehensive dataset can successfully facilitate the development of more accurate, efficient, and robust RE systems for biomedicine.

arXiv.org e-Print Archive

Concept recognition for extracting protein interaction relations from biomedical text

Springer - Publisher Connector

AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature

Author: Beggs Alan H.
Bejerano Gill
Bernstein Jonathan A.
Birgmeier Johannes
Cooper David N.
Deisseroth Cole A.
Diekhans Mark E.
Guturu Harendra
Haeussler Maximilian
Jagadeesh Karthik A.
Ratner Alexander J.
Ré Christopher
Steinberg Ethan H.
Stenson Peter D.
Wenger Aaron M.
Publication venue: 'American Association for the Advancement of Science (AAAS)'
Publication date: 20/05/2020

The diagnosis of Mendelian disorders requires labor-intensive literature research. Trained clinicians can spend hours looking for the right publication(s) supporting a single gene that best explains a patient’s disease. AMELIE (Automatic Mendelian Literature Evaluation) greatly accelerates this process. AMELIE parses all 29 million PubMed abstracts and downloads and further parses hundreds of thousands of full-text articles in search of information supporting the causality and associated phenotypes of most published genetic variants. AMELIE then prioritizes patient candidate variants for their likelihood of explaining any patient’s given set of phenotypes. Diagnosis of singleton patients (without relatives’ exomes) is the most time-consuming scenario, and AMELIE ranked the causative gene at the very top for 66% of 215 diagnosed singleton Mendelian patients from the Deciphering Developmental Disorders project. Evaluating only the top 11 AMELIE-scored genes of 127 (median) candidate genes per patient resulted in a rapid diagnosis in more than 90% of cases. AMELIE-based evaluation of all cases was 3 to 19 times more efficient than hand-curated database–based approaches. We replicated these results on a retrospective cohort of clinical cases from Stanford Children’s Health and the Manton Center for Orphan Disease Research. An analysis web portal with our most recent update, programmatic interface, and code is available at AMELIE.stanford.edu
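The ranking figures above (causative gene ranked first for 66% of patients, diagnosis within the top 11 genes in more than 90% of cases) are top-k hit rates. The sketch below shows how such a rate could be computed from per-patient ranks. It is not AMELIE's code: the function name `top_k_rate` and the ranks in `ranks` are hypothetical and chosen only to illustrate the calculation.

```python
def top_k_rate(causative_ranks, k):
    """Fraction of patients whose causative gene is ranked within the top k.

    `causative_ranks` holds, for each patient, the 1-based rank that the
    scoring system assigned to the patient's causative gene.
    """
    hits = sum(1 for rank in causative_ranks if rank <= k)
    return hits / len(causative_ranks)


# Hypothetical ranks for ten patients (not data from the study).
ranks = [1, 1, 3, 12, 1, 7, 40, 2, 1, 1]

top1 = top_k_rate(ranks, 1)    # share of patients with the causative gene at rank 1
top11 = top_k_rate(ranks, 11)  # share of patients with it among the top 11

print(top1, top11)
```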

eScholarship - University of California

Mining physical protein-protein interactions from the literature

Author: Ding Shilin
Huang Minlie
Wang Hongning
Zhu Xiaoyan
Publication venue: BioMed Central
Publication date: 01/01/2008

Springer - Publisher Connector

The gene normalization task in BioCreative III

Author: A McCallum
AA Morgan
AP Dawid
AS Schwartz
B Settles
B Turner
C Lindberg
Cheng-Ju Kuo
Chih-Hsuan Wei
Chun-Nan Hsu
CN Hsu
D Hong-Jie
D Rebholz-Schuhmann
David Campos
DD Lewis
Dina Vishnyakova
E Agirre
F Leitner
F Rinaldi
Fabio Rinaldi
Feifan Liu
H Liu
Han-Cheol Cho
HD Carroll
Hong-Jie Dai
Hongfang Liu
Hung-Yu Kao
Illes Solt
J Hakenberg
J Whitechill
Jingchen Liu
Karin Verspoor
Kevin M Livingston
KG Dowell
L Hirschman
L Smith
M Ashburner
M Gerner
M Hall
M Huang
Manabu Torii
Martin Gerner
Martin Romacker
ME Colosimo
Minlie Huang
Naoaki Okazaki
P Donmez
P Ruch
P Smyth
P Welinder
Padmini Srinivasan
Patrick Ruch
R Leaman
R Snow
Richard Tzong-Han Tsai
S Bhattacharya
S Brin
S Matos
S Sarntivijai
Sanmitra Bhattacharya
Sergio Matos
Shashank Agarwal
T Kappeler
T Zhang
TH Haveliwala
VC Raykar
VS Sheng
W John Wilbur
X Wang
Z Lu
Zhiyong Lu
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
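The improvements reported for the composite system (12.4%, 21.8%, and 26.6%) are relative gains over the best single-team TAP-k scores on the gold standard. The short check below recomputes them from the scores quoted in the abstract. The variable and function names are chosen for this example only, and the sketch does not implement the TAP-k metric itself.

```python
# TAP-k scores on the gold standard, as quoted in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(new, old):
    """Relative gain of `new` over `old`, expressed as a percentage."""
    return (new - old) / old * 100


improvements = {
    k: round(relative_improvement(composite[k], best_team[k]), 1)
    for k in best_team
}

print(improvements)
```

The recomputed values agree with the percentages stated in the abstract for k = 5, 10, and 20.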

Springer - Publisher Connector

Large-scale event extraction from literature with multi-level gene normalization

Author: Ananiadou Sophia
Björne Jari
Ginter Filip
Hakala Kai
Kao Hung-Yu
Lu Zhiyong
Pyysalo Sampo
Salakoski Tapio
Van de Peer Yves
Van Landeghem Sofie
Wei Chih-Hsuan
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 28/10/2022