Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

Author: Harrison Pielke-Lombardo
Lawrence Hunter
Manuel R. Ciosici
Michael Bada
Michael Regan
Negacy Hailu
Sampo Pyysalo
William A Baumgartner Jr.
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 28/10/2022

As part of the BioNLP Open Shared Tasks 2019, the CRAFT Shared Tasks 2019 provides a platform to gauge the state of the art for three fundamental language processing tasks - dependency parse construction, coreference resolution, and ontology concept identification - over full-text biomedical articles. The structural annotation task requires the automatic generation of dependency parses for each sentence of an article given only the article text. The coreference resolution task focuses on linking coreferring base noun phrase mentions into chains using the symmetrical and transitive identity relation. The ontology concept annotation task involves the identification of concept mentions within text using the classes of ten distinct ontologies in the biomedical domain, both unmodified and augmented with extension classes. This paper provides an overview of each task, including descriptions of the data provided to participants and the evaluation metrics used, and discusses participant results relative to baseline performances for each of the three tasks.

UTUPub

NOBLE - Flexible concept recognition for large-scale biomedical natural language processing

Author: Chavan G
Corrigan J
Jacobson RS
Legowski E
Mitchell K
Tseytlin E
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 14/01/2016

Background: Natural language processing (NLP) applications are increasingly important in biomedical data analysis, knowledge engineering, and decision support. Concept recognition is an important component task for NLP pipelines, and can be either general-purpose or domain-specific. We describe a novel, flexible, and general-purpose concept recognition component for NLP pipelines, and compare its speed and accuracy against five commonly used alternatives on both a biological and a clinical corpus. NOBLE Coder implements a general algorithm for matching terms to concepts from an arbitrary vocabulary set. The system's matching options can be configured individually or in combination to yield specific system behavior for a variety of NLP tasks. The software is open source, freely available, and easily integrated into UIMA or GATE. We benchmarked speed and accuracy of the system against the CRAFT and ShARe corpora as reference standards and compared it to MMTx, MGrep, Concept Mapper, cTAKES Dictionary Lookup Annotator, and cTAKES Fast Dictionary Lookup Annotator. Results: We describe key advantages of the NOBLE Coder system and associated tools, including its greedy algorithm, configurable matching strategies, and multiple terminology input formats. These features provide unique functionality when compared with existing alternatives, including state-of-the-art systems. On two benchmarking tasks, NOBLE's performance exceeded commonly used alternatives, performing almost as well as the most advanced systems. Error analysis revealed differences in error profiles among systems. Conclusion: NOBLE Coder is comparable to other widely used concept recognition systems in terms of accuracy and speed. Advantages of NOBLE Coder include its interactive terminology builder tool, ease of configuration, and adaptability to various domains and tasks. NOBLE provides a term-to-concept matching system suitable for general concept recognition in biomedical NLP pipelines.

Springer - Publisher Connector

PubMed Central

D-Scholarship@Pitt

NOBLE – Flexible concept recognition for large-scale biomedical natural language processing

Author: A Smith
AR Aronson
C Friedman
C Friedman
C Funk
C-N Hsu
CD Manning
D Hanauer
D Tikk
Elizabeth Legowski
Eugene Tseytlin
G Divita
Girish Chavan
GK Savova
J Zheng
JJ Berman
JJ Berman
JJ Cimino
Julia Corrigan
K Liu
K Liu
K Liu
KB Cohen
Kevin Mitchell
M Bada
MA Tanenblatt
ML Zeng
NF de Keizer
NF de Keizer
NH Shah
PM Nadkarni
Rebecca S. Jacobson
RL Trask
RS Crowley
SA Stewart
T Mitsumori
TR Gruber
Z Lu
Publication venue: 'Springer Science and Business Media LLC'

Crossref

The structural and content aspects of abstracts versus bodies of full text journal articles are different

Author: Alias-i
B Settles
BM Szmrecsányi
C Blaschke
C Friedman
C Gasperin
C Gasperin
Christophe Roeder
D Jurafsky
D Klein
DP Corney
G Leroy
Helen L Johnson
I Goldin
J Lin
JG Caporaso
K Bretonnel Cohen
K Verspoor
Karin Verspoor
L Hirschman
L Tanabe
Lawrence E Hunter
M Krallinger
N Elhadad
PG Mutalik
PI Nakov
R Leaman
S Abney
S Agarwal
T McIntosh
W Chapman
W Chapman
W Hersh
WA Baumgartner Jr
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: An increase in work on the full text of journal articles and the growth of PubMedCentral have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that until now have been the subject of most biomedical text mining research. Results: We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies. Conclusions: Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly those found in parenthesized text, that are present in article bodies but not in article abstracts.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

LongtoNotes: OntoNotes with Longer Coreference Chains

Author: McCallum Andrew
Monath Nicholas
Sachan Mrinmaya
Shridhar Kumar
Stolfo Alessandro
Thirukovalluru Raghuveer
Zaheer Manzil
Publication date: 07/10/2022

OntoNotes has served as the most important benchmark for coreference resolution. However, for ease of annotation, several long documents in OntoNotes were split into smaller parts. In this work, we build a corpus of coreference-annotated documents of significantly longer length than what is currently available. We do so by providing an accurate, manually curated merging of annotations from documents that were split into multiple parts in the original OntoNotes annotation process. The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths, the longest of which are up to 8x the length of documents in OntoNotes, and 2x those in LitBank. We evaluate state-of-the-art neural coreference systems on this new corpus, analyze the relationships between model architectures/hyperparameters and document length on performance and efficiency of the models, and demonstrate areas of improvement in long-document coreference modeling revealed by our new corpus. Our data and code are available at: https://github.com/kumar-shridhar/LongtoNotes

arXiv.org e-Print Archive

Repository for Publications and Research Data

Entity recognition in the biomedical domain using a hybrid approach

Author: A Tharatipyakul
C Funk
CD Paice
CS Funk
D Campos
D Koning
D Maglott
D Szklarczyk
DM Jessop
E Pafilis
E Tseytlin
F Rinaldi
F Rinaldi
F Rinaldi
F Rinaldi
G Sheikhshab
K Degtyarenko
K Eilbeck
K Verspoor
K Verspoor
M Ashburner
M Bada
M Basaldella
M Basaldella
MF Porter
N Pudota
P Lopez
PD Turney
R Core Team
R Leaman
R Leaman
S Aubin
S Eltyeb
S Tulkens
SA Akhondi
T Groza
T Munkhdalai
U Leser
Y Sasaki
Publication venue: 'Springer Science and Business Media LLC'

Crossref