28,620 research outputs found
BELB: a Biomedical Entity Linking Benchmark
Biomedical entity linking (BEL) is the task of grounding entity mentions to a
knowledge base. It plays a vital role in information extraction pipelines for
the life sciences literature. We review recent work in the field and find that,
as the task is absent from existing benchmarks for biomedical text mining,
different studies adopt different experimental setups making comparisons based
on published numbers problematic. Furthermore, neural systems are tested
primarily on instances linked to the broad coverage knowledge base UMLS,
leaving their performance to more specialized ones, e.g. genes or variants,
understudied. We therefore developed BELB, a Biomedical Entity Linking
Benchmark, providing access in a unified format to 11 corpora linked to 7
knowledge bases and spanning six entity types: gene, disease, chemical,
species, cell line and variant. BELB greatly reduces preprocessing overhead in
testing BEL systems on multiple corpora offering a standardized testbed for
reproducible experiments. Using BELB we perform an extensive evaluation of six
rule-based entity-specific systems and three recent neural approaches
leveraging pre-trained language models. Our results reveal a mixed picture
showing that neural approaches fail to perform consistently across entity
types, highlighting the need of further studies towards entity-agnostic models
Knowledge will Propel Machine Understanding of Content: Extrapolating from Current Examples
Machine Learning has been a big success story during the AI resurgence. One
particular stand out success relates to learning from a massive amount of data.
In spite of early assertions of the unreasonable effectiveness of data, there
is increasing recognition for utilizing knowledge whenever it is available or
can be created purposefully. In this paper, we discuss the indispensable role
of knowledge for deeper understanding of content where (i) large amounts of
training data are unavailable, (ii) the objects to be recognized are complex,
(e.g., implicit entities and highly subjective content), and (iii) applications
need to use complementary or related data in multiple modalities/media. What
brings us to the cusp of rapid progress is our ability to (a) create relevant
and reliable knowledge and (b) carefully exploit knowledge to enhance ML/NLP
techniques. Using diverse examples, we seek to foretell unprecedented progress
in our ability for deeper understanding and exploitation of multimodal data and
continued incorporation of knowledge in learning techniques.Comment: Pre-print of the paper accepted at 2017 IEEE/WIC/ACM International
Conference on Web Intelligence (WI). arXiv admin note: substantial text
overlap with arXiv:1610.0770
HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools
With the exponential growth of the life science literature, biomedical text
mining (BTM) has become an essential technology for accelerating the extraction
of insights from publications. Identifying named entities (e.g., diseases,
drugs, or genes) in texts and their linkage to reference knowledge bases are
crucial steps in BTM pipelines to enable information aggregation from different
documents. However, tools for these two steps are rarely applied in the same
context in which they were developed. Instead, they are applied in the wild,
i.e., on application-dependent text collections different from those used for
the tools' training, varying, e.g., in focus, genre, style, and text type. This
raises the question of whether the reported performance of BTM tools can be
trusted for downstream applications. Here, we report on the results of a
carefully designed cross-corpus benchmark for named entity extraction, where
tools were applied systematically to corpora not used during their training.
Based on a survey of 28 published systems, we selected five for an in-depth
analysis on three publicly available corpora encompassing four different entity
types. Comparison between tools results in a mixed picture and shows that, in a
cross-corpus setting, the performance is significantly lower than the one
reported in an in-corpus setting. HunFlair2 showed the best performance on
average, being closely followed by PubTator. Our results indicate that users of
BTM tools should expect diminishing performances when applying them in the wild
compared to original publications and show that further research is necessary
to make BTM tools more robust
Extraction of chemical-induced diseases using prior knowledge and textual information
We describe our approach to the chemical-disease relation (CDR) task in the BioCreative V challenge. The CDR task consists of two subtasks: Automatic disease-named entity recognition and normalization (DNER), and extraction of chemical-induced diseases (CIDs) from Medline abstracts. For the DNER subtask, we used our concept recognition tool Peregrine, in combination with several optimization steps. For the CID subtask, our system, which we named RELigator, was trained on a rich feature set, comprising features derived from a graph database containing prior knowledge about chemicals and diseases, and linguistic and statistical features derived from the abstracts in the CDR training corpus. We describe the systems that were developed and present evaluation results for both subtasks on the CDR test set. For DNER, our Peregrine system reached an F-score of 0.757. For CID, the system achieved an F-score of 0.526, which ranked second among 18 participating teams. Several post-challenge modifications of the systems resulted in substantially improved F-scores (0.828 for DNER and 0.602 for CID)
A Survey of Location Prediction on Twitter
Locations, e.g., countries, states, cities, and point-of-interests, are
central to news, emergency events, and people's daily lives. Automatic
identification of locations associated with or mentioned in documents has been
explored for decades. As one of the most popular online social network
platforms, Twitter has attracted a large number of users who send millions of
tweets on daily basis. Due to the world-wide coverage of its users and
real-time freshness of tweets, location prediction on Twitter has gained
significant attention in recent years. Research efforts are spent on dealing
with new challenges and opportunities brought by the noisy, short, and
context-rich nature of tweets. In this survey, we aim at offering an overall
picture of location prediction on Twitter. Specifically, we concentrate on the
prediction of user home locations, tweet locations, and mentioned locations. We
first define the three tasks and review the evaluation metrics. By summarizing
Twitter network, tweet content, and tweet context as potential inputs, we then
structurally highlight how the problems depend on these inputs. Each dependency
is illustrated by a comprehensive review of the corresponding strategies
adopted in state-of-the-art approaches. In addition, we also briefly review two
related problems, i.e., semantic location prediction and point-of-interest
recommendation. Finally, we list future research directions.Comment: Accepted to TKDE. 30 pages, 1 figur
Biomedical Entity Recognition by Detection and Matching
Biomedical named entity recognition (BNER) serves as the foundation for
numerous biomedical text mining tasks. Unlike general NER, BNER require a
comprehensive grasp of the domain, and incorporating external knowledge beyond
training data poses a significant challenge. In this study, we propose a novel
BNER framework called DMNER. By leveraging existing entity representation
models SAPBERT, we tackle BNER as a two-step process: entity boundary detection
and biomedical entity matching. DMNER exhibits applicability across multiple
NER scenarios: 1) In supervised NER, we observe that DMNER effectively
rectifies the output of baseline NER models, thereby further enhancing
performance. 2) In distantly supervised NER, combining MRC and AutoNER as span
boundary detectors enables DMNER to achieve satisfactory results. 3) For
training NER by merging multiple datasets, we adopt a framework similar to
DS-NER but additionally leverage ChatGPT to obtain high-quality phrases in the
training. Through extensive experiments conducted on 10 benchmark datasets, we
demonstrate the versatility and effectiveness of DMNER.Comment: 9 pages content, 2 pages appendi
Extraction of chemical-induced diseases using prior knowledge and textual information
We describe our approach to the chemical–disease relation (CDR) task in the BioCreative V challenge. The CDR task consists of two subtasks: automatic disease-named entity recognition and normalization (DNER), and extraction of chemical-induced diseases (CIDs) from Medline abstracts. For the DNER subtask, we used our concept recognition tool Peregrine, in combination with several optimization steps. For the CID subtask, our system, which we named RELigator, was trained on a rich feature set, comprising features derived from a graph database containing prior knowledge about chemicals and diseases, and linguistic and statistical features derived from the abstracts in the CDR training corpus. We describe the systems that were developed and present evaluation results for both subtasks on the CDR test set. For DNER, our Peregrine system reached an F-score of 0.757. For CID, the system achieved an F-score of 0.526, which ranked second among 18 participating teams. Several post-challenge modifications of the systems resulted in substantially improved F-scores (0.828 for DNER and 0.602 for CID). RELigator is available as a web service at http://biosemantics.org/index.php/software/religator
Recommended from our members
Lexical patterns, features and knowledge resources for coreference resolution in clinical notes
Generation of entity coreference chains provides a means to extract linked narrative events from clinical notes, but despite being a well-researched topic in natural language processing, general- purpose coreference tools perform poorly on clinical texts. This paper presents a knowledge-centric and pattern-based approach to resolving coreference across a wide variety of clinical records comprising discharge summaries, progress notes, pathology, radiology and surgical reports from two corpora (Ontology Development and Information Extraction (ODIE) and i2b2/VA). In addition, a method for generating coreference chains using progressively pruned linked lists is demonstrated that reduces the search space and facilitates evaluation by a number of metrics. Independent evaluation results show an F-measure for each corpus of 79.2% and 87.5%, respectively, which offers performance at least as good as human annotators, greatly increased performance over general- purpose tools, and improvement on previously reported clinical coreference systems. The system uses a number of open-source components that are available to download
- …