Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking
Discovering entity mentions that are out of a Knowledge Base (KB) from texts
plays a critical role in KB maintenance, but has not yet been fully explored.
The current methods are mostly limited to the simple threshold-based approach
and feature-based classification; the datasets for evaluation are relatively
rare. In this work, we propose BLINKout, a new BERT-based Entity Linking (EL)
method which can identify mentions that do not have a corresponding KB entity
by matching them to a special NIL entity. To this end, we integrate novel
techniques including NIL representation, NIL classification, and synonym
enhancement. We also propose Ontology Pruning and Versioning strategies to
construct out-of-KB mentions from normal, in-KB EL datasets. Results on four
datasets of clinical notes and publications show that BLINKout outperforms
existing methods in detecting out-of-KB mentions for the medical ontologies UMLS and
SNOMED CT.
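For context, the "simple threshold-based approach" that the abstract contrasts with BLINKout can be sketched as follows. This is a minimal illustration of that baseline idea only, not BLINKout itself; the function name, the `NIL` label, the entity ids, and the default threshold are all assumptions made for the example.

```python
from typing import Dict

NIL = "NIL"  # hypothetical label meaning "no matching KB entity"


def link_with_threshold(scores: Dict[str, float], threshold: float = 0.5) -> str:
    """Return the best-scoring entity id, or NIL if no score reaches the threshold.

    `scores` maps candidate entity ids to similarity scores from any
    entity linking model (the scores used below are made up).
    """
    if not scores:
        return NIL
    best_id, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_id if best_score >= threshold else NIL


# A confident match is linked; a weak match is treated as out-of-KB.
in_kb = link_with_threshold({"C001": 0.9, "C002": 0.3})
out_of_kb = link_with_threshold({"C001": 0.2})
```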
Entity Linking for the Biomedical Domain
Entity linking is the process of detecting mentions of different concepts in text documents and linking them to canonical entities in a target lexicon.
However, one of the biggest issues in entity linking is ambiguity in entity names. Many text mining tools have yet to address it: different names can denote the same thing, and the same name can denote different things from one mention to the next. For instance, search engines that rely on heuristic string matching frequently return irrelevant results because they cannot satisfactorily resolve ambiguity.
Thus, resolving named entity ambiguity is a crucial step in entity linking. To address ambiguity, this work proposes a heuristic method for entity recognition and entity linking over a biomedical knowledge graph, based on the semantic similarity of entities in the graph.
A conventional entity linking (EL) system pipeline consists of named entity recognition (NER), relation extraction (RE), and relationship linking (RL). Accuracy is the evaluation metric used in this thesis.
For each identified relation or entity, the solution therefore comprises two phases: identifying the correct one, and matching it to its corresponding Concept Unique Identifier (CUI) in the knowledge base. Because KBs contain a substantial number of relations and entities, each with only one natural language label, the second phase depends directly on the accuracy of the first. The framework developed in this thesis extracts relations and entities from text and maps them to the associated CUI in the UMLS knowledge base. The approach derives a new representation of the knowledge base that lends itself to easy comparison. To select the best candidates, we build a graph of relations and rank candidates by shortest-path distance.
We test the proposed approach on two well-known biomedical benchmarks and show that it outperforms the search engine's top result, with around 4% higher accuracy. With respect to fine-tuning, we observe that entity linking has subjective aspects, so modifications may be required depending on the task at hand. The performance of the framework is evaluated using a Python implementation.
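The candidate selection step described in this abstract (build a graph of relations, then rank by shortest-path distance) can be sketched as below. This is an illustrative sketch under stated assumptions, not the thesis implementation: the graph is an unweighted adjacency dict, distances are computed by breadth-first search, and the node names, `context_entity` argument, and function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_path_length(graph: Dict[str, List[str]], source: str, target: str) -> Optional[int]:
    """Number of edges on a shortest path from source to target, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def rank_candidates(graph: Dict[str, List[str]], context_entity: str, candidates: List[str]) -> List[str]:
    """Order candidates by graph distance to an already-linked context entity.

    Unreachable candidates are placed last.
    """
    def sort_key(candidate: str):
        dist = shortest_path_length(graph, context_entity, candidate)
        return (dist is None, dist if dist is not None else 0)

    return sorted(candidates, key=sort_key)


# Toy relation graph with made-up node names.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
ranking = rank_candidates(toy_graph, "A", ["C", "D", "B"])
```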
Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement
Mentions of new concepts appear regularly in texts and require automated
approaches to harvest and place them into Knowledge Bases (KB), e.g.,
ontologies and taxonomies. Existing datasets suffer from three issues: (i)
mostly assuming that a new concept is pre-discovered and cannot support
out-of-KB mention discovery; (ii) only using the concept label as the input
along with the KB and thus lacking the contexts of a concept label; and (iii)
mostly focusing on concept placement w.r.t a taxonomy of atomic concepts,
instead of complex concepts, i.e., with logical operators. To address these
issues, we propose a new benchmark, adapting the MedMentions dataset (PubMed
abstracts) with SNOMED CT versions in 2014 and 2017 under the Diseases
sub-category and the broader categories of Clinical finding, Procedure, and
Pharmaceutical / biologic product. We demonstrate how to use the dataset to
evaluate out-of-KB mention discovery and concept placement, adapting recent
Large Language Model based methods.
Comment: 5 pages, 1 figure, accepted for CIKM 2023. The dataset, data
construction scripts, and baseline implementation are available at
https://zenodo.org/record/8228005 (Zenodo) and
https://github.com/KRR-Oxford/OET (GitHub).
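One plausible reading of how two ontology versions yield out-of-KB mentions is sketched below: a mention annotated with a concept that exists only in the newer version is treated as out-of-KB relative to the older one. This is an assumption made for illustration, not the dataset's actual construction script (which is available at the links above); the concept ids, mention strings, and function name are hypothetical.

```python
from typing import List, Set, Tuple

NIL = "NIL"  # hypothetical label for an out-of-KB mention


def relabel_against_old_version(
    annotations: List[Tuple[str, str]], old_version_ids: Set[str]
) -> List[Tuple[str, str]]:
    """Relabel mentions whose concept is missing from the older ontology version.

    `annotations` holds (mention text, concept id) pairs annotated against
    the newer version; `old_version_ids` is the set of ids in the older one.
    """
    return [
        (mention, concept_id if concept_id in old_version_ids else NIL)
        for mention, concept_id in annotations
    ]


# Made-up ids: "C9" exists only in the newer version.
old_ids = {"C1", "C2"}
relabelled = relabel_against_old_version([("fever", "C1"), ("new syndrome", "C9")], old_ids)
```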
Bi-Encoders based Species Normalization -- Pairwise Sentence Learning to Rank
Motivation: Biomedical named-entity normalization involves connecting
biomedical entities with distinct database identifiers in order to facilitate
data integration across various fields of biology. Existing systems for
biomedical named entity normalization heavily rely on dictionaries, manually
created rules, and high-quality representative features such as lexical or
morphological characteristics. However, recent research has investigated the
use of neural network-based models to reduce dependence on dictionaries,
manually crafted rules, and features. Despite these advancements, the
performance of these models is still limited due to the lack of sufficiently
large training datasets. These models have a tendency to overfit small training
corpora and exhibit poor generalization when faced with previously unseen
entities, necessitating the redesign of rules and features. Contribution: We
present a novel deep learning approach for named entity normalization, treating
it as a pairwise learning-to-rank problem. Our method uses the widely used
information retrieval algorithm Best Matching 25 (BM25) to generate candidate
concepts, followed by Bidirectional Encoder Representations from
Transformers (BERT) to re-rank the candidate list. Notably, our approach
eliminates the need for feature-engineering or rule creation. We conduct
experiments on species entity types and evaluate our method against
state-of-the-art techniques using LINNAEUS and S800 biomedical corpora. Our
proposed approach surpasses existing methods in linking entities to the NCBI
taxonomy. To the best of our knowledge, there is no existing neural
network-based approach for species normalization in the literature.
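The candidate generation stage of this abstract (BM25 retrieval before neural re-ranking) can be sketched in plain Python as below. This is a minimal illustration, not the authors' code: it uses whitespace tokenisation, the non-negative IDF variant with `+ 1` inside the logarithm, and assumed default parameters `k1=1.5` and `b=0.75`; the re-ranking step is omitted and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document (here, a concept name) against the query with BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


def top_candidates(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Return the k highest-scoring concept names; a re-ranker would consume this list."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:k]]


species_names = ["Homo sapiens", "Mus musculus", "Rattus norvegicus"]
best = top_candidates("homo sapiens", species_names, k=1)
```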
Incorporating Ontological Information in Biomedical Entity Linking of Phrases in Clinical Text
Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. Translational application of BEL on clinical notes has enormous potential for augmenting discretely captured data in electronic health records, but the existing paradigm for evaluating BEL systems developed in academia is not well aligned with real-world use cases. In this work, we demonstrate a proof of concept for incorporating ontological similarity into the training and evaluation of BEL systems to begin to rectify this misalignment. This thesis has two primary components: 1) a comprehensive literature review and 2) a methodology section that proposes novel BEL techniques. In the literature review component, I survey the progression of BEL from its inception in the late 1980s to present-day state-of-the-art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, and outline the technical components that comprise BEL systems. In the methodology component, I describe my experiments incorporating ontological information into training a BERT encoder for entity linking.
GraphPrompt: Biomedical Entity Normalization Using Graph-based Prompt Templates
Biomedical entity normalization unifies the language across biomedical
experiments and studies, and further enables us to obtain a holistic view of
life sciences. Current approaches mainly study the normalization of more
standardized entities such as diseases and drugs, while disregarding the more
ambiguous but crucial entities such as pathways, functions and cell types,
hindering their real-world applications. To achieve biomedical entity
normalization on these under-explored entities, we first introduce an
expert-curated dataset OBO-syn encompassing 70 different types of entities and
2 million curated entity-synonym pairs. To utilize the unique graph structure
in this dataset, we propose GraphPrompt, a prompt-based learning approach that
creates prompt templates according to the graphs. GraphPrompt obtained 41.0%
and 29.9% improvement on zero-shot and few-shot settings respectively,
indicating the effectiveness of these graph-based prompt templates. We envision
that our method GraphPrompt and OBO-syn dataset can be broadly applied to
graph-based NLP tasks, and serve as the basis for analyzing diverse and
accumulating biomedical data.
Comment: 12 pages
Pattern-Based Acquisition of Scientific Entities from Scholarly Article Titles
We describe a rule-based approach for the automatic acquisition of salient scientific entities from Computational Linguistics (CL) scholarly article titles. Two observations motivated the approach: (i) authors note salient aspects of an article’s contribution in its title; and (ii) the salient terms follow pattern regularities that can be expressed as a set of rules. Only those lexico-syntactic patterns were selected that were easily recognizable, occurred frequently, and positionally indicated a scientific entity type. The rules were developed on a collection of 50,237 CL titles covering all articles in the ACL Anthology. In total, 19,799 research problems, 18,111 solutions, 20,033 resources, 1,059 languages, 6,878 tools, and 21,687 methods were extracted at an average precision of 75%.
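A single lexico-syntactic rule of the kind this abstract describes might look like the sketch below. The abstract does not list the actual rules, so the "X for Y" pattern, the mapping of X to a solution and Y to a research problem, and the function name are all assumptions made for illustration.

```python
import re
from typing import Dict

# Hypothetical rule: "<solution> for <research problem>".
FOR_PATTERN = re.compile(r"^(?P<solution>.+?)\s+for\s+(?P<problem>.+)$", re.IGNORECASE)


def extract_entities(title: str) -> Dict[str, str]:
    """Apply the single illustrative rule; return an empty dict when it does not match."""
    match = FOR_PATTERN.match(title.strip())
    if match is None:
        return {}
    return {
        "solution": match.group("solution"),
        "research_problem": match.group("problem"),
    }


extracted = extract_entities("Entity Linking for the Biomedical Domain")
```

A real rule set would need many such patterns plus positional constraints, since a lone "for" rule will misfire on titles where the preposition plays a different role.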