2,892 research outputs found
Large Language Models and Knowledge Graphs: Opportunities and Challenges
Large Language Models (LLMs) have taken Knowledge Representation -- and the
world -- by storm. This inflection point marks a shift from explicit knowledge
representation to a renewed focus on the hybrid representation of both explicit
knowledge and parametric knowledge. In this position paper, we will discuss
some of the common debate points within the community on LLMs (parametric
knowledge) and Knowledge Graphs (explicit knowledge) and speculate on
opportunities and visions that the renewed focus brings, as well as related
research topics and challenges.
Comment: 30 pages
A Study on Extracting Clinical Information from Unstructured Text for Pharmacovigilance
Thesis (Ph.D.) -- Seoul National University Graduate School : Graduate School of Convergence Science and Technology, Department of Applied Bioengineering, 2023. 2. 이형기.
Pharmacovigilance is a scientific activity to detect, evaluate, and understand the occurrence of adverse drug events or other problems related to drug safety. However, concerns have been raised over the quality of drug safety information for pharmacovigilance, and there is also a need to secure new data sources for acquiring drug safety information. Meanwhile, the rise of pre-trained language models based on the transformer architecture has accelerated the application of natural language processing (NLP) techniques in diverse domains. In this context, I define two problems in pharmacovigilance as NLP tasks and provide baseline models for the defined tasks: 1) extracting comprehensive drug safety information from adverse drug event narratives reported through a spontaneous reporting system (SRS), and 2) extracting drug-food interaction information from abstracts of biomedical articles. I developed annotation guidelines and performed manual annotation, demonstrating that strong NLP models can be trained to extract clinical information from unstructured free-texts by fine-tuning transformer-based language models on a high-quality annotated corpus. Finally, I discuss issues to consider when developing annotation guidelines for extracting clinical information related to pharmacovigilance. The annotated corpora and the NLP models in this dissertation can streamline pharmacovigilance activities by enhancing the data quality of reported drug safety information and expanding the data sources.
Chapter 1
1.1 Contributions of this dissertation
1.2 Overview of this dissertation
1.3 Other works
Chapter 2
2.1 Pharmacovigilance
2.2 Biomedical NLP for pharmacovigilance
2.2.1 Pre-trained language models
2.2.2 Corpora to extract clinical information for pharmacovigilance
Chapter 3
3.1 Motivation
3.2 Proposed Methods
3.2.1 Data source and text corpus
3.2.2 Annotation of ADE narratives
3.2.3 Quality control of annotation
3.2.4 Pretraining KAERS-BERT
3.2.6 Named entity recognition
3.2.7 Entity label classification and sentence extraction
3.2.8 Relation extraction
3.2.9 Model evaluation
3.2.10 Ablation experiment
3.3 Results
3.3.1 Annotated ICSRs
3.3.2 Corpus statistics
3.3.3 Performance of NLP models to extract drug safety information
3.3.4 Ablation experiment
3.4 Discussion
3.5 Conclusion
Chapter 4
4.1 Motivation
4.2 Proposed Methods
4.2.1 Data source
4.2.2 Annotation
4.2.3 Quality control of annotation
4.2.4 Baseline model development
4.3 Results
4.3.1 Corpus statistics
4.3.2 Annotation Quality
4.3.3 Performance of baseline models
4.3.4 Qualitative error analysis
4.4 Discussion
4.5 Conclusion
Chapter 5
5.1 Issues around defining a word entity
5.2 Issues around defining a relation between word entities
5.3 Issues around defining entity labels
5.4 Issues around selecting and preprocessing annotated documents
Chapter 6
6.1 Dissertation summary
6.2 Limitation and future works
6.2.1 Development of end-to-end information extraction models from free-texts to database based on existing structured information
6.2.2 Application of in-context learning framework in clinical information extraction
Chapter 7
7.1 Annotation Guideline for "Extraction of Comprehensive Drug Safety Information from Adverse Event Narratives Reported through Spontaneous Reporting System"
7.2 Annotation Guideline for "Extraction of Drug-Food Interactions from the Abstracts of Biomedical Articles"
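The NER component described in Chapter 3 fine-tunes transformer encoders such as KAERS-BERT on the annotated corpus. Independent of the particular model, such pipelines typically end by decoding per-token BIO labels into entity spans; a minimal, self-contained sketch of that decoding step (the ADE/DRUG tag set and the example sentence are illustrative, not the dissertation's actual annotation schema):

```python
def bio_decode(tokens, tags):
    """Convert per-token BIO tags into (entity_type, text) spans."""
    entities, current = [], None  # current = (type, [tokens])
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((current[0], " ".join(current[1])))
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(tok)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close any open entity; the stray token is simply dropped.
            if current:
                entities.append((current[0], " ".join(current[1])))
            current = None
    if current:
        entities.append((current[0], " ".join(current[1])))
    return entities

# Toy example (hypothetical tags, not from the dissertation's corpus).
tokens = "Patient developed severe skin rash after taking aspirin".split()
tags = ["O", "O", "O", "B-ADE", "I-ADE", "O", "O", "B-DRUG"]
entities = bio_decode(tokens, tags)
```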
Leveraging literals for knowledge graph embeddings
Knowledge Graphs (KGs) represent structured facts composed of entities and the relations between them. To maximize the efficiency of KG applications, it is advantageous to transform KGs into a low-dimensional vector space. KGs follow the Open World Assumption (OWA), i.e., missing information is regarded as potentially possible, which often limits their use in real-world application scenarios. Link Prediction (LP) for the completion of KGs is therefore of great importance. LP can be performed in two different modes, transductive and inductive, where the former requires that all entities in the test data be present in the training data, while the latter also allows previously unseen entities in the test data. This thesis investigates the use of literals in transductive and inductive LP, since KGs contain numerous numerical and textual literals that carry essential semantics. Dedicated benchmark datasets are introduced to evaluate these LP methods.
In particular, a novel KG embedding (KGE) method, RAILD, is proposed, which leverages textual literals together with contextual graph information for LP. RAILD aims to close the existing research gap of learning embeddings for relations unseen during training. To this end, an architecture is proposed that combines language models (LMs) with network embeddings. Powerful pre-trained LMs such as BERT are fine-tuned for LP using the textual descriptions of entities and relations. In addition, a new algorithm, WeiDNeR, is introduced to generate a relation network, which serves to learn graph-based relation embeddings with a network embedding model. The vector representations of these relations are combined for LP. Furthermore, another novel embedding model, LitKGE, is presented, which uses numerical literals for transductive LP. It aims to generate numerical features for entities through graph traversal. For this purpose, a further algorithm, WeiDNeR_Extended, is introduced, which creates a network of object and datatype properties. Numerical entity features are then generated from the property paths extracted from this network.
Furthermore, the use of a multilingual LM to encode entity descriptions in different natural languages for LP is investigated. For the evaluation of the KGE models, the benchmark datasets LiterallyWikidata and Wikidata68K were created. The promising results obtained with the proposed models open up interesting questions for future research on KGEs and their downstream applications.
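RAILD and LitKGE build on the standard KGE setup in which a scoring function ranks candidate triples for link prediction. Their exact scoring functions are not given in the abstract; purely as an illustration of the idea, here is the classic TransE score, under which a true triple should score higher than a corrupted one (the 3-dimensional embedding values below are toy numbers):

```python
import math

def transe_score(h, r, t):
    """TransE plausibility: negative L2 distance ||h + r - t||.
    Scores closer to 0 mean the triple is more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy embeddings (illustrative values only).
berlin  = [0.9, 0.1, 0.0]
germany = [1.0, 0.0, 0.5]
capital = [0.1, -0.1, 0.5]   # relation vector, roughly germany - berlin
paris   = [0.0, 0.9, 0.2]

# The true triple (berlin, capital, germany) should outrank a corrupted one.
true_score = transe_score(berlin, capital, germany)
false_score = transe_score(paris, capital, germany)
```

Literal-aware models such as LitKGE extend this kind of score by deriving additional entity features (e.g., from numerical literals) before or during embedding learning.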
Learning Logical Rules from Knowledge Graphs
Ph.D. (Integrated) Thesis
Expressing and extracting regularities in multi-relational data, where data points are interrelated
and heterogeneous, requires well-designed knowledge representation. Knowledge Graphs (KGs),
as a graph-based representation of multi-relational data, have seen a rapidly growing presence in
industry and academia, where many real-world applications and academic research are either
enabled or augmented through the incorporation of KGs. However, due to the way KGs are
constructed, they are inherently noisy and incomplete. In this thesis, we focus on developing
logic-based graph reasoning systems that utilize logical rules to infer missing facts for the
completion of KGs. Unlike most rule learners that primarily mine abstract rules that contain
no constants, we are particularly interested in learning instantiated rules that contain constants
due to their ability to represent meaningful patterns and correlations that cannot be expressed
by abstract rules. The inclusion of instantiated rules often leads to exponential growth in the
search space. Therefore, it is necessary to develop optimization strategies to balance between
scalability and expressivity. To this end, we propose GPFL, a probabilistic rule learning
system optimized to mine instantiated rules through the implementation of a novel two-stage
rule generation mechanism. Through experiments, we demonstrate that GPFL not only performs
competitively on knowledge graph completion but is also much more efficient than existing
methods at mining instantiated rules. With GPFL, we also reveal overfitting instantiated rules
and provide detailed analyses of their impact on system performance. Then, we propose RHF,
a generic framework for constructing rule hierarchies from a given set of rules. We demonstrate
through experiments that with RHF and the hierarchical pruning techniques enabled by it,
significant reductions in runtime and rule size are observed due to the pruning of unpromising
rules. Finally, to test the practicability of rule learning systems, we develop Ranta, a novel
drug repurposing system that relies on logical rules as features to make interpretable inferences.
Ranta outperforms existing methods by a large margin in predictive performance and can make
reasonable repurposing suggestions with interpretable evidence.
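GPFL's distinguishing feature is its focus on instantiated rules, i.e., rules that mention constants. The abstract does not spell out its rule language, but the basic inference step such rules feed, matching a rule body against known triples to propose missing head facts, can be sketched for a two-atom instantiated rule (the KG and rule below are invented for illustration; GPFL's actual rule language is richer):

```python
def infer(kg, rule):
    """Apply an instantiated two-atom rule to a set of triples.

    rule = (head_rel, const, rel1, rel2), encoding
        head_rel(X, const) <- rel1(X, Y) ^ rel2(Y, const)
    Returns the head triples the rule predicts but the KG lacks.
    """
    head_rel, const, rel1, rel2 = rule
    # Entities Y already linked to the constant via rel2.
    anchors = {s for s, r, o in kg if r == rel2 and o == const}
    inferred = set()
    for s, r, o in kg:
        if r == rel1 and o in anchors:
            triple = (s, head_rel, const)
            if triple not in kg:  # only genuinely new facts
                inferred.add(triple)
    return inferred

# Toy KG and rule: livesIn(X, London) <- worksAt(X, Y) ^ locatedIn(Y, London)
kg = {
    ("alice", "worksAt", "acme"),
    ("bob", "worksAt", "acme"),
    ("acme", "locatedIn", "London"),
    ("bob", "livesIn", "London"),
}
rule = ("livesIn", "London", "worksAt", "locatedIn")
new_facts = infer(kg, rule)
```

The constant "London" is what makes this rule instantiated; the abstract variant livesIn(X, Z) <- worksAt(X, Y) ^ locatedIn(Y, Z) cannot capture patterns specific to one entity, which is the expressivity gap the thesis targets.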
Enhance Representation Learning of Clinical Narrative with Neural Networks for Clinical Predictive Modeling
Medicine is undergoing a technological revolution. Understanding human health from clinical data has major challenges from technical and practical perspectives, thus prompting methods that understand large, complex, and noisy data. These methods are particularly necessary for natural language data from clinical narratives/notes, which contain some of the richest information on a patient. Meanwhile, deep neural networks have achieved superior performance in a wide variety of natural language processing (NLP) tasks because of their capacity to encode meaningful but abstract representations and learn the entire task end-to-end. In this thesis, I investigate representation learning of clinical narratives with deep neural networks through a number of tasks ranging from clinical concept extraction, clinical note modeling, and patient-level language representation. I present methods utilizing representation learning with neural networks to support understanding of clinical text documents.
I first introduce the notion of representation learning from natural language processing and patient data modeling. Then, I investigate word-level representation learning to improve clinical concept extraction from clinical notes. I present two works on learning word representations and evaluate them to extract important concepts from clinical notes. The first study focuses on cancer-related information, and the second study evaluates shared-task data. The aims of these two studies are to automatically extract important entities from clinical notes. Next, I present a series of deep neural networks to encode hierarchical, longitudinal, and contextual information for modeling a series of clinical notes. I also evaluate the models by predicting clinical outcomes of interest, including mortality, length of stay, and phenotype predictions. Finally, I propose a novel representation learning architecture to develop a generalized and transferable language representation at the patient level. I also identify pre-training tasks appropriate for constructing a generalizable language representation. The main focus is to improve predictive performance of phenotypes with limited data, a challenging task due to a lack of data.
Overall, this dissertation addresses issues in natural language processing for medicine, including clinical text classification and modeling. These studies expose major barriers to understanding large-scale clinical notes. It is believed that developing deep representation learning methods for distilling enormous amounts of heterogeneous data into patient-level language representations will improve evidence-based clinical understanding. The approach of solving these issues by learning representations could be used across clinical applications despite noisy data. I conclude that considering different linguistic components in natural language and sequential information between clinical events is important. Such results have implications beyond the immediate context of predictions and further suggest future directions for clinical machine learning research to improve clinical outcomes. This could be a starting point for future phenotyping methods based on natural language processing that construct patient-level language representations to improve clinical predictions. While significant progress has been made, many open questions remain, so I highlight a few works to demonstrate promising directions.
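The thesis builds learned neural encoders over a hierarchy of tokens, notes, and patients; those architectures are not reproduced here. As a structural illustration only, the hierarchy can be caricatured with mean pooling standing in for the learned encoders (the one-dimensional "embedding" below is a toy stand-in, not a real clinical embedding):

```python
def mean_pool(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def patient_representation(notes, embed):
    """Hierarchical pooling: token vectors -> note vectors -> patient vector.

    notes: list of notes, each a list of tokens.
    embed: callable mapping a token to its vector.
    """
    note_vecs = [mean_pool([embed(tok) for tok in note]) for note in notes]
    return mean_pool(note_vecs)

# Toy 1-dimensional "embedding": a token's character length (illustrative).
toy_embed = lambda tok: [float(len(tok))]
notes = [["chest", "pain"], ["fever"]]
patient_vec = patient_representation(notes, toy_embed)
```

In the dissertation, each pooling level is instead a trained network that also captures longitudinal and contextual information, but the data flow from notes to a single patient-level vector is the same.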
The Emerging Trends of Multi-Label Learning
Exabytes of data are generated daily by humans, leading to the growing need
for new efforts in dealing with the grand challenges for multi-label learning
brought by big data. For example, extreme multi-label classification is an
active and rapidly growing research area that deals with classification tasks
with an extremely large number of classes or labels; utilizing massive data
with limited supervision to build a multi-label classification model becomes
valuable for practical applications, etc. Besides these, there are tremendous
efforts on how to harvest the strong learning capability of deep learning to
better capture the label dependencies in multi-label learning, which is the key
for deep learning to address real-world classification tasks. However, it is
noted that there has been a lack of systematic studies that focus explicitly on
analyzing the emerging trends and new challenges of multi-label learning in the
era of big data. It is imperative to call for a comprehensive survey to fulfill
this mission and delineate future research directions and new applications.
Comment: Accepted to TPAMI 202
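In extreme multi-label classification, each instance's prediction is a set of labels, and systems are commonly compared with micro-averaged metrics over the whole test set. A minimal pure-Python micro-F1 (a standard metric in this area, though not one the survey itself prescribes):

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over a multi-label test set.

    true_sets, pred_sets: parallel lists of per-instance label sets.
    Counts true/false positives and false negatives globally, then
    computes a single precision/recall/F1 from the pooled counts.
    """
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Toy predictions over two instances (illustrative labels).
score = micro_f1([{"a", "b"}, {"c"}], [{"a"}, {"c", "d"}])
```

Micro-averaging weights every label occurrence equally, which is why it dominates in the extreme setting where most of the enormous label space is rare.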
A Survey on Semantic Processing Techniques
Semantic processing is a fundamental research domain in computational
linguistics. In the era of powerful pre-trained language models and large
language models, the advancement of research in this domain appears to be
decelerating. However, the study of semantics is multi-dimensional in
linguistics. The research depth and breadth of computational semantic
processing can be largely improved with new technologies. In this survey, we
analyzed five semantic processing tasks, namely word sense disambiguation,
anaphora resolution, named entity recognition, concept extraction, and
subjectivity detection. We study relevant theoretical research in these fields,
advanced methods, and downstream applications. We connect the surveyed tasks
with downstream applications because this may inspire future scholars to fuse
these low-level semantic processing tasks with high-level natural language
processing tasks. The review of theoretical research may also inspire new tasks
and technologies in the semantic processing domain. Finally, we compare the
different semantic processing techniques and summarize their technical trends,
application trends, and future directions.
Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing in the published version due to the publication policies. Please contact Prof. Erik Cambria for details.
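Among the surveyed tasks, word sense disambiguation has a classic gloss-overlap baseline, the simplified Lesk algorithm, which picks the sense whose dictionary gloss shares the most words with the ambiguous word's context. A minimal sketch (the senses, glosses, and context below are illustrative, not drawn from any particular lexicon):

```python
def simplified_lesk(context, sense_glosses):
    """Pick the sense whose gloss overlaps most with the context words.

    context: iterable of words surrounding the ambiguous word.
    sense_glosses: dict mapping sense name -> gloss string.
    Ties are broken by dict insertion order.
    """
    ctx = {w.lower() for w in context}

    def overlap(sense):
        return len(ctx & set(sense_glosses[sense].lower().split()))

    return max(sense_glosses, key=overlap)

# Toy disambiguation of "bank" (hypothetical sense inventory).
glosses = {
    "bank_finance": "an institution that accepts deposits and lends money",
    "bank_river": "sloping land beside a body of water",
}
context = "he sat on the sloping land beside the water".split()
sense = simplified_lesk(context, glosses)
```

Modern WSD systems covered by such surveys replace the raw word overlap with contextual embeddings, but the structure of the decision, scoring each candidate sense against the context, is the same.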