Search CORE

465 research outputs found

고유명사 정규화 기법을 이용한 지식 그래프 구축

Author: 전성환
Publication venue: 서울대학교 대학원
Publication date: 01/02/2023
Field of study

학위논문(박사) -- 서울대학교대학원 : 공과대학 산업공학과, 2023. 2. 조성준.Text mining aims to extract the information from documents to derive valuable insights. The knowledge graph provides richer information from various documents. Past literature responded for such needs by building technology trees or concept network from the bibliographic information of the documents, or by relying on text mining techniques in order to extract keywords and/or phrases. In this paper, we propose a framework for building a knowledge graph using named entities. The knowledge graph construction framework in this paper satisfies the following conditions: (1) extracting the named entity in the completed form, (2) Building datasets that can be trained and be evaluated by the named entity normalization models in various domains such as finance and technical documents in addition to bio-informatics, where existing NEN research has been active, (3) creating the better performing named entity normalization model, and (4) constructing the knowledge graph by grouping named entities with the same meaning that appear in various forms.텍스트 마이닝은 다양한 인사이트를 얻기 위해 문서에서 정보를 추출하는 것을 목표로 한다. 문서의 정보를 표현하는 방식 중 하나인 지식 그래프는 다양한 문서에서 더욱 풍부한 정보를 제공한다. 기존 연구들은 텍스트 마이닝 기법을 이용하여 문서의 정보들로 기술 트리 또는 개념 네트워크를 구축하거나 키워드 및 구문을 추출하였다. 본 논문에 서는 고유명사를 이용하여 지식 그래프를 구축하기 위한 프레임워크를 제안한다. 본 논문의 지식 그래프 구축 프레임워크는 다음과 같은 조건을 만족한다. (1) 고유명사를 사람이 이해하기 쉬운 형태로 추출한다. (2) 기존 고유명사 정규화 연구가 활발했던 생물정보학 외에 금융 문서, 반도체 관련 특허 문서에서 추출한 고유명사로 고유명사 정규화 데이터셋을 구축한다. (3) 더 나은 성능의 고유명사 정규화 모델을 구축한다. (4) 다양한 형태의 동일한 의미를 가진 고유명사를 그룹화하여 지식 그래프를 구축한다.Chapter 1 Introduction 1 Chapter 2 Literature review 5 2.1 Named entity normalization dataset 5 2.2 Named entity normalization 6 2.3 Knowledge graph construction 9 Chapter 3 Dictionary construction for named entity normalization 11 3.1 Background 11 3.2 Dictionary construction methods 12 3.2.1 Finance named entity normalization dataset 12 3.2.2 Patent named entity normalization dataset 18 3.3 Chapter summary 24 Chapter 4 Named entity normalization model using edge weight updating neural network 26 4.1 Background 26 4.2 Proposed model 28 4.2.1 Ground truth entity graph construction 31 4.2.2 Similarity-based entity graph construction 32 4.2.3 Edge weight updating neural network training 35 4.2.4 Edge weight updating neural network inferencing 38 4.3 Experiment results 39 4.3.1 Datasets 39 4.3.2 Experiment settings: named entity normalization in bioinformatics 40 4.3.3 Experiment Settings: Named Entity Normalization in Finance 42 4.4 Results 44 4.4.1 Quantitative Analysis: Bioinformatics 45 4.4.2 QuantitativeAnalysis:Finance 46 4.4.3 QualitativeAnalysis 47 4.5 Chapter summary 51 Chapter 5 Building knowledge graph using named entity recognition and normalization models 53 5.1 Background 53 5.2 Proposed model 55 5.2.1 Named entity normalization 56 5.2.2 Construction of the semiconductor-related patent knowledge graph 61 5.3 Experiment results 62 5.3.1 Comparison models 62 5.3.2 Parameters ettings 64 5.4 Results 64 5.4.1 Quantitative evaluations 64 5.4.2 Qualitative evaluations 70 5.4.3 Knowledge graph visualization and exemplary investigation 71 5.5 Chapter summary 75 Chapter 6 Conclusion 77 6.1 Contributions 77 6.2 Future work 78 Bibliography 79 국문초록 92 감사의 글 93박

SNU Open Repository and Archive

BELB: a Biomedical Entity Linking Benchmark

Author: Garda Samuele
Leser Ulf
Martin Robert
Weber-Genzel Leon
Publication venue
Publication date: 22/08/2023
Field of study

Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base. It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad coverage knowledge base UMLS, leaving their performance to more specialized ones, e.g. genes or variants, understudied. We therefore developed BELB, a Biomedical Entity Linking Benchmark, providing access in a unified format to 11 corpora linked to 7 knowledge bases and spanning six entity types: gene, disease, chemical, species, cell line and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora offering a standardized testbed for reproducible experiments. Using BELB we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture showing that neural approaches fail to perform consistently across entity types, highlighting the need of further studies towards entity-agnostic models

arXiv.org e-Print Archive

CoSiNES: Contrastive Siamese Network for Entity Standardization

Author: Choudhury Mihir
Merler Michele
Pavuluri Raju
Singh Munindar P.
Vukovic Maja
Yuan Jiaqing
Publication venue
Publication date: 05/06/2023
Field of study

Entity standardization maps noisy mentions from free-form text to standard entities in a knowledge base. The unique challenge of this task relative to other entity-related tasks is the lack of surrounding context and numerous variations in the surface form of the mentions, especially when it comes to generalization across domains where labeled data is scarce. Previous research mostly focuses on developing models either heavily relying on context, or dedicated solely to a specific domain. In contrast, we propose CoSiNES, a generic and adaptable framework with Contrastive Siamese Network for Entity Standardization that effectively adapts a pretrained language model to capture the syntax and semantics of the entities in a new domain. We construct a new dataset in the technology domain, which contains 640 technical stack entities and 6,412 mentions collected from industrial content management systems. We demonstrate that CoSiNES yields higher accuracy and faster runtime than baselines derived from leading methods in this domain. CoSiNES also achieves competitive performance in four standard datasets from the chemistry, medicine, and biomedical domains, demonstrating its cross-domain applicability.Comment: Accepted by Matching Workshop at ACL202

arXiv.org e-Print Archive