Knowledge graph construction using named entity normalization techniques
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering, 2023. 2. Sungzoon Cho.
Text mining aims to extract information from documents to derive valuable insights. A knowledge graph provides richer information drawn from various documents. Past literature responded to this need by building technology trees or concept networks from the bibliographic information of documents, or by relying on text mining techniques to extract keywords and/or phrases. In this paper, we propose a framework for building a knowledge graph using named entities. The framework satisfies the following conditions: (1) extracting named entities in their complete form, (2) building datasets on which named entity normalization (NEN) models can be trained and evaluated in various domains, such as finance and technical documents, in addition to bioinformatics, where existing NEN research has been active, (3) creating a better-performing named entity normalization model, and (4) constructing the knowledge graph by grouping named entities that have the same meaning but appear in various forms.
Text mining aims to extract information from documents in order to obtain various insights. The knowledge graph, one way of representing the information in documents, provides richer information from various documents. Previous studies used text mining techniques to build technology trees or concept networks from document information, or to extract keywords and phrases. This thesis proposes a framework for building a knowledge graph using named entities. The knowledge graph construction framework of this thesis satisfies the following conditions. (1) It extracts named entities in a form that is easy for people to understand. (2) It builds named entity normalization datasets from named entities extracted from financial documents and semiconductor-related patent documents, in addition to bioinformatics, where existing named entity normalization research has been active. (3) It builds a named entity normalization model with higher performance. (4) It constructs the knowledge graph by grouping named entities that have the same meaning in various forms.
Chapter 1 Introduction
Chapter 2 Literature review
2.1 Named entity normalization dataset
2.2 Named entity normalization
2.3 Knowledge graph construction
Chapter 3 Dictionary construction for named entity normalization
3.1 Background
3.2 Dictionary construction methods
3.2.1 Finance named entity normalization dataset
3.2.2 Patent named entity normalization dataset
3.3 Chapter summary
Chapter 4 Named entity normalization model using edge weight updating neural network
4.1 Background
4.2 Proposed model
4.2.1 Ground truth entity graph construction
4.2.2 Similarity-based entity graph construction
4.2.3 Edge weight updating neural network training
4.2.4 Edge weight updating neural network inferencing
4.3 Experiment results
4.3.1 Datasets
4.3.2 Experiment settings: named entity normalization in bioinformatics
4.3.3 Experiment settings: named entity normalization in finance
4.4 Results
4.4.1 Quantitative analysis: bioinformatics
4.4.2 Quantitative analysis: finance
4.4.3 Qualitative analysis
4.5 Chapter summary
Chapter 5 Building knowledge graph using named entity recognition and normalization models
5.1 Background
5.2 Proposed model
5.2.1 Named entity normalization
5.2.2 Construction of the semiconductor-related patent knowledge graph
5.3 Experiment results
5.3.1 Comparison models
5.3.2 Parameter settings
5.4 Results
5.4.1 Quantitative evaluations
5.4.2 Qualitative evaluations
5.4.3 Knowledge graph visualization and exemplary investigation
5.5 Chapter summary
Chapter 6 Conclusion
6.1 Contributions
6.2 Future work
Bibliography
Abstract in Korean
Acknowledgments
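The thesis abstract above describes grouping named entities that share a meaning but appear in different surface forms. The following is only a minimal sketch of that general idea, not the thesis's edge weight updating neural network: it assumes a plain string-similarity score (`difflib.SequenceMatcher`) as a stand-in for a learned similarity, and the `normalize` helper, the `0.85` threshold, and the sample mentions are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(mention: str) -> str:
    """Lowercase and keep only letters and digits (a crude canonical form)."""
    return "".join(ch for ch in mention.lower() if ch.isalnum())


def group_mentions(mentions, threshold=0.85):
    """Group mentions whose normalized forms are similar enough.

    Builds a similarity graph (an edge wherever the ratio reaches the
    threshold) and returns its connected components via union-find.
    """
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    keys = [normalize(m) for m in mentions]
    for i, j in combinations(range(len(mentions)), 2):
        if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold:
            parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention)
    return list(groups.values())


# Hypothetical mentions, for illustration only.
mentions = [
    "Samsung Electronics",
    "samsung-electronics",
    "SAMSUNG ELECTRONICS Co.",
    "Intel Corp",
    "Intel Corp.",
]
groups = group_mentions(mentions)
for group in groups:
    print(group)
```

Each resulting group could serve as one node of a knowledge graph, with the group's members kept as aliases. Connected components are a deliberately simple grouping rule: a single spurious edge merges two groups, which is one reason a learned edge weight may be preferable in practice.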
BELB: a Biomedical Entity Linking Benchmark
Biomedical entity linking (BEL) is the task of grounding entity mentions to a
knowledge base. It plays a vital role in information extraction pipelines for
the life sciences literature. We review recent work in the field and find that,
as the task is absent from existing benchmarks for biomedical text mining,
different studies adopt different experimental setups, making comparisons based
on published numbers problematic. Furthermore, neural systems are tested
primarily on instances linked to the broad-coverage knowledge base UMLS,
leaving their performance on more specialized ones, e.g. genes or variants,
understudied. We therefore developed BELB, a Biomedical Entity Linking
Benchmark, providing access in a unified format to 11 corpora linked to 7
knowledge bases and spanning six entity types: gene, disease, chemical,
species, cell line and variant. BELB greatly reduces preprocessing overhead in
testing BEL systems on multiple corpora, offering a standardized testbed for
reproducible experiments. Using BELB, we perform an extensive evaluation of six
rule-based entity-specific systems and three recent neural approaches
leveraging pre-trained language models. Our results reveal a mixed picture,
showing that neural approaches fail to perform consistently across entity
types and highlighting the need for further studies towards entity-agnostic models.
CoSiNES: Contrastive Siamese Network for Entity Standardization
Entity standardization maps noisy mentions from free-form text to standard
entities in a knowledge base. The unique challenge of this task relative to
other entity-related tasks is the lack of surrounding context and numerous
variations in the surface form of the mentions, especially when it comes to
generalization across domains where labeled data is scarce. Previous research
mostly focuses on developing models that either rely heavily on context or are
dedicated solely to a specific domain. In contrast, we propose CoSiNES, a
generic and adaptable framework with Contrastive Siamese Network for Entity
Standardization that effectively adapts a pretrained language model to capture
the syntax and semantics of the entities in a new domain.
We construct a new dataset in the technology domain, which contains 640
technical stack entities and 6,412 mentions collected from industrial content
management systems. We demonstrate that CoSiNES yields higher accuracy and
faster runtime than baselines derived from leading methods in this domain.
CoSiNES also achieves competitive performance on four standard datasets from
the chemistry, medicine, and biomedical domains, demonstrating its cross-domain
applicability.
Comment: Accepted by Matching Workshop at ACL202
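The task described in the abstract above, mapping a noisy mention to a standard entity without surrounding context, can be illustrated with a nearest-neighbor lookup. This is a minimal sketch and not CoSiNES itself: it assumes character trigram counts with cosine similarity in place of the learned Siamese encoder, and the `standardize` function, the knowledge-base entries, and the mentions are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Bag of character n-grams over a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def standardize(mention, standard_entities):
    """Return the standard entity most similar to the mention, with its score."""
    query = char_ngrams(mention)
    scored = [(cosine(query, char_ngrams(e)), e) for e in standard_entities]
    score, best = max(scored)
    return best, score


# Hypothetical knowledge-base entries and noisy mentions, for illustration only.
knowledge_base = ["PostgreSQL", "Kubernetes", "TensorFlow"]
for mention in ["postgres sql", "kubernets", "tensor-flow 2"]:
    print(mention, "->", standardize(mention, knowledge_base))
```

Surface-level n-grams handle spelling variation but not synonyms or abbreviations, which is where an encoder adapted to the target domain would be expected to help.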