465 research outputs found

    Knowledge Graph Construction Using Named Entity Normalization Techniques

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering, February 2023. 조성준.
    Text mining aims to extract information from documents to derive valuable insights. A knowledge graph, one way of representing that information, provides richer information drawn from various documents. Past literature responded to this need by building technology trees or concept networks from the bibliographic information of documents, or by relying on text mining techniques to extract keywords and/or phrases. In this paper, we propose a framework for building a knowledge graph using named entities. The framework satisfies the following conditions: (1) extracting named entities in a complete, human-readable form; (2) building datasets on which named entity normalization (NEN) models can be trained and evaluated in domains such as finance and technical documents (semiconductor-related patents), in addition to bioinformatics, where existing NEN research has been active; (3) creating a better-performing named entity normalization model; and (4) constructing the knowledge graph by grouping named entities that have the same meaning but appear in various forms.
    Table of contents:
    Chapter 1 Introduction
    Chapter 2 Literature review
      2.1 Named entity normalization dataset
      2.2 Named entity normalization
      2.3 Knowledge graph construction
    Chapter 3 Dictionary construction for named entity normalization
      3.1 Background
      3.2 Dictionary construction methods
        3.2.1 Finance named entity normalization dataset
        3.2.2 Patent named entity normalization dataset
      3.3 Chapter summary
    Chapter 4 Named entity normalization model using edge weight updating neural network
      4.1 Background
      4.2 Proposed model
        4.2.1 Ground truth entity graph construction
        4.2.2 Similarity-based entity graph construction
        4.2.3 Edge weight updating neural network training
        4.2.4 Edge weight updating neural network inferencing
      4.3 Experiment results
        4.3.1 Datasets
        4.3.2 Experiment settings: named entity normalization in bioinformatics
        4.3.3 Experiment settings: named entity normalization in finance
      4.4 Results
        4.4.1 Quantitative analysis: bioinformatics
        4.4.2 Quantitative analysis: finance
        4.4.3 Qualitative analysis
      4.5 Chapter summary
    Chapter 5 Building knowledge graph using named entity recognition and normalization models
      5.1 Background
      5.2 Proposed model
        5.2.1 Named entity normalization
        5.2.2 Construction of the semiconductor-related patent knowledge graph
      5.3 Experiment results
        5.3.1 Comparison models
        5.3.2 Parameter settings
      5.4 Results
        5.4.1 Quantitative evaluations
        5.4.2 Qualitative evaluations
        5.4.3 Knowledge graph visualization and exemplary investigation
      5.5 Chapter summary
    Chapter 6 Conclusion
      6.1 Contributions
      6.2 Future work
    Bibliography
    Abstract in Korean
    Acknowledgements
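    Condition (4) of the abstract above, grouping named entities that share a meaning but differ in surface form, can be illustrated with a minimal sketch. This is not the thesis's method (the thesis trains an edge weight updating neural network); it swaps in plain string similarity from Python's standard `difflib` module purely for illustration. The function names, the 0.85 threshold, and the sample mentions are all assumptions.

```python
from difflib import SequenceMatcher


def normalize(mention: str) -> str:
    """Lowercase and keep only letters/digits so trivial variants compare equal."""
    return "".join(ch for ch in mention.lower() if ch.isalnum())


def similar(a: str, b: str, threshold: float) -> bool:
    """True when the normalized strings are at least `threshold` similar."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def group_mentions(mentions: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Single-link grouping of mentions using a small union-find structure."""
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of mentions whose similarity clears the threshold.
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if similar(mentions[i], mentions[j], threshold):
                parent[find(j)] = find(i)

    # Collect mentions by root; each group would become one graph node.
    groups: dict[int, list[str]] = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention)
    return list(groups.values())


# Hypothetical mentions, chosen only to show the grouping behaviour.
sample = ["Samsung Electronics", "samsung-electronics", "SAMSUNG ELECTRONICS", "TSMC"]
print(group_mentions(sample))
```

    Each resulting group would correspond to a single node in the knowledge graph. A learned normalization model would replace `similar` when surface forms diverge more than simple casing or punctuation differences.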

    BELB: a Biomedical Entity Linking Benchmark

    Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base. It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, because the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups, making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad-coverage knowledge base UMLS, leaving their performance on more specialized knowledge bases, e.g. for genes or variants, understudied. We therefore developed BELB, a Biomedical Entity Linking Benchmark, providing access in a unified format to 11 corpora linked to 7 knowledge bases and spanning six entity types: gene, disease, chemical, species, cell line and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora, offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture: neural approaches fail to perform consistently across entity types, highlighting the need for further studies towards entity-agnostic models.

    CoSiNES: Contrastive Siamese Network for Entity Standardization

    Entity standardization maps noisy mentions from free-form text to standard entities in a knowledge base. The unique challenge of this task relative to other entity-related tasks is the lack of surrounding context and the numerous variations in the surface form of the mentions, especially when generalizing across domains where labeled data is scarce. Previous research mostly focuses on developing models that either rely heavily on context or are dedicated solely to a specific domain. In contrast, we propose CoSiNES, a generic and adaptable framework with a Contrastive Siamese Network for Entity Standardization that effectively adapts a pretrained language model to capture the syntax and semantics of the entities in a new domain. We construct a new dataset in the technology domain, which contains 640 technical stack entities and 6,412 mentions collected from industrial content management systems. We demonstrate that CoSiNES yields higher accuracy and faster runtime than baselines derived from leading methods in this domain. CoSiNES also achieves competitive performance on four standard datasets from the chemistry, medicine, and biomedical domains, demonstrating its cross-domain applicability.
    Comment: Accepted by Matching Workshop at ACL202