167 research outputs found

    Enhancing Dialogue Generation via Dynamic Graph Knowledge Aggregation

    Incorporating external graph knowledge into neural chatbot models has proven effective for enhancing dialogue generation. However, in conventional graph neural networks (GNNs), message passing on a graph is independent of the text, so the hidden space of the graph representation differs from that of the text. This training regime in existing models therefore leads to a semantic gap between graph knowledge and text. In this study, we propose a novel framework for knowledge-graph-enhanced dialogue generation. We dynamically construct a multi-hop knowledge graph with pseudo nodes to involve the language model in feature aggregation within the graph at all steps. To avoid the semantic biases caused by learning on vanilla subgraphs, the proposed framework applies hierarchical graph attention to aggregate graph features onto the pseudo nodes and then obtains a global feature. The framework can therefore better utilise the heterogeneous features of both the post and the external graph knowledge. Extensive experiments demonstrate that our framework outperforms state-of-the-art (SOTA) baselines on dialogue generation. Further analysis also shows that our representation learning framework can close the semantic gap by unifying the representations of text and graph knowledge. Moreover, the language model learns to select knowledge triples for a more informative response by exploiting subgraph patterns within our feature aggregation process. Our code and resources are available at https://github.com/tangg555/SaBART. (Accepted at ACL 2023.)
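    The attention-based aggregation onto pseudo nodes can be pictured with a small sketch. The code below is an illustrative reconstruction, not the SaBART implementation: the class name PseudoNodeAggregator, the dimensions, and the single level of attention are all assumptions, and the actual framework applies such aggregation hierarchically while keeping the language model involved at every step.

    # Minimal sketch (PyTorch): attention-conditioned aggregation of subgraph
    # node features onto a single pseudo node. Names and shapes are invented
    # for illustration; this is not the authors' code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PseudoNodeAggregator(nn.Module):
        """Aggregate entity-node features onto a pseudo node via attention
        conditioned on a pooled text (post) representation."""
        def __init__(self, d_model: int):
            super().__init__()
            self.query = nn.Linear(d_model, d_model)  # projects the text state
            self.key = nn.Linear(d_model, d_model)    # projects graph node states
            self.value = nn.Linear(d_model, d_model)

        def forward(self, text_state, node_states):
            # text_state: (d_model,) pooled post representation
            # node_states: (num_nodes, d_model) subgraph entity features
            q = self.query(text_state)                 # (d_model,)
            k = self.key(node_states)                  # (num_nodes, d_model)
            v = self.value(node_states)                # (num_nodes, d_model)
            scores = k @ q / k.shape[-1] ** 0.5        # scaled dot-product scores
            weights = F.softmax(scores, dim=-1)        # attention over nodes
            return weights @ v                         # (d_model,) pseudo-node feature

    # Toy usage: summarise a 5-node subgraph into one pseudo-node feature.
    agg = PseudoNodeAggregator(d_model=64)
    pseudo_node = agg(torch.randn(64), torch.randn(5, 64))

    A second attention of the same shape applied over the pseudo-node features themselves would then yield the kind of global feature the abstract refers to.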

    A Re-examination of Chatbot Evaluation Metrics

    One of the most important and challenging parts of developing a chatbot is its evaluation, since judging a conversation depends on a number of complex elements. The objective of this thesis is to understand the characteristics of two types of automated metrics, trained and untrained, and to identify the metrics most suitable for dialogue evaluation. Experiments were also conducted to study the weaknesses of word-overlap metrics in morphology-rich languages and possible solutions to that problem. In particular, six evaluation metrics were used in the experiments: Kullback–Leibler divergence, Coherence, BLEU, Embedding, Entropy, and MaUde. In addition, three datasets covering two different languages (English and Finnish) were collected to study whether the language influences the quality of the metrics. The metrics are required to discriminate between qualified and unqualified answers; the incorrect answers are generated by randomly sampling sentences from the database that are not relevant to the context. The results indicate that BLEU on 1-grams and greedy matching are the two most appropriate options for chatbot evaluation. One solution is found for the problem posed by morphology-rich languages: the efficiency of BLEU for Finnish can be boosted by segmenting words into sub-words or morphemes.
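    The reported Finnish result is easy to reproduce in spirit: scoring BLEU-1 over sub-word units rather than whole words lets inflected forms earn partial credit. The sketch below is illustrative only; it uses NLTK's sentence-level BLEU, and the fixed-length chunker is a stand-in for a real morpheme segmenter (e.g. Morfessor), which the abstract does not specify.

    # Toy comparison of word-level vs sub-word BLEU-1 for Finnish-like text.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    smooth = SmoothingFunction().method1

    def bleu1(reference, candidate, tokenize):
        # weights=(1, 0, 0, 0) restricts BLEU to unigram precision
        return sentence_bleu([tokenize(reference)], tokenize(candidate),
                             weights=(1, 0, 0, 0), smoothing_function=smooth)

    word_tok = str.split

    def subword_tok(text, n=4):
        # Naive fixed-length chunking as a placeholder morpheme segmenter.
        return [w[i:i + n] for w in text.split() for i in range(0, len(w), n)]

    ref = "keskustelubotti vastasi kysymykseen nopeasti"
    cand = "keskustelubotin vastaus kysymykseen tuli nopeasti"
    print(bleu1(ref, cand, word_tok))     # exact word matches only
    print(bleu1(ref, cand, subword_tok))  # shared stems earn partial credit

    Because Finnish packs case and possession into suffixes, whole-word matching misses near-identical stems; sub-word tokens recover that overlap.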

    쑰건뢀 ν…μŠ€νŠΈ 생성 μ‹œμŠ€ν…œμ— λŒ€ν•œ 사싀 κ΄€κ³„μ˜ 일관성 평가

    Ph.D. dissertation -- Seoul National University Graduate School, College of Engineering, Dept. of Electrical and Computer Engineering, August 2022. Advisor: Kyomin Jung.
    Despite recent advances in conditional text generation systems built on pre-trained language models, the factual consistency of these systems is still insufficient, and the widely used n-gram similarity metrics are poorly suited to evaluating it. Hence, in order to develop factually consistent systems, an automatic factuality metric is needed first. In this dissertation, we propose four metrics that show much higher correlation with human judgments than previous metrics when evaluating the factual consistency of diverse conditional text generation systems. To build these metrics, we utilize (1) auxiliary tasks and (2) data augmentation methods. First, we focus on the keywords or keyphrases that are critical for evaluating factual consistency and propose two factual consistency metrics based on two different auxiliary tasks. We integrate a keyphrase-weight prediction task into previous metrics to propose the KPQA (Keyphrase Prediction for Question Answering) metric for generative QA. We also apply question generation and question answering to develop the captioning metric QACE (Question Answering for Captioning Evaluation): QACE generates questions about the keywords of the candidate caption and checks factual consistency by comparing the answers to these questions obtained from the source image and from the caption. Secondly, instead of using auxiliary tasks, we directly train metrics with a data-driven approach: a model is trained to distinguish augmented inconsistent texts from consistent texts. We first modify the original reference captions into inconsistent captions using several rule-based methods, such as keyword substitution, to propose UMIC (Unreferenced Metric for Image Captioning).
    As a next step, we develop the MFMA (Mask-and-Fill with Masked Article) metric by generating inconsistent summaries from the masked source article and the masked summary. Finally, as an extension of developing data-driven factual consistency metrics, we also propose a fast post-editing system that can fix the factual errors in system outputs. A toy sketch of the rule-based negative-sample construction used for UMIC appears after the table of contents below.
    Table of contents:
    1 Introduction
    2 Background
      2.1 Text Evaluation Metrics
        2.1.1 N-gram Similarity Metrics
        2.1.2 Embedding Similarity Metrics
        2.1.3 Auxiliary Task Based Metrics
        2.1.4 Entailment Based Metrics
      2.2 Evaluating Automated Metrics
    3 Integrating Keyphrase Weights for Factual Consistency Evaluation
      3.1 Related Work
      3.2 Proposed Approach: KPQA-Metric
        3.2.1 KPQA
        3.2.2 KPQA Metric
      3.3 Experimental Setup and Dataset
        3.3.1 Dataset
        3.3.2 Implementation Details
      3.4 Empirical Results
        3.4.1 Comparison with Other Methods
        3.4.2 Analysis
      3.5 Conclusion
    4 Question Generation and Question Answering for Factual Consistency Evaluation
      4.1 Related Work
      4.2 Proposed Approach: QACE
        4.2.1 Question Generation
        4.2.2 Question Answering
        4.2.3 Abstractive Visual Question Answering
        4.2.4 QACE Metric
      4.3 Experimental Setup and Dataset
        4.3.1 Dataset
        4.3.2 Implementation Details
      4.4 Empirical Results
        4.4.1 Comparison with Other Methods
        4.4.2 Analysis
      4.5 Conclusion
    5 Rule-Based Inconsistent Data Augmentation for Factual Consistency Evaluation
      5.1 Related Work
      5.2 Proposed Approach: UMIC
        5.2.1 Modeling
        5.2.2 Negative Samples
        5.2.3 Contrastive Learning
      5.3 Experimental Setup and Dataset
        5.3.1 Dataset
        5.3.2 Implementation Details
      5.4 Empirical Results
        5.4.1 Comparison with Other Methods
        5.4.2 Analysis
      5.5 Conclusion
    6 Inconsistent Data Augmentation with Masked Generation for Factual Consistency Evaluation
      6.1 Related Work
      6.2 Proposed Approach: MFMA and MSM
        6.2.1 Mask-and-Fill with Masked Article
        6.2.2 Masked Summarization
        6.2.3 Training Factual Consistency Checking Model
      6.3 Experimental Setup and Dataset
        6.3.1 Dataset
        6.3.2 Implementation Details
      6.4 Empirical Results
        6.4.1 Comparison with Other Methods
        6.4.2 Analysis
      6.5 Conclusion
    7 Factual Error Correction for Improving Factual Consistency
      7.1 Related Work
      7.2 Proposed Approach: RFEC
        7.2.1 Problem Formulation
        7.2.2 Training Dataset Construction
        7.2.3 Evidence Sentence Retrieval
        7.2.4 Entity Retrieval Based Factual Error Correction
      7.3 Experimental Setup and Dataset
        7.3.1 Dataset
        7.3.2 Implementation Details
      7.4 Empirical Results
        7.4.1 Comparison with Other Methods
        7.4.2 Analysis
      7.5 Conclusion
    8 Conclusion
    Abstract (In Korean)
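    As an illustration of the rule-based negative-sample construction behind UMIC (Chapter 5 above), the sketch below corrupts a reference caption by substituting a keyword. The swap table and function names are invented for illustration; the thesis uses several such rules, and the resulting pairs train the metric to prefer the consistent caption.

    # Toy rule-based inconsistent-caption generator (not the thesis code).
    import random

    SWAP_TABLE = {"dog": "cat", "dogs": "cats", "man": "woman",
                  "red": "blue", "two": "three"}

    def make_inconsistent(caption, rng):
        """Replace one swappable keyword; return the caption unchanged
        if no keyword from the table is present."""
        tokens = caption.split()
        candidates = [i for i, t in enumerate(tokens) if t in SWAP_TABLE]
        if not candidates:
            return caption
        i = rng.choice(candidates)
        tokens[i] = SWAP_TABLE[tokens[i]]
        return " ".join(tokens)

    rng = random.Random(0)
    pos = "a man walks two dogs in the park"
    neg = make_inconsistent(pos, rng)   # e.g. "a man walks two cats in the park"
    # A metric like UMIC is then trained to score `pos` above `neg`.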