1,634 research outputs found
The TRECVID 2007 BBC rushes summarization evaluation pilot
This paper provides an overview of a pilot evaluation of
video summaries using rushes from several BBC dramatic series. It was carried out under the auspices of TRECVID.
Twenty-two research teams submitted video summaries, each up to 4% of the
original duration, for 42 individual rushes video files, with the goal of
compressing out redundant and insignificant material.
The output of two baseline systems built on straightforward
content reduction techniques was contributed by Carnegie
Mellon University as a control. Procedures for developing
ground truth lists of important segments from each video
were developed at Dublin City University and applied to
the BBC video. At NIST each summary was judged by
three humans with respect to how much of the ground truth
was included, how easy the summary was to understand,
and how much repeated material the summary contained.
Additional objective measures included: how long it took
the system to create the summary, how long it took the assessor to judge it against the ground truth, and what the
summary's duration was. Assessor agreement on finding desired segments averaged 78%, and the results indicate that while it is difficult to exceed the performance of the baselines, a few systems did.
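The objective scoring described above can be sketched in code. This is a hypothetical illustration, not NIST's actual procedure: ground-truth segments and summary intervals are `(start, end)` pairs in seconds, a segment counts as found when a summary interval covers at least half of it (the 0.5 threshold is an assumption), and the 4% duration budget is checked separately.

```python
# Hypothetical sketch of summary scoring: fraction of ground-truth
# segments covered, plus the 4%-of-original duration constraint.

def covered(segment, summary_intervals, min_overlap=0.5):
    """A ground-truth segment counts as found if some summary interval
    overlaps at least min_overlap of it (threshold is an assumption)."""
    start, end = segment
    length = end - start
    for s, e in summary_intervals:
        overlap = max(0.0, min(end, e) - max(start, s))
        if overlap / length >= min_overlap:
            return True
    return False

def score_summary(ground_truth, summary_intervals, full_duration):
    """Return (recall over ground-truth segments, within-4%-budget flag)."""
    recall = sum(covered(g, summary_intervals) for g in ground_truth) / len(ground_truth)
    summary_duration = sum(e - s for s, e in summary_intervals)
    within_budget = summary_duration <= 0.04 * full_duration
    return recall, within_budget
```

For a 300-second video, a summary `[(0, 8), (90, 93)]` covers the first of the ground-truth segments `[(0, 10), (30, 40)]` and stays under the 12-second budget.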
Graph-based Neural Multi-Document Summarization
We propose a neural multi-document summarization (MDS) system that
incorporates sentence relation graphs. We employ a Graph Convolutional Network
(GCN) on the relation graphs, with sentence embeddings obtained from Recurrent
Neural Networks as input node features. Through multiple layer-wise
propagation, the GCN generates high-level hidden sentence features for salience
estimation. We then use a greedy heuristic to extract salient sentences while
avoiding redundancy. In our experiments on DUC 2004, we consider three types of
sentence relation graphs and demonstrate the advantage of combining sentence
relations in graphs with the representation power of deep neural networks. Our
model improves upon traditional graph-based extractive approaches and the
vanilla GRU sequence model with no graph, and it achieves competitive results
against other state-of-the-art multi-document summarization systems.Comment: In CoNLL 201
Query-Focused Video Summarization: Dataset, Evaluation, and A Memory Network Based Approach
Recent years have witnessed a resurgence of interest in video summarization.
However, one of the main obstacles to research on video summarization is
user subjectivity: users have various preferences over summaries. This
subjectivity causes at least two problems. First, no single video summarizer
fits all users unless it interacts with and adapts to individual users.
Second, it is very challenging to evaluate the performance of a video
summarizer.
To tackle the first problem, we explore the recently proposed query-focused
video summarization which introduces user preferences in the form of text
queries about the video into the summarization process. We propose a
memory-network-parameterized sequential determinantal point process that
attends the user query over different video frames and shots. To address the second
challenge, we contend that a good evaluation metric for video summarization
should focus on the semantic information that humans can perceive rather than
the visual features or temporal overlaps. To this end, we collect dense
per-video-shot concept annotations, compile a new dataset, and suggest an
efficient evaluation method defined upon the concept annotations. We conduct
extensive experiments contrasting our video summarizer to existing ones and
present detailed analyses of the dataset and the new evaluation method.
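An evaluation defined on per-shot concept annotations could be sketched as below. The details are assumptions for illustration, not the paper's exact metric: each shot carries a set of concept tags, and precision/recall are computed over the concepts in the selected shots rather than over raw temporal overlap.

```python
# Sketch of concept-level summary comparison: score the overlap of
# concept tags between predicted and reference shot selections.

def concept_f1(pred_shots, gold_shots, annotations):
    """F1 over concept sets; `annotations` maps shot id -> set of concepts."""
    pred = set().union(*(annotations[s] for s in pred_shots)) if pred_shots else set()
    gold = set().union(*(annotations[s] for s in gold_shots)) if gold_shots else set()
    if not pred or not gold:
        return 0.0
    p = len(pred & gold) / len(pred)
    r = len(pred & gold) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0
```

A summary that picks a different shot than the reference can still score well if the shots share the semantic concepts humans perceive, which is the point the abstract argues for.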
Evaluating Factual Consistency of Conditional Text Generation Systems
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2022. 8. Kyomin Jung.
Despite recent advances in conditional text generation systems built on pre-trained language models, the factual consistency of these systems is still insufficient. However, the widely used n-gram similarity metrics are poorly suited to evaluating factual consistency. Hence, to develop a factually consistent system, an automatic factuality metric is needed first. In this dissertation, we propose four metrics that show much higher correlation with human judgments than previous metrics when evaluating the factual consistency of diverse conditional text generation systems. To build these metrics, we utilize (1) auxiliary tasks and (2) data augmentation methods.
First, we focus on the keywords or keyphrases that are critical for evaluating factual consistency and propose two factual consistency metrics using two different auxiliary tasks. We first integrate a keyphrase-weight prediction task into previous metrics to propose KPQA (Keyphrase Prediction for Question Answering), a metric for generative QA. We also apply question generation and answering to develop a captioning metric, QACE (Question Answering for Captioning Evaluation). QACE generates questions on the keywords of the candidate caption and checks factual consistency by comparing the answers to these questions for the source image and the caption.
Second, instead of using auxiliary tasks, we directly train metrics with a data-driven approach, yielding two more metrics. Specifically, we train a model to distinguish augmented inconsistent texts from consistent text. We first modify the original reference captions to generate inconsistent captions using several rule-based methods, such as substituting keywords, to propose UMIC (Unreferenced Metric for Image Captioning). As a next step, we introduce MFMA (Mask-and-Fill with Masked Article), a metric built by generating inconsistent summaries from the masked source and the masked summary. Finally, as an extension of developing data-driven factual consistency metrics, we also propose a fast post-editing system that can fix factual errors in system outputs.
1 Introduction 1
2 Background 10
2.1 Text Evaluation Metrics 10
2.1.1 N-gram Similarity Metrics 10
2.1.2 Embedding Similarity Metrics 12
2.1.3 Auxiliary Task Based Metrics 12
2.1.4 Entailment Based Metrics 13
2.2 Evaluating Automated Metrics 14
3 Integrating Keyphrase Weights for Factual Consistency Evaluation 15
3.1 Related Work 17
3.2 Proposed Approach: KPQA-Metric 18
3.2.1 KPQA 18
3.2.2 KPQA Metric 19
3.3 Experimental Setup and Dataset 23
3.3.1 Dataset 23
3.3.2 Implementation Details 26
3.4 Empirical Results 27
3.4.1 Comparison with Other Methods 27
3.4.2 Analysis 29
3.5 Conclusion 35
4 Question Generation and Question Answering for Factual Consistency Evaluation 36
4.1 Related Work 37
4.2 Proposed Approach: QACE 38
4.2.1 Question Generation 38
4.2.2 Question Answering 39
4.2.3 Abstractive Visual Question Answering 40
4.2.4 QACE Metric 42
4.3 Experimental Setup and Dataset 43
4.3.1 Dataset 43
4.3.2 Implementation Details 44
4.4 Empirical Results 45
4.4.1 Comparison with Other Methods 45
4.4.2 Analysis 46
4.5 Conclusion 48
5 Rule-Based Inconsistent Data Augmentation for Factual Consistency Evaluation 49
5.1 Related Work 51
5.2 Proposed Approach: UMIC 52
5.2.1 Modeling 52
5.2.2 Negative Samples 53
5.2.3 Contrastive Learning 55
5.3 Experimental Setup and Dataset 56
5.3.1 Dataset 56
5.3.2 Implementation Details 60
5.4 Empirical Results 61
5.4.1 Comparison with Other Methods 61
5.4.2 Analysis 62
5.5 Conclusion 65
6 Inconsistent Data Augmentation with Masked Generation for Factual Consistency Evaluation 66
6.1 Related Work 68
6.2 Proposed Approach: MFMA and MSM 70
6.2.1 Mask-and-Fill with Masked Article 71
6.2.2 Masked Summarization 72
6.2.3 Training Factual Consistency Checking Model 72
6.3 Experimental Setup and Dataset 73
6.3.1 Dataset 73
6.3.2 Implementation Details 74
6.4 Empirical Results 75
6.4.1 Comparison with Other Methods 75
6.4.2 Analysis 78
6.5 Conclusion 84
7 Factual Error Correction for Improving Factual Consistency 85
7.1 Related Work 87
7.2 Proposed Approach: RFEC 88
7.2.1 Problem Formulation 88
7.2.2 Training Dataset Construction 89
7.2.3 Evidence Sentence Retrieval 90
7.2.4 Entity Retrieval Based Factual Error Correction 90
7.3 Experimental Setup and Dataset 92
7.3.1 Dataset 92
7.3.2 Implementation Details 93
7.4 Empirical Results 93
7.4.1 Comparison with Other Methods 93
7.4.2 Analysis 95
7.5 Conclusion 95
8 Conclusion 97
Abstract (In Korean) 118
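The rule-based negative-sample construction described in the thesis abstract (the UMIC-style data augmentation) can be sketched as below. This is an illustrative toy, not the thesis implementation: a reference caption is perturbed by substituting a keyword from a small hand-written swap table (the table itself is an assumption), producing an inconsistent caption that a metric model can be trained to reject.

```python
# Toy sketch of rule-based inconsistent-caption generation:
# swap one keyword so the caption no longer matches the image.

import random

# Hypothetical substitution table; a real system would use a
# larger vocabulary and several perturbation rules.
SWAPS = {"dog": "cat", "man": "woman", "red": "blue"}

def make_inconsistent(caption, rng=random):
    """Return a perturbed caption, or None if no rule applies."""
    tokens = caption.split()
    candidates = [i for i, t in enumerate(tokens) if t in SWAPS]
    if not candidates:
        return None  # no applicable rule; skip this caption
    i = rng.choice(candidates)
    tokens[i] = SWAPS[tokens[i]]
    return " ".join(tokens)
```

Pairs of (original, perturbed) captions then serve as the consistent/inconsistent training examples the abstract refers to.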
GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
Leaderboards have eased model development for many NLP datasets by
standardizing their evaluation and delegating it to an independent external
repository. Their adoption, however, is so far limited to tasks that can be
reliably evaluated in an automatic manner. This work introduces GENIE, an
extensible human evaluation leaderboard, which brings the ease of leaderboards
to text generation tasks. GENIE automatically posts leaderboard submissions to
crowdsourcing platforms asking human annotators to evaluate them on various
axes (e.g., correctness, conciseness, fluency) and compares their answers to
various automatic metrics. We introduce several datasets in English to GENIE,
representing four core challenges in text generation: machine translation,
summarization, commonsense reasoning, and machine comprehension. We provide
formal granular evaluation metrics and identify areas for future research. We
make GENIE publicly available and hope that it will spur progress in language
generation models as well as their automatic and manual evaluation.
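The comparison GENIE performs between human answers and automatic metrics is typically a rank correlation. The sketch below computes Spearman's correlation from scratch under the simplifying assumption of no tied scores; it illustrates the kind of analysis involved, not GENIE's actual protocol.

```python
# Sketch: rank-correlate an automatic metric's scores with
# aggregated human ratings (Spearman, no tie handling).

def rank(xs):
    """Assign 0-based ranks by ascending value (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def spearman(a, b):
    """Spearman's rho between two equal-length score lists."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A metric whose scores order systems the same way humans do yields a correlation near 1, which is what leaderboard maintainers look for when deciding whether an automatic metric can stand in for human judgment.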