The Factual Inconsistency Problem in Abstractive Text Summarization: A Survey
Recently, various neural encoder-decoder models pioneered by the Seq2Seq
framework have been proposed to achieve the goal of generating more abstractive
summaries by learning to map input text to output text. At a high level, such
neural models can freely generate summaries without any constraint on the words
or phrases used. Moreover, their format is closer to human-edited summaries, and
their output is more readable and fluent. However, the neural model's abstraction
ability is a double-edged sword. A commonly observed problem with the generated
summaries is the distortion or fabrication of factual information in the
article. This inconsistency between the original text and the summary has
raised various concerns over the applicability of such models, and previous evaluation
methods for text summarization are not suited to this issue. In response to
these problems, current research is predominantly divided
into two categories: designing fact-aware evaluation metrics to select
outputs without factual inconsistency errors, and developing new
summarization systems oriented towards factual consistency. In this survey, we focus on
presenting a comprehensive review of these fact-specific evaluation methods and
text summarization models.
Comment: 9 pages, 5 figure
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
We present TriviaQA, a challenging reading comprehension dataset containing
over 650K question-answer-evidence triples. TriviaQA includes 95K
question-answer pairs authored by trivia enthusiasts and independently gathered
evidence documents, six per question on average, that provide high quality
distant supervision for answering the questions. We show that, in comparison to
other recently introduced large-scale datasets, TriviaQA (1) has relatively
complex, compositional questions, (2) has considerable syntactic and lexical
variability between questions and corresponding answer-evidence sentences, and
(3) requires more cross sentence reasoning to find answers. We also present two
baseline algorithms: a feature-based classifier and a state-of-the-art neural
network that performs well on SQuAD reading comprehension. Neither approach
comes close to human performance (23% and 40% vs. 80%), suggesting that
TriviaQA is a challenging testbed that is worth significant future study. Data
and code are available at http://nlp.cs.washington.edu/triviaqa/
Comment: Added references, fixed typos, minor baseline updat
MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities
In this paper, we introduce the MLM (Multiple Languages and Modalities)
dataset - a new resource to train and evaluate multitask systems on samples in
multiple modalities and three languages. The generation process and inclusion
of semantic data provide a resource that further tests the ability of
multitask systems to learn relationships between entities. The dataset is
designed for researchers and developers who build applications that perform
multiple tasks on data encountered on the web and in digital archives. A second
version of MLM provides a geo-representative subset of the data with weighted
samples for countries of the European Union. We demonstrate the value of the
resource in developing novel applications in the digital humanities with a
motivating use case and specify a benchmark set of tasks to retrieve modalities
and locate entities in the dataset. Evaluation of baseline multitask and
single-task systems on the full and geo-representative versions of MLM demonstrates the
challenges of generalising on diverse data. In addition to the digital
humanities, we expect the resource to contribute to research in multimodal
representation learning, location estimation, and scene understanding
- …