2 research outputs found
NCLS: Neural Cross-Lingual Summarization
Cross-lingual summarization (CLS) is the task of producing a summary in one
language for a source document written in a different language. Existing
methods simply divide this task into two steps, summarization and translation,
which leads to error propagation. To address this, we present the first
end-to-end CLS framework, which we refer to as Neural Cross-Lingual
Summarization (NCLS). Moreover, we propose to further
improve NCLS by incorporating two related tasks, monolingual summarization and
machine translation, into the training process of CLS under multi-task
learning. Due to the lack of supervised CLS data, we propose a round-trip
translation strategy to acquire two high-quality large-scale CLS datasets based
on existing monolingual summarization datasets. Experimental results show
that NCLS achieves remarkable improvements over traditional pipeline methods
on both English-to-Chinese and Chinese-to-English human-corrected CLS test
sets. In addition, NCLS with multi-task learning further significantly
improves the quality of the generated summaries. We make our dataset and code
publicly available here: http://www.nlpr.ia.ac.cn/cip/dataset.htm
Comment: Accepted to EMNLP-IJCNLP 2019
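The round-trip translation strategy above can be sketched in a few lines: translate each monolingual reference summary into the target language, translate it back, and keep the pair only if the back-translation stays close to the original summary. The sketch below is illustrative, not the authors' released code; `translate`, `rouge_f1`, and the threshold value are assumed placeholders.

```python
# Minimal sketch of a round-trip translation filter for building CLS data.
# Assumptions (illustrative, not from the paper text): `translate` wraps an
# arbitrary MT system and `rouge_f1` wraps a ROUGE scorer.

def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for any machine translation system."""
    raise NotImplementedError

def rouge_f1(hypothesis: str, reference: str) -> float:
    """Placeholder for a ROUGE F1 scorer."""
    raise NotImplementedError

def build_cls_pairs(mono_pairs, src="en", tgt="zh", threshold=0.45):
    """Turn (document, summary) pairs in the source language into
    (document, target-language summary) CLS pairs, keeping only samples
    whose summary survives a round trip through MT well enough."""
    cls_pairs = []
    for doc, summary in mono_pairs:
        tgt_summary = translate(summary, src, tgt)       # forward pass
        back_summary = translate(tgt_summary, tgt, src)  # round trip
        # Keep the pair only if the back-translation stays close to the
        # original summary, i.e. MT did not distort it too much.
        if rouge_f1(back_summary, summary) >= threshold:
            cls_pairs.append((doc, tgt_summary))
    return cls_pairs
```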
WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization
We introduce WikiLingua, a large-scale, multilingual dataset for the
evaluation of cross-lingual abstractive summarization systems. We extract
article and summary pairs in 18 languages from WikiHow, a high-quality,
collaborative resource of how-to guides on a diverse set of topics written by
human authors. We create gold-standard article-summary alignments across
languages by aligning the images that are used to describe each how-to step in
an article. As a set of baselines for further studies, we evaluate the
performance of existing cross-lingual abstractive summarization methods on our
dataset. We further propose a method for direct cross-lingual summarization
(i.e., without requiring translation at inference time) by leveraging synthetic
data and Neural Machine Translation as a pre-training step. Our method
significantly outperforms the baseline approaches while being more
cost-efficient during inference.
Comment: Findings of EMNLP 2020
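The image-based alignment described above relies on the fact that different language editions of a WikiHow article illustrate corresponding steps with the same images, so matching image identifiers pairs up steps across languages. A minimal sketch of that idea follows; the `Step` structure and field names are assumptions for illustration, not the released WikiLingua code.

```python
# Minimal sketch of image-based step alignment across languages.
# Assumption (illustrative): each parsed article is a list of steps, and
# every step carries the identifier (e.g. filename) of the image that
# illustrates it, shared across language editions of the same article.

from dataclasses import dataclass

@dataclass
class Step:
    image_id: str   # shared across language editions of the article
    paragraph: str  # the how-to step text in this language
    summary: str    # the one-line summary of the step

def align_steps(steps_a: list[Step], steps_b: list[Step]) -> list[tuple[Step, Step]]:
    """Pair up steps from two language editions of one article by the
    image that illustrates each step."""
    by_image = {step.image_id: step for step in steps_b}
    return [(step, by_image[step.image_id])
            for step in steps_a
            if step.image_id in by_image]

# An aligned (English step, Spanish step) pair then yields a gold
# cross-lingual example: the Spanish paragraph paired with the English
# step summary, and vice versa.
```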