XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages
The lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low-resource (LR) languages a critical problem.
Existing work on Wikipedia text generation has focused on English only where
English reference articles are summarized to generate English Wikipedia pages.
However, for low-resource languages, the scarcity of reference articles makes
monolingual summarization ineffective in solving this problem. Hence, in this
work, we propose XWikiGen, which is the task of cross-lingual multi-document
summarization of text from multiple reference articles, written in various
languages, to generate Wikipedia-style text. Accordingly, we contribute a
benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five
domains and eight languages. We harness this dataset to train a two-stage system whose input is a set of citations and a section title, and whose output is a section-specific LR summary. The proposed system is based on the novel idea of using neural unsupervised extractive summarization to coarsely identify salient information, followed by a neural abstractive model to generate the section-specific text. Extensive experiments show that multi-domain training is better than the multi-lingual setup on average.
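To make the two-stage design concrete, below is a minimal Python sketch of such a pipeline. The centroid-similarity salience scoring, the multilingual sentence encoder, the mT5 checkpoint, and the title-plus-citations prompt format are all illustrative assumptions; the abstract does not specify the exact models or scoring used.

```python
# Minimal sketch of a two-stage extractive-then-abstractive pipeline.
# Stage 1 ranks reference sentences by unsupervised salience; stage 2
# conditions a seq2seq model on the selected sentences plus the section
# title. Model choices here are assumptions, not the paper's exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def extract_salient(sentences, k=10):
    """Stage 1: score sentences by cosine similarity to the centroid embedding."""
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    emb = encoder.encode(sentences, normalize_embeddings=True)
    centroid = emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    scores = emb @ centroid
    top = np.argsort(-scores)[:k]
    return [sentences[i] for i in sorted(top)]  # keep original document order

def generate_section(section_title, reference_sentences, lang="hi"):
    """Stage 2: abstractive generation conditioned on title + extracted text."""
    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
    salient = extract_salient(reference_sentences)
    prompt = f"{lang} | {section_title} | " + " ".join(salient)
    inputs = tokenizer(prompt, return_tensors="pt",
                       truncation=True, max_length=1024)
    out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

In practice, stage 2 would be fine-tuned on XWikiRef-style pairs (cited reference text as input, section text as target); the off-the-shelf checkpoint above only illustrates the data flow.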
XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages
Multiple business scenarios require the automated generation of descriptive, human-readable text from structured input data. Hence, fact-to-text generation systems have been developed for various downstream tasks such as generating soccer reports, weather and financial reports, medical reports, and person biographies. Unfortunately, previous work on fact-to-text (F2T) generation has focused primarily on English, owing to the high availability of relevant datasets.
Only recently was the problem of cross-lingual fact-to-text (XF2T) generation across multiple languages proposed, along with XALIGN, a dataset covering eight languages. However, there has been no rigorous work on the actual XF2T generation problem. We extend the XALIGN dataset with annotated data for four more languages: Punjabi, Malayalam, Assamese, and Oriya. We conduct an extensive
study using popular Transformer-based text generation models on our extended
multi-lingual dataset, which we call XALIGNV2. Further, we investigate the performance of different text generation strategies: multiple variations of pretraining, fact-aware embeddings, and structure-aware input encoding. Our extensive experiments show that a multi-lingual mT5 model that uses fact-aware embeddings with structure-aware input encoding leads to the best results on average across the twelve languages. We make our code, dataset, and model publicly available, and hope that this will help advance further research in this critical area.
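As a rough illustration of what XF2T generation with structure-aware input encoding can look like, here is a short Python sketch. The <S>/<P>/<O> role tokens, the language-tag prefix, and the mT5 checkpoint are assumptions made for this example; the abstract does not pin down the authors' exact input format, and fact-aware embeddings (which modify the model internals) are omitted.

```python
# Sketch of XF2T-style generation from linearized facts. The role-token
# scheme and prompt format are illustrative assumptions, not the authors'
# released encoding.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def linearize_facts(facts):
    """Mark each (subject, predicate, object) triple with role tokens."""
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in facts)

def facts_to_text(facts, lang="pa"):
    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
    # Register the role markers as new tokens and grow the embedding matrix
    # so they get their own entries (randomly initialized until fine-tuned).
    tokenizer.add_tokens(["<S>", "<P>", "<O>"])
    model.resize_token_embeddings(len(tokenizer))
    prompt = f"generate {lang}: " + linearize_facts(facts)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    out = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Hypothetical example input: facts about a person, target language Punjabi.
facts = [("Sachin Tendulkar", "occupation", "cricketer"),
         ("Sachin Tendulkar", "birth place", "Mumbai")]
print(facts_to_text(facts))
```

A model fine-tuned on XALIGNV2-style (facts, sentence) pairs would be needed for meaningful output; the untrained checkpoint above only demonstrates the encoding and generation flow.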