IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages
Natural Language Generation (NLG) for non-English languages is hampered by
the scarcity of datasets in these languages. In this paper, we present the
IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic
languages. We focus on five diverse tasks, namely, biography generation using
Wikipedia infoboxes, news headline generation, sentence summarization,
paraphrase generation, and question generation. We describe the created
datasets and use them to benchmark the performance of several monolingual and
multilingual baselines that leverage pre-trained sequence-to-sequence models.
Our results exhibit the strong performance of multilingual language-specific
pre-trained models, and the utility of models trained on our dataset for other
related NLG tasks. Our dataset creation methods can be easily applied to
modest-resource languages as they involve simple steps such as scraping news
articles and Wikipedia infoboxes, light cleaning, and pivoting through machine
translation data. To the best of our knowledge, the IndicNLG Benchmark is the
first NLG benchmark for Indic languages and the most diverse multilingual NLG
dataset, with approximately 8M examples across 5 tasks and 11 languages. The
datasets and models are publicly available at
https://ai4bharat.iitm.ac.in/indicnlg-suite.
Comment: Accepted at EMNLP 202
Multilingual Lexical Simplification via Paraphrase Generation
Lexical simplification (LS) methods based on pretrained language models have
made remarkable progress, generating potential substitutes for a complex word
through analysis of its contextual surroundings. However, these methods require
separate pretrained models for different languages and disregard the
preservation of sentence meaning. In this paper, we propose a novel
multilingual LS method via paraphrase generation, as paraphrases provide
diversity in word selection while preserving the sentence's meaning. We regard
paraphrasing as a zero-shot translation task within multilingual neural machine
translation that supports hundreds of languages. After feeding the input
sentence into the encoder of paraphrase modeling, we generate the substitutes
based on a novel decoding strategy that concentrates solely on the lexical
variations of the complex word. Experimental results demonstrate that our
approach significantly surpasses BERT-based methods and the zero-shot
GPT-3-based method on English, Spanish, and Portuguese.
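The core idea of the abstract above, generating substitutes by keeping only the lexical variation at the complex word's position, can be illustrated with a minimal sketch. This is a hypothetical string-level toy, not the paper's method: the real approach constrains the NMT decoder itself, and the function name `extract_substitutes` and the example paraphrases are assumptions for illustration.

```python
def extract_substitutes(sentence, complex_word, paraphrases):
    """Toy slot-filling illustration: collect candidate substitutes for
    `complex_word` from paraphrases that keep the surrounding context
    intact and vary only the complex word's slot."""
    tokens = sentence.split()
    idx = tokens.index(complex_word)
    prefix, suffix = tokens[:idx], tokens[idx + 1:]
    substitutes = []
    for para in paraphrases:
        p_toks = para.split()
        # keep paraphrases whose only change is at the complex word's slot
        if p_toks[:idx] == prefix and p_toks[idx + 1:] == suffix:
            cand = p_toks[idx]
            if cand != complex_word and cand not in substitutes:
                substitutes.append(cand)
    return substitutes

subs = extract_substitutes(
    "the weather is atrocious today",
    "atrocious",
    ["the weather is awful today",
     "the weather is terrible today",
     "today the weather is bad"],   # rephrased context: discarded
)
# subs == ["awful", "terrible"]
```

Working on paraphrases rather than standalone word predictions is what preserves sentence meaning: any candidate that survives the context filter was produced inside a meaning-preserving rewrite of the whole sentence.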
Controlled Natural Language Generation from a Multilingual FrameNet-based Grammar
This paper presents a currently bilingual but potentially multilingual
FrameNet-based grammar library implemented in Grammatical Framework. The
contribution of this paper is two-fold. First, it offers a methodological
approach to automatically generate the grammar based on semantico-syntactic
valence patterns extracted from FrameNet-annotated corpora. Second, it provides
a proof of concept for two use cases illustrating how the acquired multilingual
grammar can be exploited in different CNL applications in the domains of arts
and tourism.