IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages
Natural Language Generation (NLG) for non-English languages is hampered by
the scarcity of datasets in these languages. In this paper, we present the
IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic
languages. We focus on five diverse tasks, namely, biography generation using
Wikipedia infoboxes, news headline generation, sentence summarization,
paraphrase generation, and question generation. We describe the created
datasets and use them to benchmark the performance of several monolingual and
multilingual baselines that leverage pre-trained sequence-to-sequence models.
Our results exhibit the strong performance of multilingual language-specific
pre-trained models, and the utility of models trained on our dataset for other
related NLG tasks. Our dataset creation methods can be easily applied to
modest-resource languages as they involve simple steps such as scraping news
articles and Wikipedia infoboxes, light cleaning, and pivoting through machine
translation data. To the best of our knowledge, the IndicNLG Benchmark is the
first NLG benchmark for Indic languages and the most diverse multilingual NLG
dataset, with approximately 8M examples across 5 tasks and 11 languages. The
datasets and models are publicly available at
https://ai4bharat.iitm.ac.in/indicnlg-suite.
Comment: Accepted at EMNLP 202
Multilingual Lexical Simplification via Paraphrase Generation
Lexical simplification (LS) methods based on pretrained language models have
made remarkable progress, generating potential substitutes for a complex word
through analysis of its contextual surroundings. However, these methods require
separate pretrained models for different languages and disregard the
preservation of sentence meaning. In this paper, we propose a novel
multilingual LS method via paraphrase generation, as paraphrases provide
diversity in word selection while preserving the sentence's meaning. We regard
paraphrasing as a zero-shot translation task within multilingual neural machine
translation that supports hundreds of languages. After feeding the input
sentence into the encoder of paraphrase modeling, we generate the substitutes
based on a novel decoding strategy that concentrates solely on the lexical
variations of the complex word. Experimental results demonstrate that our
approach significantly surpasses BERT-based methods and the zero-shot
GPT-3-based method on English, Spanish, and Portuguese.
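The core idea of the abstract above, generating substitutes by keeping only the lexical variation at the complex word's position, can be illustrated with a minimal sketch. This is a hypothetical string-level toy, not the paper's method: the real approach constrains the NMT decoder itself, and the function name `extract_substitutes` and the example paraphrases are assumptions for illustration.

```python
def extract_substitutes(sentence, complex_word, paraphrases):
    """Toy slot-filling illustration: collect candidate substitutes for
    `complex_word` from paraphrases that keep the surrounding context
    intact and vary only the complex word's slot."""
    tokens = sentence.split()
    idx = tokens.index(complex_word)
    prefix, suffix = tokens[:idx], tokens[idx + 1:]
    substitutes = []
    for para in paraphrases:
        p_toks = para.split()
        # keep paraphrases whose only change is at the complex word's slot
        if p_toks[:idx] == prefix and p_toks[idx + 1:] == suffix:
            cand = p_toks[idx]
            if cand != complex_word and cand not in substitutes:
                substitutes.append(cand)
    return substitutes

subs = extract_substitutes(
    "the weather is atrocious today",
    "atrocious",
    ["the weather is awful today",
     "the weather is terrible today",
     "today the weather is bad"],   # rephrased context: discarded
)
# subs == ["awful", "terrible"]
```

Working on paraphrases rather than standalone word predictions is what preserves sentence meaning: any candidate that survives the context filter was produced inside a meaning-preserving rewrite of the whole sentence.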
Controlled Natural Language Generation from a Multilingual FrameNet-based Grammar
This paper presents a currently bilingual but potentially multilingual
FrameNet-based grammar library implemented in Grammatical Framework. The
contribution of this paper is two-fold. First, it offers a methodological
approach to automatically generate the grammar based on semantico-syntactic
valence patterns extracted from FrameNet-annotated corpora. Second, it provides
a proof of concept for two use cases illustrating how the acquired multilingual
grammar can be exploited in different CNL applications in the domains of arts
and tourism.