Multilingual Lexical Simplification via Paraphrase Generation
Lexical simplification (LS) methods based on pretrained language models have
made remarkable progress, generating potential substitutes for a complex word
through analysis of its contextual surroundings. However, these methods require
separate pretrained models for different languages and disregard the
preservation of sentence meaning. In this paper, we propose a novel
multilingual LS method via paraphrase generation, as paraphrases provide
diversity in word selection while preserving the sentence's meaning. We regard
paraphrasing as a zero-shot translation task within multilingual neural machine
translation that supports hundreds of languages. After feeding the input
sentence into the encoder of paraphrase modeling, we generate the substitutes
based on a novel decoding strategy that concentrates solely on the lexical
variations of the complex word. Experimental results demonstrate that our
approach surpasses BERT-based methods and a zero-shot GPT3-based method
significantly on English, Spanish, and Portuguese.
Recipes for Adapting Pre-trained Monolingual and Multilingual Models to Machine Translation
There has been recent success in pre-training on monolingual data and
fine-tuning on Machine Translation (MT), but it remains unclear how to best
leverage a pre-trained model for a given MT task. This paper investigates the
benefits and drawbacks of freezing parameters, and adding new ones, when
fine-tuning a pre-trained model on MT. We focus on (1) fine-tuning a model
trained only on English monolingual data, BART, and (2) fine-tuning a model trained
on monolingual data from 25 languages, mBART. For BART we get the best
performance by freezing most of the model parameters, and adding extra
positional embeddings. For mBART we match the performance of naive fine-tuning
for most language pairs, and outperform it for Nepali to English (0.5 BLEU) and
Czech to English (0.6 BLEU), all with a lower memory cost at training time.
When constraining ourselves to an out-of-domain training set for Vietnamese to
English we outperform the fine-tuning baseline by 0.9 BLEU.
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
We present Samanantar, the largest publicly available parallel corpora
collection for Indic languages. The collection contains a total of 49.7 million
sentence pairs between English and 11 Indic languages (from two language
families). Specifically, we compile 12.4 million sentence pairs from existing,
publicly-available parallel corpora, and additionally mine 37.4 million
sentence pairs from the web, resulting in a 4x increase. We mine the parallel
sentences from the web by combining many corpora, tools, and methods: (a)
web-crawled monolingual corpora, (b) document OCR for extracting sentences from
scanned documents, (c) multilingual representation models for aligning
sentences, and (d) approximate nearest neighbor search for searching in a large
collection of sentences. Human evaluation of samples from the newly mined
corpora validates the high quality of the parallel sentences across 11
languages. Further, we extract 83.4 million sentence pairs between all 55 Indic
language pairs from the English-centric parallel corpus using English as the
pivot language. We trained multilingual NMT models spanning all these languages
on Samanantar, which outperform existing models and baselines on publicly
available benchmarks, such as FLORES, establishing the utility of Samanantar.
Our data and models are available publicly at
https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance
research in NMT and multilingual NLP for Indic languages.
Comment: Accepted to the Transactions of the Association for Computational
Linguistics (TACL)
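The pivot step described in the abstract above (deriving pairs between two Indic languages from English-centric corpora) can be illustrated with a minimal sketch. This is not the Samanantar pipeline: the function name `pivot_pairs`, the exact-match join on the English sentence, and the toy transliterated data are all assumptions made for illustration. The last lines only check the arithmetic that 11 languages give 55 unordered language pairs.

```python
from collections import defaultdict
from itertools import combinations


def pivot_pairs(en_to_x, en_to_y):
    """Join two English-centric corpora on identical English sentences.

    Each input is a list of (english_sentence, other_language_sentence)
    tuples. Exact string matching on the English side is an assumption
    of this sketch.
    """
    by_english = defaultdict(list)
    for en, x in en_to_x:
        by_english[en].append(x)

    pairs = []
    for en, y in en_to_y:
        # Every X sentence sharing this English pivot pairs with y.
        for x in by_english.get(en, []):
            pairs.append((x, y))
    return pairs


# Toy, transliterated examples (hypothetical data, not from the corpus).
en_hi = [("Hello", "namaste"), ("Thank you", "dhanyavaad")]
en_ta = [("Hello", "vanakkam"), ("Good night", "iniya iravu")]

hi_ta = pivot_pairs(en_hi, en_ta)
print(hi_ta)

# 11 languages give C(11, 2) = 55 unordered language pairs.
languages = [f"lang{i}" for i in range(11)]
num_pairs = len(list(combinations(languages, 2)))
print(num_pairs)
```

Only the sentence pair that shares the English pivot "Hello" survives the join, which is why a pivot-derived corpus depends on how much the English sides of the two source corpora overlap.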
Confusion Modelling - An Estimation by Semantic Embeddings
Few research works have approached the task of assessing the coherence of a conversation from its negative perspective, ‘confusion’, rather than from coherence itself. Influencing embeddings to learn from similarity/dissimilarity measures between two utterances, such as distance and cosine similarity, equips them with the semantics to differentiate a coherent conversation from an incoherent one through the detection of the negative entity, ‘confusion’. This research attempts to measure the coherence of a conversation between a human and a conversational agent by means of such semantic embeddings, trained from scratch by an architecture that centres its learning on the distance between the embeddings. The state-of-the-art general BERT embeddings and the state-of-the-art conversation-specific ConveRT embeddings, in addition to GloVe embeddings, are also tested on this architecture. Since confusion is a more sensible entity, real human labelling performance is set as the baseline for evaluating the models. The base design did not perform well against the human score, but the pre-trained embeddings, when plugged into the base architecture, gave performance boosts in increasing order from BERT through GloVe to ConveRT. The intuition behind the base conceptual design, and its efficiency, are supported by the fact that the variant with ConveRT embeddings plugged into the base design outperformed the original ConveRT’s state-of-the-art performance on generating similarity scores. Although the models did not achieve performance comparable to real human performance, there was considerable overlap between the scores of the ConveRT variant and the human scores, which is an encouraging result, since achieving human performance is always the state of the art in any research domain.
Also, based on the results, this research joins the group of works claiming that BERT is unsuitable for conversation-specific modelling and embedding work.
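The similarity/dissimilarity measures named in the abstract above (distance and cosine similarity between two utterance embeddings) can be written out as a short sketch. The two-dimensional vectors below are made-up stand-ins for utterance embeddings, and the function names are hypothetical; the abstract does not specify the paper's architecture or scoring, so this only shows the measures themselves.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Made-up 2-d "embeddings" (assumption: real embeddings have far more dimensions).
utterance = [3.0, 4.0]
related_reply = [3.0, 4.0]     # same direction as the utterance
unrelated_reply = [4.0, -3.0]  # orthogonal to the utterance

sim_related = cosine_similarity(utterance, related_reply)
sim_unrelated = cosine_similarity(utterance, unrelated_reply)
dist_unrelated = euclidean_distance(utterance, unrelated_reply)
print(sim_related, sim_unrelated, dist_unrelated)
```

In this toy setup, a reply pointing in the same direction as the utterance scores a cosine similarity of 1, while an orthogonal reply scores 0; a low similarity or large distance is the kind of signal that a confusion detector could learn from.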