79 research outputs found
An Analysis of Source-Side Grammatical Errors in NMT
The quality of Neural Machine Translation (NMT) has been shown to
significantly degrade when confronted with source-side noise. We present the
first large-scale study of state-of-the-art English-to-German NMT on real
grammatical noise, by evaluating on several Grammar Correction corpora. We
present methods for evaluating NMT robustness without true references, and we
use them for extensive analysis of the effects that different grammatical
errors have on the NMT output. We also introduce a technique for visualizing
the divergence distribution caused by a source-side error, which allows for
additional insights.
Comment: Accepted and to be presented at BlackboxNLP 2019
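The abstract does not spell out the reference-free metrics, but the core idea of evaluating robustness without true references can be illustrated by comparing the system's translation of the clean source with its translation of the grammatically noisy source. The sketch below is a minimal illustration, assuming a placeholder translate function standing in for any English-to-German NMT system; it is not the paper's actual evaluation code.

```python
# Minimal sketch of reference-free robustness scoring: translate the clean and the
# grammatically noisy source, then measure how much the two outputs diverge.
# `translate` is a placeholder for any English->German NMT system (an assumption,
# not the paper's actual setup).

from difflib import SequenceMatcher

def divergence(clean_src: str, noisy_src: str, translate) -> float:
    """Return 1 - similarity between translations of the clean and noisy source."""
    hyp_clean = translate(clean_src).split()
    hyp_noisy = translate(noisy_src).split()
    sim = SequenceMatcher(None, hyp_clean, hyp_noisy).ratio()
    return 1.0 - sim  # 0.0 = identical outputs, 1.0 = completely different

# Toy usage with a "system" that simply echoes its input:
if __name__ == "__main__":
    echo = lambda s: s
    print(divergence("She has three cats .", "She have three cats .", echo))
```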
Pushing the Limits of Low-Resource Morphological Inflection
Recent years have seen exceptional strides in the task of automatic
morphological inflection generation. However, for a long tail of languages the
necessary resources are hard to come by, and state-of-the-art neural methods
that work well under higher resource settings perform poorly in the face of a
paucity of data. In response, we propose a battery of improvements that greatly
improve performance under such low-resource conditions. First, we present a
novel two-step attention architecture for the inflection decoder. In addition,
we investigate the effects of cross-lingual transfer from single and multiple
languages, as well as monolingual data hallucination. The macro-averaged
accuracy of our models exceeds the state of the art by 15 percentage
points. We also identify the crucial factors for success with cross-lingual
transfer for morphological inflection: typological similarity and a common
representation across languages.
Comment: To appear at EMNLP 2019
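The abstract mentions monolingual data hallucination only in passing; a simplified sketch of the general idea, replacing a stem shared by lemma and inflected form with random characters to create synthetic training triples, is given below. It is a loose approximation under assumed inputs, not the paper's exact alignment-based procedure.

```python
# Simplified sketch of data hallucination for inflection: find a shared "stem"
# between lemma and inflected form and replace it with random characters while
# keeping the morphological tag. Approximates, but does not reproduce, the
# alignment-based hallucination used in the paper.

import random
import string
from difflib import SequenceMatcher

def hallucinate(lemma: str, form: str, tag: str, alphabet: str = string.ascii_lowercase):
    match = SequenceMatcher(None, lemma, form).find_longest_match(0, len(lemma), 0, len(form))
    if match.size < 3:            # no usable shared stem: keep the original example
        return lemma, form, tag
    fake_stem = "".join(random.choice(alphabet) for _ in range(match.size))
    new_lemma = lemma[:match.a] + fake_stem + lemma[match.a + match.size:]
    new_form = form[:match.b] + fake_stem + form[match.b + match.size:]
    return new_lemma, new_form, tag

# e.g. hallucinate("walk", "walked", "V;PST") might yield ("qzvx", "qzvxed", "V;PST")
```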
An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models
The capacity and effectiveness of pre-trained multilingual models (MLMs) for
zero-shot cross-lingual transfer are well established. However, the phenomena of
positive and negative transfer, and the effect of language choice, are not yet
fully understood, especially in the complex setting of massively multilingual
LMs. We propose an efficient method for studying the influence of a transfer
language on zero-shot performance in a target language. Unlike previous work,
our approach disentangles downstream tasks from language, using dedicated
adapter units. Our findings suggest that some languages have little effect on
others, while other languages, especially ones unseen during pre-training, can
be extremely beneficial or detrimental for different target languages. We find
that no transfer language is beneficial for all target languages. Curiously, we
observe that languages previously unseen by MLMs consistently benefit from
transfer from almost any language. We additionally
use our modular approach to quantify negative interference efficiently and
categorize languages accordingly. Furthermore, we provide a list of promising
transfer-target language configurations that consistently lead to target
language performance improvements. Code and data are publicly available:
https://github.com/ffaisal93/neg_in
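As a rough illustration of how dedicated adapter units can disentangle task from language, the following PyTorch sketch freezes a shared encoder and attaches swappable bottleneck adapters per language and per task. The module names, sizes, and the generic encoder are illustrative assumptions, not the paper's configuration.

```python
# Minimal PyTorch sketch of the adapter idea: a frozen shared encoder plus small
# bottleneck adapters, so "language" and "task" parameters can be swapped
# independently at evaluation time.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, x):                        # residual bottleneck
        return x + self.up(torch.relu(self.down(x)))

class AdapterModel(nn.Module):
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # shared MLM stays frozen
            p.requires_grad = False
        self.lang_adapters = nn.ModuleDict()     # one adapter per language
        self.task_adapters = nn.ModuleDict()     # one adapter per task

    def forward(self, x, lang: str, task: str):
        h = self.encoder(x)
        h = self.lang_adapters[lang](h)          # swap the language adapter here
        return self.task_adapters[task](h)       # task adapter is language-agnostic

# Hypothetical usage: train the task adapter with one language adapter plugged in,
# then evaluate zero-shot with a different language adapter:
#   model = AdapterModel(encoder=my_frozen_mlm)
#   model.lang_adapters["en"] = BottleneckAdapter()
#   model.lang_adapters["mt"] = BottleneckAdapter()
#   model.task_adapters["ner"] = BottleneckAdapter()
#   logits = model(x, lang="mt", task="ner")
```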
Phylogeny-Inspired Adaptation of Multilingual Models to New Languages
Large pretrained multilingual models, trained on dozens of languages, have
delivered promising results due to their cross-lingual learning capabilities on a
variety of language tasks. Further adapting these models to specific languages,
especially ones unseen during pre-training, is an important goal towards
expanding the coverage of language technologies. In this study, we show how
language phylogenetic information can be used to improve cross-lingual transfer,
leveraging closely related languages in a structured, linguistically informed
manner. We perform adapter-based training on languages from diverse language
families (Germanic, Uralic, Tupian, Uto-Aztecan) and evaluate on both syntactic
and semantic tasks, obtaining more than 20% relative performance improvements
over strong commonly used baselines, especially on languages unseen during
pre-training.
Comment: Accepted at the AACL 2022 Main Conference
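To illustrate how phylogenetic structure might guide adapter composition, the short sketch below stacks adapters along the path from language family down to the target language, reusing the BottleneckAdapter module sketched above. The toy tree and the function are hypothetical, not the paper's implementation.

```python
# Hedged sketch of phylogeny-guided composition: compose adapters from the most
# generic node (family) down to the most specific (language), so a low-resource or
# unseen language can reuse family- and genus-level parameters.

import torch.nn as nn

PHYLOGENY = {"faroese": ["germanic", "north_germanic", "faroese"]}   # toy tree path

def phylo_stack(adapters: nn.ModuleDict, language: str) -> nn.Sequential:
    """Stack the adapters along the phylogenetic path to `language`, skipping missing nodes."""
    path = PHYLOGENY[language]
    return nn.Sequential(*(adapters[node] for node in path if node in adapters))
```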
A Study on Scaling Up Multilingual News Framing Analysis
Media framing is the study of strategically selecting and presenting specific
aspects of political issues to shape public opinion. Despite its relevance to
almost all societies around the world, research has been limited due to the
lack of available datasets and other resources. This study explores the
possibility of dataset creation through crowdsourcing, utilizing non-expert
annotators to develop training corpora. We first extend framing analysis beyond
English news to a multilingual context (12 typologically diverse languages)
through automatic translation. We also present a novel benchmark in Bengali and
Portuguese on the immigration and same-sex marriage domains. Additionally, we
show that a system trained on our crowd-sourced dataset, combined with other
existing ones, leads to a 5.32 percentage-point increase over the baseline,
showing that crowdsourcing is a viable option. Finally, we study the performance
of large language models (LLMs) for this task, finding that task-specific
fine-tuning is a better approach than employing bigger, non-specialized models.
Comment: Accepted at NAACL 2024
Script-Agnostic Language Identification
Language identification is used as the first step in many data collection and
crawling efforts because it allows us to sort online text into
language-specific buckets. However, many modern languages, such as Konkani,
Kashmiri, and Punjabi, among others, are synchronically written in several scripts.
Moreover, languages with different writing systems do not share significant
lexical, semantic, and syntactic properties in neural representation spaces,
which is a disadvantage for closely related languages and low-resource
languages, especially those from the Indian Subcontinent. To counter this, we
propose learning script-agnostic representations using several different
experimental strategies (upscaling, flattening, and script mixing), focusing on
four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find
that word-level script randomization and exposure to a language written in
multiple scripts are extremely valuable for downstream script-agnostic language
identification, while also maintaining competitive performance on naturally
occurring text.
Comment: Under review at ACL Rolling Review
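One simple way to picture word-level script randomization is to exploit the parallel layout of Indic Unicode blocks and remap each word into a randomly chosen Dravidian script. The sketch below is a naive, illustrative approximation; the strategies studied in the paper (upscaling, flattening, script mixing) differ in detail.

```python
# Rough sketch of word-level script randomization for Dravidian text, using the
# parallel layout of Indic Unicode blocks (Tamil U+0B80, Telugu U+0C00, Kannada
# U+0C80, Malayalam U+0D00). The codepoint-offset mapping is approximate and only
# meant to illustrate the idea of mixing scripts at the word level.

import random

BLOCK_START = {"tamil": 0x0B80, "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00}

def remap_word(word: str, src: str, tgt: str) -> str:
    offset = BLOCK_START[tgt] - BLOCK_START[src]
    out = []
    for ch in word:
        cp = ord(ch)
        if BLOCK_START[src] <= cp < BLOCK_START[src] + 0x80:   # inside the source block
            out.append(chr(cp + offset))
        else:                                                  # punctuation, digits, etc.
            out.append(ch)
    return "".join(out)

def randomize_scripts(sentence: str, src: str) -> str:
    """Rewrite each word of a sentence in a randomly chosen Dravidian script."""
    scripts = list(BLOCK_START)
    return " ".join(remap_word(w, src, random.choice(scripts)) for w in sentence.split())
```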