SwissDial: Parallel Multidialectal Corpus of Spoken Swiss German
Swiss German is a dialect continuum whose natively acquired dialects
significantly differ from the formal variety of the language. These dialects
are mostly used for verbal communication and do not have standard orthography.
This has led to a lack of annotated datasets, rendering the use of many NLP
methods infeasible. In this paper, we introduce the first annotated parallel
corpus of spoken Swiss German across 8 major dialects, plus a Standard German
reference. Our goal has been to create and to make available a basic dataset
for employing data-driven NLP applications in Swiss German. We present our data
collection procedure in detail and validate the quality of our corpus by
conducting experiments with recent neural models for speech synthesis.
ProMap: Effective Bilingual Lexicon Induction via Language Model Prompting
Bilingual Lexicon Induction (BLI), where words are translated between two
languages, is an important NLP task. While noticeable progress has been achieved
on BLI in rich-resource languages using static word embeddings, word
translation performance can be further improved by incorporating information
from contextualized word embeddings. In this paper, we introduce ProMap, a
novel approach for BLI that leverages the power of prompting pretrained
multilingual and multidialectal language models to address these challenges. To
overcome the employment of subword tokens in these models, ProMap relies on an
effective padded prompting of language models with a seed dictionary that
achieves good performance when used independently. We also demonstrate the
effectiveness of ProMap in re-ranking results from other BLI methods such as
with aligned static word embeddings. When evaluated on both rich-resource and
low-resource languages, ProMap consistently achieves state-of-the-art results.
Furthermore, ProMap performs strongly in few-shot scenarios (even with
fewer than 10 training examples), making it a valuable tool for low-resource
language translation. Overall, we believe our method offers an exciting and
promising direction for BLI in general and for low-resource languages in
particular. ProMap code and data are available at
\url{https://github.com/4mekki4/promap}.
Comment: To appear in IJCNLP-AACL 202
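The seed-dictionary prompting idea can be pictured with a minimal sketch. The prompt template, language pair, and function name below are illustrative assumptions, not the actual ProMap code or API:

```python
# Minimal sketch of few-shot prompting for word translation, in the
# spirit of prompting a language model with a seed dictionary.
# Template, language pair, and seed pairs are hypothetical.

def build_prompt(seed_pairs, source_word, n_shots=3):
    """Build a few-shot translation prompt from seed dictionary pairs."""
    lines = [f"English: {en} => French: {fr}" for en, fr in seed_pairs[:n_shots]]
    # Leave the target slot open for the language model to fill in.
    lines.append(f"English: {source_word} => French:")
    return "\n".join(lines)

seed = [("dog", "chien"), ("house", "maison"), ("water", "eau")]
prompt = build_prompt(seed, "bread")
print(prompt)
```

In a full system, the prompt would be fed to a pretrained multilingual language model and the completion read off as the candidate translation; padding of subword tokens, as the abstract notes, is what ProMap adds on top of this basic scheme.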
The Paradigm Discovery Problem
This work treats the paradigm discovery problem (PDP), the task of learning
an inflectional morphological system from unannotated sentences. We formalize
the PDP and develop evaluation metrics for judging systems. Using currently
available resources, we construct datasets for the task. We also devise a
heuristic benchmark for the PDP and report empirical results on five diverse
languages. Our benchmark system first makes use of word embeddings and string
similarity to cluster forms by cell and by paradigm. Then, we bootstrap a
neural transducer on top of the clustered data to predict words to realize the
empty paradigm slots. An error analysis of our system suggests clustering by
cell across different inflection classes is the most pressing challenge for
future work. Our code and data are available for public use.
Comment: Forthcoming at ACL 202
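The first stage of the benchmark, grouping unannotated forms into candidate paradigms, can be sketched with a toy string-similarity clusterer. The greedy strategy, similarity measure, and threshold here are illustrative assumptions, not the authors' system:

```python
# Toy sketch: cluster word forms into candidate paradigms by string
# similarity. The real benchmark also uses word embeddings; this sketch
# keeps only the string-similarity signal for illustration.
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-overlap similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def cluster_paradigms(forms, threshold=0.6):
    """Greedily attach each form to the first cluster whose
    representative is similar enough, else start a new cluster."""
    clusters = []
    for form in forms:
        for cluster in clusters:
            if similarity(form, cluster[0]) >= threshold:
                cluster.append(form)
                break
        else:
            clusters.append([form])
    return clusters

forms = ["walk", "walks", "walked", "run", "runs", "running"]
print(cluster_paradigms(forms))
# → [['walk', 'walks', 'walked'], ['run', 'runs', 'running']]
```

Clustering by cell (grouping forms that realize the same morphosyntactic slot across paradigms) is the harder direction, which is exactly the challenge the error analysis highlights.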
Joint Diacritization, Lemmatization, Normalization, and Fine-Grained Morphological Tagging
Semitic languages can be highly ambiguous, having several interpretations of
the same surface forms, and morphologically rich, having many morphemes that
realize several morphological features. This is further exacerbated for
dialectal content, which is more prone to noise and lacks a standard
orthography. The morphological features can be lexicalized, like lemmas and
diacritized forms, or non-lexicalized, like gender, number, and part-of-speech
tags, among others. Joint modeling of the lexicalized and non-lexicalized
features can identify more intricate morphological patterns, which provide
better context modeling, and further disambiguate ambiguous lexical choices.
However, the different modeling granularity can make joint modeling more
difficult. Our approach models the different features jointly, whether
lexicalized (on the character-level), where we also model surface form
normalization, or non-lexicalized (on the word-level). We use Arabic as a test
case, and achieve state-of-the-art results for Modern Standard Arabic, with 20%
relative error reduction, and Egyptian Arabic (a dialectal variant of Arabic),
with an 11% relative error reduction.
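The mixed-granularity setup can be pictured with a schematic sketch. The feature names and the example analysis below are illustrative assumptions, not output of the authors' system:

```python
# Schematic illustration of joint morphological analysis with mixed
# granularity: lexicalized features are modeled on the character level,
# non-lexicalized features on the word level. Values are hypothetical.

analysis = {
    # character-level (lexicalized) features
    "lemma": "katab",
    "diacritized": "kataba",
    "normalized": "ktb",
    # word-level (non-lexicalized) features
    "pos": "VERB",
    "gender": "M",
    "number": "S",
}

CHAR_LEVEL = {"lemma", "diacritized", "normalized"}

# Route each predicted feature to the decoder granularity it belongs to.
char_feats = {k: v for k, v in analysis.items() if k in CHAR_LEVEL}
word_feats = {k: v for k, v in analysis.items() if k not in CHAR_LEVEL}
print(sorted(char_feats), sorted(word_feats))
```

Modeling both groups jointly, as the abstract argues, lets the word-level tags disambiguate the character-level lexical choices and vice versa.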
A review of sentiment analysis research in Arabic language
Sentiment analysis is a task of natural language processing which has
recently attracted increasing attention. However, sentiment analysis research
has mainly been carried out for the English language. Although Arabic is
becoming one of the most widely used languages on the Internet, only a few
studies have focused on Arabic sentiment analysis so far. In this paper, we
carry out an in-depth qualitative study of the most important research works in
this context by presenting limits and strengths of existing approaches. In
particular, we survey both approaches that leverage machine translation or
transfer learning to adapt English resources to Arabic and approaches that stem
directly from the Arabic language.
Sociolinguistically Driven Approaches for Just Natural Language Processing
Natural language processing (NLP) systems are now ubiquitous. Yet the benefits of these language technologies do not accrue evenly to all users, and indeed they can be harmful; NLP systems reproduce stereotypes, prevent speakers of non-standard language varieties from participating fully in public discourse, and re-inscribe historical patterns of linguistic stigmatization and discrimination. How harms arise in NLP systems, and who is harmed by them, can only be understood at the intersection of work on NLP, fairness and justice in machine learning, and the relationships between language and social justice. In this thesis, we propose to address two questions at this intersection: i) How can we conceptualize harms arising from NLP systems?, and ii) How can we quantify such harms?
We propose the following contributions. First, we contribute a model to collect the first large dataset of African American Language (AAL)-like social media text. We use the dataset to quantify the performance of two types of NLP systems, identifying disparities in model performance between Mainstream U.S. English (MUSE)- and AAL-like text. Turning to the landscape of bias in NLP more broadly, we then provide a critical survey of the emerging literature on bias in NLP and identify its limitations. Drawing on work across sociology, sociolinguistics, linguistic anthropology, social psychology, and education, we provide an account of the relationships between language and injustice, propose a taxonomy of harms arising from NLP systems grounded in those relationships, and propose a set of guiding research questions for work on bias in NLP. Finally, we adapt the measurement modeling framework from the quantitative social sciences to effectively evaluate approaches for quantifying bias in NLP systems. We conclude with a discussion of recent work on bias through the lens of style in NLP, raising a set of normative questions for future work.