A Multilingual BPE Embedding Space for Universal Sentiment Lexicon Induction
We present a new method for sentiment lexicon induction that is designed to be applicable to the entire range of typological diversity of the world's languages. We evaluate our method on Parallel Bible Corpus+ (PBC+), a parallel corpus of 1593 languages. The key idea is to use Byte Pair Encodings (BPEs) as basic units for multilingual embeddings. Through zero-shot transfer from English sentiment, we learn a seed lexicon for each language in the domain of PBC+. Through domain adaptation, we then generalize the domain-specific lexicon to a general one. We show, across typologically diverse languages in PBC+, good quality of seed and general-domain sentiment lexicons by intrinsic and extrinsic as well as automatic and human evaluation. We make freely available our code, seed sentiment lexicons for all 1593 languages, and induced general-domain sentiment lexicons for 200 languages.
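The zero-shot transfer step can be illustrated with a toy sketch: given a shared multilingual embedding space over BPE units, each target-language unit inherits the polarity of its nearest English seed word. The vectors, seed words, and target units below are all hypothetical; the paper trains the actual embeddings on PBC+.

```python
import numpy as np

# Toy multilingual embedding space over BPE units (hypothetical 2-d vectors;
# real embeddings would be trained on the parallel corpus).
emb = {
    "good":  np.array([0.9, 0.1]),    # English seed, positive
    "bad":   np.array([-0.8, 0.2]),   # English seed, negative
    "bueno": np.array([0.85, 0.15]),  # target-language BPE unit
    "malo":  np.array([-0.7, 0.25]),  # target-language BPE unit
}
seeds = {"good": +1, "bad": -1}

def cos(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def induce(unit):
    # Zero-shot transfer: assign the polarity of the most similar English seed.
    best = max(seeds, key=lambda s: cos(emb[unit], emb[s]))
    return seeds[best]

print(induce("bueno"))  # 1
print(induce("malo"))   # -1
```

A real seed lexicon would use many seed words per polarity and average or vote over the top-k neighbours rather than taking a single nearest seed.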
An Open Dataset and Model for Language Identification
Language identification (LID) is a fundamental step in many natural language
processing pipelines. However, current LID systems are far from perfect,
particularly on lower-resource languages. We present a LID model which achieves
a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201
languages, outperforming previous work. We achieve this by training on a
curated dataset of monolingual data, the reliability of which we ensure by
auditing a sample from each source and each language manually. We make both the
model and the dataset available to the research community. Finally, we carry
out detailed analysis into our model's performance, both in comparison to
existing open models and by language class.
Comment: To be published in ACL 202
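The two headline metrics, macro-average F1 and false positive rate, can be computed per language class and averaged. A minimal sketch with hypothetical gold and predicted labels (the paper evaluates over 201 languages; three are shown here):

```python
# Toy gold/predicted language labels for a handful of sentences (hypothetical).
gold = ["eng", "eng", "fra", "fra", "swa", "swa"]
pred = ["eng", "fra", "fra", "fra", "swa", "eng"]

labels = sorted(set(gold))

def f1(label):
    # Per-class F1: harmonic mean of precision and recall for one language.
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def fpr(label):
    # Per-class false positive rate: FP / (FP + TN).
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    tn = sum(p != label and g != label for g, p in zip(gold, pred))
    return fp / (fp + tn) if fp + tn else 0.0

# Macro-averaging weights every language equally, so low-resource languages
# count as much as high-resource ones.
macro_f1 = sum(f1(l) for l in labels) / len(labels)
macro_fpr = sum(fpr(l) for l in labels) / len(labels)
print(round(macro_f1, 3), round(macro_fpr, 3))  # 0.656 0.167
```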
IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
India has a rich linguistic landscape with languages from 4 major language
families spoken by over a billion people. 22 of these languages, listed in
the Constitution of India and referred to as scheduled languages, are the focus of
this work. Given the linguistic diversity, high-quality and accessible Machine
Translation (MT) systems are essential in a country like India. Prior to this
work, there was (i) no parallel training data spanning all the 22 languages,
(ii) no robust benchmarks covering all these languages and containing content
relevant to India, and (iii) no existing translation models which support all
the 22 scheduled languages of India. In this work, we aim to address this gap
by focusing on the missing pieces required for enabling wide, easy, and open
access to good machine translation systems for all 22 scheduled Indian
languages. We identify four key areas of improvement: curating and creating
larger training datasets, creating diverse and high-quality benchmarks,
training multilingual models, and releasing models with open access. Our first
contribution is the release of the Bharat Parallel Corpus Collection (BPCC),
the largest publicly available parallel corpus for Indic languages. BPCC
contains a total of 230M bitext pairs, of which a total of 126M were newly
added, including 644K manually translated sentence pairs created as part of
this work. Our second contribution is the release of the first n-way parallel
benchmark covering all 22 Indian languages, featuring diverse domains,
Indian-origin content, and source-original test sets. Next, we present
IndicTrans2, the first model to support all 22 languages, surpassing existing
models on multiple existing and new benchmarks created as a part of this work.
Lastly, to promote accessibility and collaboration, we release our models and
associated data with permissive licenses at
https://github.com/ai4bharat/IndicTrans2
Generating High-Quality Emotion Arcs For Low-Resource Languages Using Emotion Lexicons
Automatically generated emotion arcs -- that capture how an individual or a
population feels over time -- are widely used in industry and research.
However, there is little work on evaluating the generated arcs in English
(where the emotion resources are available) and no work on generating or
evaluating emotion arcs for low-resource languages. Work on generating emotion
arcs in low-resource languages such as those indigenous to Africa, the
Americas, and Australia is stymied by the lack of emotion-labeled resources and
large language models for those languages. Work on evaluating emotion arcs (for
any language) is scarce because of the difficulty of establishing the true
(gold) emotion arc. Our work, for the first time, systematically and
quantitatively evaluates automatically generated emotion arcs. We also compare
two common ways of generating emotion arcs: Machine-Learning (ML) models and
Lexicon-Only (LexO) methods. By running experiments on 42 diverse datasets in 9
languages, we show that despite being markedly poor at instance-level emotion
classification, LexO methods are highly accurate at generating emotion arcs
when aggregating information from hundreds of instances. (Predicted arcs have
correlations ranging from 0.94 to 0.99 with the gold arcs for various
emotions.) We also show that for languages with no emotion lexicons, automatic
translations of English emotion lexicons can be used to generate high-quality
emotion arcs -- correlations above 0.9 with the gold emotion arcs in all six
indigenous African languages explored. This opens up avenues for work on
emotions in numerous languages from around the world; crucial not only for
commerce, public policy, and health research in service of speakers of those
languages, but also to draw meaningful conclusions in emotion-pertinent
research using information from around the world (thereby avoiding a
western-centric bias in research).
Comment: 32 pages, 16 figures. arXiv admin note: substantial text overlap with arXiv:2210.0738
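The Lexicon-Only (LexO) idea can be sketched in a few lines: score each instance by averaging the lexicon scores of its words, then aggregate the instance scores over a rolling window to form the arc. The lexicon entries, texts, and window size below are hypothetical toys; the paper's point is that the window aggregation washes out instance-level errors.

```python
# Hypothetical word-emotion lexicon (valence scores in [-1, 1]).
lexicon = {"happy": 0.9, "joy": 0.8, "sad": -0.7, "angry": -0.8}

def instance_score(text):
    # Average lexicon score of the words that appear in the lexicon.
    hits = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

def emotion_arc(instances, window=3):
    # Aggregate instance scores over a sliding window; larger windows
    # smooth out noisy per-instance predictions.
    scores = [instance_score(t) for t in instances]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

texts = ["so happy today", "joy everywhere", "happy joy",
         "feeling sad", "sad and angry", "angry again"]
arc = emotion_arc(texts)  # positive early, negative late
```

Evaluation against a gold arc would then be a correlation (e.g. Spearman) between the predicted and gold sequences.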
Quantifying the Dialect Gap and its Correlates Across Languages
Historically, researchers and consumers have noticed a decrease in quality
when applying NLP tools to minority variants of languages (e.g., Puerto Rican
Spanish or Swiss German), but studies exploring this have been limited to a
select few languages. Additionally, past studies have mainly been conducted in
a monolingual context, so cross-linguistic trends have not been identified and
tied to external factors. In this work, we conduct a comprehensive evaluation
of the most influential, state-of-the-art large language models (LLMs) across
two high-use applications, machine translation and automatic speech
recognition, to assess their functionality on the regional dialects of several
high- and low-resource languages. Additionally, we analyze how the regional
dialect gap is correlated with economic, social, and linguistic factors. The
impact of training data, including related factors like dataset size and its
construction procedure, is shown to be significant but not consistent across
models or languages, meaning a one-size-fits-all approach cannot be taken in
solving the dialect gap. This work will lay the foundation for furthering the
field of dialectal NLP by laying out evident disparities and identifying
possible pathways for addressing them through mindful data collection.
Comment: Accepted to EMNLP Findings 202
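The correlation analysis described above can be sketched as: compute the dialect gap per language as the score difference between the standard variety and its regional dialect, then correlate the gap with an external factor. All numbers below are hypothetical placeholders, not results from the paper.

```python
import numpy as np

# Hypothetical per-language scores (e.g. BLEU for MT): standard variety
# vs. a regional dialect of the same language.
standard = np.array([38.0, 31.0, 27.0, 22.0])
dialect  = np.array([35.5, 26.0, 20.5, 14.0])
gap = standard - dialect  # the "dialect gap" per language

# A hypothetical external correlate, e.g. log speaker population
# of the dialect in each language.
factor = np.array([8.1, 6.9, 6.2, 5.0])

# Pearson correlation between the gap and the external factor.
r = np.corrcoef(gap, factor)[0, 1]  # strongly negative in this toy data
```

In this toy setup the gap shrinks as the factor grows, giving a strong negative correlation; the paper's finding is that such correlations exist but are not consistent across models or languages.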
An Overview of Indian Spoken Language Recognition from Machine Learning Perspective
Automatic spoken language identification (LID) is a very important research field in the era of multilingual voice-command-based human-computer interaction (HCI). A front-end LID module helps to improve the performance of many speech-based applications in the multilingual scenario. India is a populous country with diverse cultures and languages. The majority of the Indian population needs to use their respective native languages for verbal interaction with machines. Therefore, the development of efficient Indian spoken language recognition systems is useful for adapting smart technologies in every section of Indian society. The field of Indian LID has started gaining momentum in the last two decades, mainly due to the development of several standard multilingual speech corpora for the Indian languages. Even though significant research progress has already been made in this field, to the best of our knowledge, there have not been many attempts to review it analytically and collectively. In this work, we present one of the very first comprehensive reviews of the Indian spoken language recognition research field. An in-depth analysis is presented to emphasize the unique challenges of low-resource and mutually influenced languages for developing LID systems in the Indian context. Several essential aspects of Indian LID research are discussed, such as detailed descriptions of the available speech corpora, the major research contributions (from earlier attempts based on statistical modeling to recent approaches based on different neural network architectures), and future research trends. This review will help any active researcher, or any research enthusiast from related fields, assess the state of present Indian LID research.