CODET: A Benchmark for Contrastive Dialectal Evaluation of Machine Translation
Neural machine translation (NMT) systems exhibit limited robustness in
handling source-side linguistic variations. Their performance tends to degrade
when faced with even slight deviations in language usage, such as different
domains or variations introduced by second-language speakers. It is intuitive
to extend this observation to encompass dialectal variations as well, but the
work allowing the community to evaluate MT systems on this dimension is
limited. To alleviate this issue, we compile and release CODET, a
contrastive dialectal benchmark encompassing 882 different variations from nine
different languages. We also quantitatively demonstrate the challenges large MT
models face in effectively translating dialectal variants. We are releasing all
code and data.
BIG-C: a Multimodal Multi-Purpose Dataset for Bemba
We present BIG-C (Bemba Image Grounded Conversations), a large multimodal
dataset for Bemba. Although Bemba is the most widely spoken language of Zambia,
it suffers from a dearth of resources, which renders the development of language
technologies or language processing research almost impossible. The dataset
comprises multi-turn dialogues between Bemba speakers based on images,
transcribed and translated into English. There are more than 92,000
utterances/sentences, amounting to more than 180 hours of audio data with
corresponding transcriptions and English translations. We also provide
baselines on speech recognition (ASR), machine translation (MT) and speech
translation (ST) tasks, and sketch out other potential future multimodal uses
of our dataset. We hope that by making the dataset available to the research
community, this work will foster research and encourage collaboration across
the language, speech, and vision communities, especially for languages outside
the "traditionally" used high-resourced ones. All data and code are publicly
available: https://github.com/csikasote/bigc.
Comment: accepted to ACL 202