Practical Comparable Data Collection for Low-Resource Languages via Images
We propose a method for curating high-quality comparable training data for
low-resource languages with monolingual annotators. Our method uses a
carefully selected set of images as a pivot between the source and target
languages, collecting captions for each image in both languages independently.
Human evaluations on the English-Hindi comparable corpora created with our
method show that 81.1% of the pairs are acceptable translations, and only 2.47%
of the pairs are not translations at all. We further establish the potential of
the dataset collected through our approach by experimenting on two downstream
tasks - machine translation and dictionary extraction. All code and data are
available at https://github.com/madaan/PML4DC-Comparable-Data-Collection.
Comment: Accepted for poster presentation at the Practical Machine Learning
for Developing Countries (PML4DC) workshop, ICLR 202

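The image-pivot pairing described above can be sketched in a few lines. This is a hypothetical illustration only: the function and variable names (`pair_by_pivot`, `image_id` keys) are assumptions for exposition, not taken from the paper's released code.

```python
# Hypothetical sketch of the image-pivot idea: captions collected
# independently in two languages are joined on a shared image ID to
# form comparable sentence pairs.

def pair_by_pivot(src_captions, tgt_captions):
    """Join two {image_id: caption} dicts into (src, tgt) sentence pairs."""
    shared = src_captions.keys() & tgt_captions.keys()
    return [(src_captions[i], tgt_captions[i]) for i in sorted(shared)]

en = {"img_001": "A dog runs on the beach.",
      "img_002": "Two children play chess."}
hi = {"img_001": "..."}  # placeholder for the Hindi caption of img_001

pairs = pair_by_pivot(en, hi)
# Only img_001 was captioned in both languages, so one pair is produced.
```

Because each side is captioned by monolingual annotators, the resulting pairs are comparable rather than strictly parallel, which is why the paper validates them with human evaluation.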
Don't Rule Out Monolingual Speakers: A Method For Crowdsourcing Machine Translation Data
High-performing machine translation (MT) systems can help overcome language
barriers while making it possible for everyone to communicate and use language
technologies in the language of their choice. However, such systems require
large amounts of parallel sentences for training, and translators can be
difficult to find and expensive to hire. Here, we present a data collection
strategy for MT which, in contrast, is cheap and simple, as it does not
require bilingual speakers. Based on the insight that humans pay particular
attention to movement, we use graphics interchange formats (GIFs) as a pivot to collect
parallel sentences from monolingual annotators. We use our strategy to collect
data in Hindi, Tamil and English. As a baseline, we also collect data using
images as a pivot. We perform an intrinsic evaluation by manually evaluating a
subset of the sentence pairs and an extrinsic evaluation by finetuning mBART on
the collected data. We find that sentences collected via GIFs are indeed of
higher quality.
Comment: 5 pages, 1 figure, ACL-IJCNLP 2021 submission, Natural Language
Processing, Data Collection, Monolingual Speakers, Machine Translation, GIFs,
Image
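The GIF-pivot strategy can likewise be sketched as a pairing step. A minimal sketch, assuming each GIF may receive several monolingual descriptions per language and that the cross-product of the two sides yields candidate sentence pairs; the names (`candidate_pairs`, `gif_id` keys) are illustrative, not from the paper.

```python
# Hypothetical sketch: descriptions of the same GIF, written independently
# by monolingual annotators in two languages, are crossed to produce
# candidate parallel sentences.
from itertools import product

def candidate_pairs(src, tgt):
    """src, tgt: {gif_id: [sentences]}. Return cross-language pairs per GIF."""
    pairs = []
    for gif_id in sorted(src.keys() & tgt.keys()):
        pairs.extend(product(src[gif_id], tgt[gif_id]))
    return pairs

en = {"gif_7": ["A cat jumps off a table.", "The cat leaps down."]}
ta = {"gif_7": ["..."]}  # placeholder for a Tamil description of gif_7

print(len(candidate_pairs(en, ta)))  # 2 candidate pairs for gif_7
```

Such candidate pairs would then feed the paper's two evaluations: manual judgment of a subset (intrinsic) and mBART finetuning (extrinsic).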