2 research outputs found

    Practical Comparable Data Collection for Low-Resource Languages via Images

    We propose a method of curating high-quality comparable training data for low-resource languages with monolingual annotators. Our method uses a carefully selected set of images as a pivot between the source and target languages: captions for these images are collected in both languages independently. Human evaluations on the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all. We further establish the potential of the dataset collected through our approach by experimenting on two downstream tasks: machine translation and dictionary extraction. All code and data are available at https://github.com/madaan/PML4DC-Comparable-Data-Collection.
    Comment: Accepted for poster presentation at the Practical Machine Learning for Developing Countries (PML4DC) workshop, ICLR 202
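
    As an illustration of the pivot idea, the following is a minimal sketch of how independently collected captions could be paired into a comparable corpus via their shared image. The function name, data layout, and example sentences are assumptions for illustration, not taken from the released code.

        from typing import Dict, List, Tuple

        def pair_captions_by_image(
            en_captions: Dict[str, List[str]],  # image_id -> English captions
            hi_captions: Dict[str, List[str]],  # image_id -> Hindi captions
        ) -> List[Tuple[str, str]]:
            """Pair captions that were written, independently, for the same pivot image."""
            pairs = []
            for image_id, en_list in en_captions.items():
                for en in en_list:
                    for hi in hi_captions.get(image_id, []):
                        pairs.append((en, hi))
            return pairs

        # Example: annotators captioned image "img_042" independently in each language.
        en = {"img_042": ["a dog runs across a field"]}
        hi = {"img_042": ["एक कुत्ता मैदान में दौड़ रहा है"]}
        print(pair_captions_by_image(en, hi))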

    Don't Rule Out Monolingual Speakers: A Method For Crowdsourcing Machine Translation Data

    High-performing machine translation (MT) systems can help overcome language barriers while making it possible for everyone to communicate and use language technologies in the language of their choice. However, such systems require large amounts of parallel sentences for training, and translators can be difficult to find and expensive to hire. Here, we present a data collection strategy for MT which, in contrast, is cheap and simple, as it does not require bilingual speakers. Based on the insight that humans pay particular attention to movement, we use graphics interchange formats (GIFs) as a pivot to collect parallel sentences from monolingual annotators. We use our strategy to collect data in Hindi, Tamil and English. As a baseline, we also collect data using images as a pivot. We perform an intrinsic evaluation by manually evaluating a subset of the sentence pairs and an extrinsic evaluation by finetuning mBART on the collected data. We find that sentences collected via GIFs are indeed of higher quality.
    Comment: 5 pages, 1 figure, ACL-IJCNLP 2021 submission, Natural Language Processing, Data Collection, Monolingual Speakers, Machine Translation, GIFs, Image
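
    A minimal sketch of the extrinsic evaluation step, assuming the Hugging Face transformers library; the checkpoint name, language codes, example pair, and hyperparameters are illustrative assumptions, not the authors' exact setup.

        import torch
        from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

        # Assumed checkpoint and language codes for English->Hindi; adjust per language pair.
        model_name = "facebook/mbart-large-50"
        tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX", tgt_lang="hi_IN")
        model = MBartForConditionalGeneration.from_pretrained(model_name)

        # Sentence pairs collected via the GIF (or image) pivot; this single pair is illustrative.
        pairs = [("a dog runs across a field", "एक कुत्ता मैदान में दौड़ रहा है")]

        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
        model.train()
        for src, tgt in pairs:
            batch = tokenizer(src, text_target=tgt, return_tensors="pt", truncation=True)
            loss = model(**batch).loss  # standard seq2seq cross-entropy on the target sentence
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()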