2 research outputs found

Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell

Authors: Essaidi Farah, Fethi Amal, Futeral Matthieu, Muller Benjamin, Ortiz Suárez Pedro Javier, Sagot Benoît, Seddah Djamé, Srivastava Abhishek
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2020
We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. Made of 1500 sentences, fully annotated in morpho-syntax and Universal Dependency syntax, with full translation at both the word and the sentence levels, this treebank is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and web-crawled data using intensive data-mining techniques. Preliminary experiments demonstrate its usefulness for POS tagging and dependency parsing. We believe that what we present in this paper is useful beyond the low-resource language community. This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code-switching, making it a challenging test-bed for most recent NLP approaches.

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

HAL-Rennes 1

Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation

Authors: Bawden Rachel, Futeral Matthieu, Laptev Ivan, Sagot Benoît, Schmid Cordelia
Publication venue: HAL CCSD
Publication date: 07/02/2023
One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as an image. However, recent work in multimodal MT (MMT) has shown that obtaining improvements from images is challenging, limited not only by the difficulty of building effective cross-modal representations but also by the lack of specific evaluation and training data. We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism and which is jointly trained on both visual masking and MMT. We also release CoMMuTE, a Contrastive Multilingual Multimodal Translation Evaluation dataset, composed of ambiguous sentences and their possible translations, accompanied by disambiguating images corresponding to each translation. Our approach obtains competitive results over strong text-only models on standard English-to-French benchmarks and outperforms these baselines and state-of-the-art MMT systems by a large margin on our contrastive test set.

INRIA a CCSD electronic archive server