Multi-modal post-editing of machine translation
As machine translation (MT) quality continues to improve, more and more translators switch from traditional translation from scratch to post-editing (PE) of MT output, which has been shown to save time and reduce errors. Instead of mainly generating text, translators are now asked to correct errors within otherwise helpful translation proposals, where repetitive MT errors make the process tiresome, while hard-to-spot errors make PE a cognitively demanding activity. Our contribution is three-fold: first, we explore whether interaction modalities other than mouse and keyboard can effectively support PE by creating and testing the MMPE translation environment. MMPE allows translators to cross out or hand-write text, drag and drop words for reordering, use spoken commands or hand gestures to manipulate text, or combine any of these input modalities. Second, our interviews revealed that translators see value in automatically receiving additional translation support when high cognitive load (CL) is detected during PE. We therefore developed a sensor framework that uses a wide range of physiological and behavioral data to estimate perceived CL and tested it in three studies, showing that multimodal combinations of eye, heart, and skin measures can be used to make translation environments cognition-aware. Third, we present two multi-encoder Transformer architectures for automatic post-editing (APE) and discuss how they can adapt MT output to a domain and thereby avoid correcting repetitive MT errors.
Deutsche Forschungsgemeinschaft (DFG), Project MMP
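The sensor-based CL estimation described in the abstract can be sketched, in heavily simplified form, as a weighted fusion of normalized physiological features. The feature names, weights, and threshold below are illustrative stand-ins, not the framework's actual model:

```python
# Minimal sketch: fusing normalized physiological features into a single
# cognitive-load (CL) score. Weights and threshold are invented for
# illustration, not taken from the thesis.

def estimate_cognitive_load(pupil_dilation, heart_rate, skin_conductance,
                            weights=(0.5, 0.3, 0.2), threshold=0.6):
    """Return (score, high_cl_flag) from features normalized to [0, 1]."""
    features = (pupil_dilation, heart_rate, skin_conductance)
    score = sum(w * f for w, f in zip(weights, features))
    return score, score >= threshold

score, high_cl = estimate_cognitive_load(0.9, 0.7, 0.5)
# A cognition-aware editor could offer extra support when high_cl is True.
```

In the actual framework, such features would come from eye trackers, heart-rate sensors, and skin-conductance sensors, and the fusion would be learned rather than hand-weighted.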
Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution
Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
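As a rough illustration of how dependency-based contexts differ from the linear window contexts of standard WORD2VEC (in the spirit of dependency-based word embeddings), the sketch below turns dependency triples into (word, labelled-context) pairs that a skip-gram model could train on. The triple format, relation labels, and the hand-written parse are assumptions for illustration:

```python
# Sketch of dependency-based context extraction: instead of linear window
# contexts, each word's contexts are its syntactic neighbours labelled with
# the dependency relation. The parse below is hand-written for illustration.

def dependency_contexts(deps):
    """deps: list of (head, relation, modifier) triples."""
    pairs = []
    for head, rel, mod in deps:
        pairs.append((head, f"{rel}_{mod}"))    # head sees its modifier
        pairs.append((mod, f"{rel}I_{head}"))   # inverse-direction context
    return pairs

parse = [("discovers", "nsubj", "scientist"), ("discovers", "dobj", "star")]
pairs = dependency_contexts(parse)
```

Training skip-gram on such pairs yields embeddings that capture syntactic rather than purely topical similarity, which is the complementarity the paper exploits.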
Visual context for verb sense disambiguation and multilingual representation learning
Every day billions of images are uploaded to the web. To process images at such a large
scale it is important to build automatic image understanding systems. An important
step towards understanding the content of the images is to be able to understand all the
objects, scenes and actions depicted in the image. These systems should be capable of
integrating with natural language or text to be able to query and interact with humans
for tasks such as image retrieval.
Verbs play a key role in the understanding of sentences and scenes. Verbs express
the semantics of an action as well as the interactions between objects participating in
an event. Thus understanding verbs is central to both language and image understanding.
However, verbs are known for their variability in meaning with context. Many studies in
psychology have shown that contextual information plays an important role in semantic
understanding and processing in the human visual system. We use this intuition
to investigate the role of textual and visual context in tasks that combine language and
vision.
Our research presented in this thesis focuses on the problems of integrating visual
and textual contexts for: (i) automatically identifying verbs that denote actions depicted
in the images; (ii) fine-grained analysis of how visual context can help disambiguate
different meanings of verbs in a language or across languages; (iii) the role played by
the visual and multilingual context in learning representations that allow us to query
information across modalities and languages.
First, we propose the task of visual sense disambiguation, an alternative way of
addressing the action recognition task. Instead of identifying the actions directly, we
develop a two-step process: identifying the verb that denotes the action being depicted
in an image and then disambiguating the meaning of the verb based on the visual and
textual context associated with the image. We first build an image-verb classifier based
on the weak signal from image description data and analyse the specific regions the
model focuses on while predicting the verb. We then disambiguate the meaning of
the verb shown in the image using image features and sense-inventories. We test the
hypothesis that visual and textual context associated with the image contribute to the
disambiguation task.
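The two-step process above can be sketched with toy components; the verb scores (standing in for an image classifier) and the tiny sense inventory below are invented for illustration:

```python
# Sketch of the two-step pipeline: (1) a verb classifier over image features,
# (2) sense disambiguation by scoring each sense's gloss against the textual
# context. Scores and the sense inventory are illustrative.

def predict_verb(image_features, verb_scores):
    """Step 1: pick the most probable verb (scores stand in for a CNN)."""
    return max(verb_scores, key=verb_scores.get)

def disambiguate(verb, context_words, sense_inventory):
    """Step 2: pick the sense whose gloss overlaps the context most."""
    senses = sense_inventory[verb]
    def overlap(sense):
        return len(set(senses[sense].split()) & set(context_words))
    return max(senses, key=overlap)

inventory = {"play": {"play.v.01": "engage in a game or sport",
                      "play.v.03": "perform music on an instrument"}}
verb = predict_verb(None, {"play": 0.8, "ride": 0.2})
sense = disambiguate(verb, ["guitar", "music", "stage"], inventory)
```

In the thesis the context words would come from the image description and the sense inventory from a resource such as WordNet, with image features contributing to the sense scores as well.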
Second, we ask whether the predictions made by such models correspond to human
intuitions about visual verbs or actions. We analyse whether the image regions a verb
prediction model identifies as salient for a given verb correlate with the regions fixated
by human observers performing an action classification task. We also compare the
correlation of human fixations against visual saliency and center bias models.
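Comparing model-salient regions with human fixations typically reduces to correlating two heat maps. A minimal sketch using Pearson correlation over flattened maps, with the maps themselves invented for illustration:

```python
# Sketch: comparing a model's saliency map against a human fixation map with
# Pearson correlation, a standard saliency evaluation metric. The tiny
# flattened 2x2 maps are hand-made examples.

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

model_saliency = [0.1, 0.9, 0.8, 0.2]
human_fixation = [0.2, 0.8, 0.7, 0.1]
r = pearson(model_saliency, human_fixation)  # close to 1: strong agreement
```

The same machinery applies when the baseline is a visual saliency model or a center-bias map instead of the verb prediction model.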
Third, we propose the crosslingual verb disambiguation task: identifying the correct
translation of the verb in a target language based on visual context. This task has the
potential to resolve lexical ambiguity in machine translation when the visual context
is available. We propose a series of models and show that multimodal models that
fuse textual information with visual features have an edge over text or visual only
models. We then demonstrate how visual sense disambiguation can be combined with
lexical constraint decoding to improve the performance of a standard unimodal machine
translation system on image descriptions.
Finally, we move on to learn joint representations for images and text in multiple
languages. We test the hypothesis that context provided as visual information or text
in another language contributes to better representation learning. We propose models to
map text from multiple languages and images into a common space and evaluate
the usefulness of the second language in multimodal search and of images
in crosslingual search. Our experiments suggest that exploiting multilingual and
multimodal resources can help in learning better semantic representations that are useful
for various multimodal natural language understanding tasks.
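Once images and sentences in several languages are mapped into one space, cross-modal and cross-lingual search reduces to nearest-neighbour retrieval under a similarity measure. A minimal cosine-similarity sketch, with invented vectors standing in for learned encoders:

```python
# Sketch of retrieval in a shared embedding space: image and sentence
# vectors (hand-made stand-ins for learned encoders) are compared with
# cosine similarity. The same machinery works across languages once all
# encoders map into the same space.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def retrieve(query_vec, candidates):
    """Return the key of the most similar candidate vector."""
    return max(candidates, key=lambda k: cosine(query_vec, candidates[k]))

images = {"dog.jpg": [0.9, 0.1, 0.0], "car.jpg": [0.0, 0.2, 0.9]}
caption_vec = [0.8, 0.2, 0.1]  # e.g. an encoding of "a dog running"
best = retrieve(caption_vec, images)
```

In practice the encoders are trained jointly (e.g. with a ranking loss) so that matching image-caption pairs across languages land close together in the space.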
Our experiments on visual sense disambiguation, sense disambiguation across languages,
and multimodal and crosslingual search demonstrate that visual context, alone or
combined with textual context, is useful for enhancing multimodal and crosslingual
applications.