Translationese and post-editese: how comparable is comparable quality?
Whereas post-edited texts have been shown to be either of comparable quality to human translations or better, one study shows that people still seem to prefer human-translated texts. The idea of texts being inherently different despite being of high quality is not new. Translated texts, for example, are also different from original texts, a phenomenon referred to as 'Translationese'. Research into Translationese has shown that, whereas humans cannot distinguish between translated and original text, computers have been trained to detect Translationese successfully. It remains to be seen whether the same can be done for what we call Post-editese. We first establish whether humans are capable of distinguishing post-edited texts from human translations, and then establish whether it is possible to build a supervised machine-learning model that can distinguish between translated and post-edited text.
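The abstract does not specify which features or model the supervised classifier uses. As a minimal illustration of the setup, the sketch below trains a bag-of-words perceptron to separate the two text classes; the labels (+1 for post-edited, -1 for translated), the tokenization, and the toy sentences are all illustrative assumptions, not the paper's method.

```python
from collections import Counter

def bow(text):
    # Bag-of-words features: lowercased whitespace tokens (an assumed feature set).
    return Counter(text.lower().split())

def train_perceptron(examples, epochs=10):
    # examples: list of (text, label) pairs, label +1 (post-edited) or -1 (translated).
    w = Counter()  # one weight per token
    for _ in range(epochs):
        for text, label in examples:
            feats = bow(text)
            score = sum(w[t] * c for t, c in feats.items())
            if score * label <= 0:  # misclassified: nudge weights toward the label
                for t, c in feats.items():
                    w[t] += label * c
    return w

def predict(w, text):
    # Sign of the weighted feature sum decides the class.
    score = sum(w[t] * c for t, c in bow(text).items())
    return 1 if score > 0 else -1
```

A real study would of course evaluate on held-out data rather than the training sentences; this only shows the shape of the supervised pipeline.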
Detecting machine-translated subtitles in large parallel corpora
Parallel corpora extracted from online repositories of movie and TV subtitles are employed in a wide range of NLP applications, from language modelling to machine translation and dialogue systems. However, the subtitles uploaded to such repositories exhibit varying levels of quality. A particularly difficult problem stems from the fact that a substantial number of these subtitles are not written by human subtitlers but are simply generated through the use of online translation engines. This paper investigates whether these machine-generated subtitles can be detected automatically using a combination of linguistic and extra-linguistic features. We show that a feedforward neural network trained on a small dataset of subtitles can detect machine-generated subtitles with an F1-score of 0.64. Furthermore, applying this detection model to an unlabelled sample of subtitles allows us to provide a statistical estimate for the proportion of subtitles that are machine-translated (or are at least of very low quality) in the full corpus.
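The reported F1-score of 0.64 is the harmonic mean of precision and recall on the machine-generated class. For readers unfamiliar with the metric, a minimal sketch of how it is computed from gold and predicted labels:

```python
def f1_score(y_true, y_pred, positive=1):
    # Precision: fraction of predicted positives that are correct.
    # Recall: fraction of actual positives that were found.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a detector that finds half of the machine-generated subtitles with half of its flags correct scores F1 = 0.5.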
MultiVENT: Multilingual Videos of Events with Aligned Natural Text
Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.
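The abstract does not detail the retrieval baseline, which in MultiVENT's setting is multimodal. As a text-side illustration of grounding videos in documents, the sketch below ranks documents against a query with TF-IDF weights and cosine similarity; the tokenized toy documents and the purely lexical matching are assumptions for illustration only.

```python
import math
from collections import Counter

def retrieve(query_tokens, docs):
    # docs: list of token lists (e.g. text descriptions grounding each video).
    # Returns document indices ranked by cosine similarity to the query.
    n = len(docs)
    df = Counter()  # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] += 1

    def tfidf(tokens):
        tf = Counter(tokens)
        return {t: c * math.log(n / df[t]) for t, c in tf.items() if t in df}

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    qvec = tfidf(query_tokens)
    scores = [(cosine(qvec, tfidf(doc)), i) for i, doc in enumerate(docs)]
    return [i for s, i in sorted(scores, reverse=True)]
```

A multilingual baseline would replace the lexical vectors with cross-lingual embeddings, but the ranking machinery stays the same.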