On the Similarities Between Native, Non-native and Translated Texts
We present a computational analysis of three language varieties: native,
advanced non-native, and translation. Our goal is to investigate the
similarities and differences between non-native language productions and
translations, contrasting both with native language. Using a collection of
computational methods we establish three main results: (1) the three types of
texts are easily distinguishable; (2) non-native language and translations are
closer to each other than each of them is to native language; and (3) some of
these characteristics depend on the source or native language, while others do
not, reflecting, perhaps, unified principles that similarly affect translations
and non-native language. Comment: ACL 2016, 12 pages
Translationese and post-editese: how comparable is comparable quality?
Whereas post-edited texts have been shown to be either of comparable quality to human translations or better, one study shows that people still seem to prefer human-translated texts. The idea of texts being inherently different despite being of high quality is not new. Translated texts, for example, are also different from original texts, a phenomenon referred to as ‘Translationese’. Research into Translationese has shown that, whereas humans cannot distinguish between translated and original text, computers have been trained to detect Translationese successfully. It remains to be seen whether the same can be done for what we call Post-editese. We first establish whether humans are capable of distinguishing post-edited texts from human translations, and then establish whether it is possible to build a supervised machine-learning model that can distinguish between translated and post-edited text.
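The abstract does not describe the model itself; as a minimal sketch of such a supervised setup, the toy unigram Naive Bayes classifier below separates two text classes by word frequencies. All data, labels, and the choice of Naive Bayes here are hypothetical illustrations, not taken from the study:

```python
import math
from collections import Counter

def train_nb(docs):
    """Tiny unigram Naive Bayes: count word frequencies per class."""
    counts = {lbl: Counter() for lbl in docs}
    for lbl, texts in docs.items():
        for t in texts:
            counts[lbl].update(t.lower().split())
    vocab = set().union(*counts.values())
    return counts, vocab

def classify(text, counts, vocab):
    """Pick the class with the highest add-one-smoothed log-likelihood."""
    best, best_lp = None, -math.inf
    for lbl, c in counts.items():
        total = sum(c.values())
        lp = sum(math.log((c[w] + 1) / (total + len(vocab)))
                 for w in text.lower().split())
        if lp > best_lp:
            best, best_lp = lbl, lp
    return best

# Hypothetical toy data: post-edited output staying close to a literal
# rendering, human translation varying more freely.
docs = {"HT": ["the scheme collapsed amid fierce opposition",
               "officials scrambled to contain the fallout"],
        "PE": ["the plan failed because of strong opposition",
               "officials tried to limit the consequences"]}
counts, vocab = train_nb(docs)
print(classify("the plan failed", counts, vocab))  # → PE
```

A real experiment would of course use held-out data and far richer features; this only illustrates the supervised-classification framing.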
Native Language Identification with Big Bird Embeddings
Native Language Identification (NLI) aims to classify an author's native
language based on their writing in another language. Historically, the task has
relied heavily on time-consuming linguistic feature engineering, and
transformer-based NLI models have thus far failed to offer effective, practical
alternatives. This work investigates whether input size is a limiting factor,
and shows that classifiers trained using Big Bird embeddings outperform
linguistic feature engineering models by a large margin on the Reddit-L2
dataset. Additionally, we provide further insight into input length
dependencies, show consistent out-of-sample performance, and qualitatively
analyze the embedding space. Given the effectiveness and computational
efficiency of this method, we believe it offers a promising avenue for future
NLI work.
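The downstream half of this pipeline (a classifier trained on fixed document embeddings) can be sketched as follows. Here class-dependent random vectors stand in for the actual Big Bird embeddings, and plain logistic regression stands in for the classifier; both substitutions are assumptions for illustration:

```python
import math
import random

random.seed(0)
DIM = 8  # stand-in for Big Bird's much larger embedding dimension

def fake_embedding(l1):
    """Placeholder: the paper encodes each Reddit-L2 document with
    Big Bird; here we draw class-dependent random vectors instead."""
    base = 1.0 if l1 == "German" else -1.0
    return [base + random.gauss(0, 0.5) for _ in range(DIM)]

def train_logreg(X, y, epochs=200, lr=0.1):
    """Plain logistic regression over the fixed embedding vectors."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-max(-30.0, min(30.0, z))))
            g = p - t  # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return "German" if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else "French"

labels = ["German"] * 20 + ["French"] * 20  # hypothetical L1 labels
X = [fake_embedding(l1) for l1 in labels]
y = [1 if l1 == "German" else 0 for l1 in labels]
w, b = train_logreg(X, y)
acc = sum(predict(w, b, x) == l1 for x, l1 in zip(X, labels)) / len(X)
print(f"training accuracy: {acc:.2f}")
```

The point of the paper is that once the embeddings capture enough of the document, even a simple classifier on top suffices; the heavy lifting happens in the encoder, which this sketch deliberately fakes.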
Identifying Computer-Translated Paragraphs using Coherence Features
We have developed a method for extracting coherence features from a
paragraph by matching similar words across its sentences. We conducted an
experiment with a parallel German corpus containing 2000 human-created and 2000
machine-translated paragraphs. The results showed that our method achieved the
best performance (accuracy = 72.3%, equal error rate = 29.8%) when compared
with previous methods on various kinds of computer-generated text, including
translation and paper generation (best accuracy = 67.9%, equal error rate =
32.0%). Experiments on Dutch, another resource-rich language, and on a
low-resource one (Japanese) achieved similar performance, demonstrating the
effectiveness of the coherence features in distinguishing computer-translated
from human-created paragraphs across diverse languages. Comment: 9 pages, PACLIC 201
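The idea of matching similar words across a paragraph's sentences can be illustrated with a minimal sketch that scores coherence by exact content-word overlap between adjacent sentences. The actual method presumably uses richer word-similarity matching, and the example paragraphs are invented:

```python
import re

def coherence_score(paragraph):
    """Average word overlap between adjacent sentences: a crude
    stand-in for the paper's similar-word matching."""
    sentences = [set(re.findall(r"[a-z]+", s.lower()))
                 for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    if len(sentences) < 2:
        return 0.0
    overlaps = [len(a & b) / min(len(a), len(b))
                for a, b in zip(sentences, sentences[1:])]
    return sum(overlaps) / len(overlaps)

# Invented examples: a topically connected paragraph vs. disjoint sentences.
coherent = ("The river floods every spring. The flooding river damages "
            "the nearby farms. Farms along the river lose their crops.")
disjoint = ("The river floods every spring. Quantum computers use qubits. "
            "My cat prefers tuna.")
print(coherence_score(coherent) > coherence_score(disjoint))  # → True
```

The intuition the paper exploits is that machine-translated paragraphs tend to lose this cross-sentence lexical cohesion, so a low score is evidence of machine output.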
On the differences between human translations
Many studies have confirmed that translated texts exhibit different features from texts originally written in the given language. This work explores texts translated by different translators, taking into account their expertise and native language.
A set of computational analyses was conducted on three language pairs, English-Croatian, German-French and English-Finnish. The results show that each of the factors has some influence on the features of the translated texts, especially on sentence length and lexical richness.
The results also indicate that for translations used in machine translation evaluation, it is important to specify these factors, especially when comparing machine translation quality with human translation quality.
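The two features highlighted here, sentence length and lexical richness, are straightforward to compute. A minimal sketch, with two invented translations standing in for translator variants (the texts and the type-token-ratio choice of richness measure are illustrative assumptions):

```python
import re

def profile(text):
    """Mean sentence length (in tokens) and lexical richness
    (type-token ratio) for one translation."""
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    toks = re.findall(r"\w+", text.lower())
    return {"avg_sentence_len": round(len(toks) / len(sents), 2),
            "type_token_ratio": round(len(set(toks)) / len(toks), 2)}

# Hypothetical translations of one source by two different translators.
expert = ("Negotiations stalled because neither delegation would yield "
          "on tariffs, leaving mediators with little room to manoeuvre.")
novice = ("The talks stopped. The two sides did not agree on tariffs. "
          "The mediators could not do much.")
print(profile(expert))
print(profile(novice))
```

Comparing such profiles per translator is how differences in expertise or native language would show up in these two features.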
Automatic classification of human translation and machine translation: a study from the perspective of lexical diversity
By using a trigram model and fine-tuning a pretrained BERT model for sequence classification, we show that machine translation and human translation can be classified with an accuracy above chance level, which suggests that machine translation and human translation differ in a systematic way. The classification accuracy for machine translation is much higher than that for human translation. We show that this may be explained by the difference in lexical diversity between machine translation and human translation. If machine translation has patterns independent of human translation, automatic metrics that measure the deviation of machine translation from human translation may conflate difference with quality. Our experiment with two different types of automatic metrics shows a correlation with the results of the classification task. We therefore suggest that the difference in lexical diversity between machine translation and human translation be given more attention in machine translation evaluation.
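The abstract does not name its lexical-diversity measure; one standard, length-robust choice is the moving-average type-token ratio (MATTR), sketched below on invented sentences in which the machine-style output reuses the same rendering. Both the measure and the toy data are assumptions for illustration:

```python
def mattr(tokens, window=5):
    """Moving-average type-token ratio: mean TTR over a sliding
    window, which is less sensitive to text length than plain TTR."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

# Hypothetical outputs: MT tends to repeat the same rendering,
# while HT varies its wording.
mt = "the man saw the man and the man saw the dog".split()
ht = "the man noticed him and his dog then walked away".split()
print(round(mattr(mt), 2), round(mattr(ht), 2))
```

A systematically lower diversity score for machine output is exactly the kind of signal that lets a classifier separate the two, independent of translation quality.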