Coherence in Machine Translation
Coherence ensures that individual sentences work together to form a meaningful document. When properly translated, a coherent document in one language should result in a coherent document in the other. In Machine Translation, however, for reasons of modeling and computational complexity, sentences are pieced together from words or phrases based on short context windows, with no access to extra-sentential context.
In this thesis I propose ways to automatically assess the coherence of machine translation output. The work is structured around three dimensions: entity-based coherence, coherence as evidenced via syntactic patterns, and coherence as
evidenced via discourse relations.
For the first time, I evaluate existing monolingual coherence models on this new task, identifying issues and challenges that are specific to the machine translation setting. To address these issues, I adapt a state-of-the-art syntax model, which also results in improved performance on the monolingual task. The results clearly indicate how much more difficult the new task is than the task of detecting shuffled texts. I propose a new coherence model that explores the crosslingual transfer of discourse relations in machine translation. This model is novel in that it measures the correctness of a discourse relation by comparison to the source text rather than to a reference translation. I identify patterns of incoherence common across different language pairs, and create a corpus of machine-translated output annotated with coherence errors for evaluation purposes. I then examine lexical coherence in a multilingual context, as a preliminary study for crosslingual transfer. Finally, I determine how the new and adapted models correlate with human judgements of translation quality, and suggest that general evaluation in machine translation would benefit from a coherence component that evaluates the translation output with respect to the source text.
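The entity-based dimension above is usually operationalised with an entity grid, which tracks where entities recur across adjacent sentences. A minimal sketch of that idea, assuming entities have already been extracted (the document and entity sets below are hypothetical, and mention roles are reduced to present/absent):

```python
from collections import Counter

def entity_grid(sents_entities):
    """Binary entity grid: one row per sentence, one column per entity;
    'X' marks a mention, '-' its absence."""
    entities = sorted({e for sent in sents_entities for e in sent})
    return [["X" if e in sent else "-" for e in entities]
            for sent in sents_entities]

def transition_probs(grid):
    """Distribution of column transitions (X->X, X->-, ...) between
    adjacent sentences; coherent texts favour X->X continuity."""
    counts = Counter()
    for row_a, row_b in zip(grid, grid[1:]):
        for a, b in zip(row_a, row_b):
            counts[(a, b)] += 1
    total = sum(counts.values()) or 1
    return {t: c / total for t, c in counts.items()}

# Hypothetical three-sentence document with pre-extracted entities.
doc = [{"Mary", "book"}, {"Mary"}, {"Mary", "library"}]
probs = transition_probs(entity_grid(doc))
```

A shuffled text scrambles these transitions, which is why detecting shuffles is easier than scoring real machine translation output.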
Discourse Structure in Machine Translation Evaluation
In this article, we explore the potential of using sentence-level discourse
structure for machine translation evaluation. We first design discourse-aware
similarity measures, which use all-subtree kernels to compare discourse parse
trees in accordance with the Rhetorical Structure Theory (RST). Then, we show
that a simple linear combination with these measures can help improve various
existing machine translation evaluation metrics regarding correlation with
human judgments both at the segment- and at the system-level. This suggests
that discourse information is complementary to the information used by many of
the existing evaluation metrics, and thus it could be taken into account when
developing richer evaluation metrics, such as the WMT-14 winning combined
metric DiscoTKparty. We also provide a detailed analysis of the relevance of
various discourse elements and relations from the RST parse trees for machine
translation evaluation. In particular we show that: (i) all aspects of the RST
tree are relevant, (ii) nuclearity is more useful than relation type, and (iii)
the similarity of the translation RST tree to the reference tree is positively
correlated with translation quality.
Comment: machine translation, machine translation evaluation, discourse analysis. Computational Linguistics, 201
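The all-subtree comparison of RST parses described above can be illustrated with a simplified kernel that counts the complete subtrees two trees share (the full Collins-and-Duffy all-subtrees recursion also counts partial fragments; the relation labels and tree shapes below are toy examples):

```python
from collections import Counter

def subtrees(tree, acc=None):
    """Collect every complete subtree of a (label, child, ...) tuple tree,
    keyed by a canonical string form."""
    if acc is None:
        acc = Counter()
    acc[repr(tree)] += 1
    if isinstance(tree, tuple):
        for child in tree[1:]:
            subtrees(child, acc)
    return acc

def tree_kernel(t1, t2):
    """Similarity = number of complete subtrees shared by the two trees
    (a simplification of the all-subtree kernel over RST parses)."""
    a, b = subtrees(t1), subtrees(t2)
    return sum(min(a[s], b[s]) for s in a.keys() & b.keys())

# Toy RST-like parses of a translation hypothesis and a reference.
hyp = ("ELABORATION", ("NUCLEUS", "seg1"), ("SATELLITE", "seg2"))
ref = ("CONTRAST",    ("NUCLEUS", "seg1"), ("SATELLITE", "seg3"))
score = tree_kernel(hyp, ref)  # shares the nucleus subtree and its leaf
```

A linear combination would then mix such kernel scores with existing metric scores, as the article describes.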
A Quantitative Analysis of Discourse Phenomena in Machine Translation
State-of-the-art Machine Translation (MT) systems translate documents by considering isolated sentences, disregarding information beyond sentence level. As a result, machine-translated documents often contain problems related to discourse coherence and cohesion. Recently, some initiatives in the evaluation and quality estimation of MT outputs have attempted to detect discourse problems in order to assess the quality of these machine translations. However, a quantitative analysis of discourse phenomena in MT outputs is still needed to better understand the phenomena and identify possible solutions or ways to improve evaluation. This paper aims to answer the following questions: What is the impact of discourse phenomena on MT quality? Can we capture and measure quantitatively any issues related to discourse in MT outputs? To answer these questions, we present a quantitative analysis of several discourse phenomena and correlate the resulting figures with scores from automatic translation quality evaluation metrics. We show that figures related to discourse phenomena present a higher correlation with quality scores than the baseline counts widely used for quality estimation of MT.
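Correlating discourse-phenomena figures with automatic quality scores, as the analysis above does, amounts to computing a correlation coefficient over per-document numbers. A self-contained sketch using Pearson's r (the counts and scores below are invented for illustration only):

```python
def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented per-document figures: counts of correctly transferred
# discourse connectives vs. an automatic quality score.
connective_matches = [3, 5, 2, 8, 6]
quality_scores = [0.31, 0.52, 0.24, 0.81, 0.60]
r = pearson(connective_matches, quality_scores)
```

A discourse figure that correlates more strongly with quality scores than baseline counts is a candidate feature for quality estimation.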
Discourse Cohesion in Chinese-English Statistical Machine Translation
In discourse, cohesion is a required component of meaningful and well organised text.
It establishes the relationship between different elements in the text using a number of
devices such as pronouns, determiners, and conjunctions.
In translation a well translated document will display the correct cohesion and use of
cohesive devices that are pertinent to the language. However, not all languages have the
same cohesive devices or use them in the same way. In statistical machine translation
this is a particular barrier to generating smooth translations, especially when sentences in
parallel corpora are being treated in isolation and no extra meaning or cohesive context is
provided beyond the sentential level.
In this thesis, focussing on Chinese and English as the language pair, we examine
discourse cohesion in statistical machine translation, looking at ways that systems can leverage discourse cues and signals in order to produce smoother translations. We also present a statistical model that improves translation output by inserting additional tokens into the text, which can be used to carry extra information.
A significant part of this research involved visualising many of the results and system outputs, and so an overview of two important pieces of visualisation software that we
developed is also included.
Detecting translingual plagiarism and the backlash against translation plagiarists
Plagiarism detection methods have improved significantly over the last decades, and as a result of the advanced research conducted by computational and, above all, forensic linguists, simple and sophisticated textual borrowing strategies can now be identified more easily. In particular, simple text comparison algorithms developed by computational linguists allow literal, word-for-word plagiarism (i.e. where identical strings of text are reused across different documents) to be easily detected (semi-)automatically (e.g. Turnitin or SafeAssign), although these methods tend to perform less well when the borrowing is obfuscated by introducing edits to the original text. In this case, more sophisticated linguistic techniques, such as an analysis of lexical overlap (Johnson, 1997), are required to detect the borrowing. However, these have limited applicability in cases of translingual plagiarism, where a text is translated and borrowed without acknowledgment from an original in another language. Considering that (a) non-professional translation (e.g. literal or free machine translation) is typically the method used to plagiarise; (b) the plagiarist usually edits the text for grammar and syntax, especially when machine-translated; and (c) lexical items are those that tend to be translated more correctly, and carried over to the derivative text, this paper proposes a method for translingual plagiarism detection that is grounded on translation and interlanguage theories (Selinker, 1972; Bassnett and Lefevere, 1998), as well as on the principle of linguistic uniqueness (Coulthard, 2004). Empirical evidence from the CorRUPT corpus (Corpus of Reused and Plagiarised Texts), a corpus of real academic and non-academic texts that were investigated and accused of plagiarising originals in other languages, is used to illustrate the applicability of the methodology proposed for translingual plagiarism detection. Finally, applications of the method as an investigative tool in forensic contexts are discussed.
Thematic Annotation: extracting concepts out of documents
Contrary to standard approaches to topic annotation, the technique used in
this work does not centrally rely on some sort of -- possibly statistical --
keyword extraction. In fact, the proposed annotation algorithm uses a large
scale semantic database -- the EDR Electronic Dictionary -- that provides a
concept hierarchy based on hyponym and hypernym relations. This concept
hierarchy is used to generate a synthetic representation of the document by
aggregating the words present in topically homogeneous document segments into a
set of concepts best preserving the document's content.
This new extraction technique uses an unexplored approach to topic selection.
Instead of using semantic similarity measures based on a semantic resource, the
latter is processed to extract the part of the conceptual hierarchy relevant to
the document content. Then this conceptual hierarchy is searched to extract the
most relevant set of concepts to represent the topics discussed in the
document. Notice that this algorithm is able to extract generic concepts that
are not directly present in the document.
Comment: Technical report EPFL/LIA. 81 pages, 16 figures
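The aggregation step can be pictured as climbing hypernym chains until a single concept covers all the words in a segment, which is how a generic concept absent from the document itself can emerge. A toy sketch (the miniature hierarchy below is hypothetical and merely stands in for the EDR dictionary):

```python
# Toy hypernym chains standing in for the EDR concept hierarchy
# (hypothetical data; the real dictionary links vastly more concepts
# through hyponym/hypernym relations).
HYPERNYMS = {
    "cat":   ["feline", "mammal", "animal", "entity"],
    "dog":   ["canine", "mammal", "animal", "entity"],
    "robin": ["bird", "animal", "entity"],
}

def aggregate_concept(words):
    """Return the most specific concept covering every word: a generic
    concept is found even if no word in the segment names it directly."""
    chains = [HYPERNYMS[w] for w in words]
    common = set(chains[0]).intersection(*chains[1:])
    # Most specific = the concept appearing earliest in a chain.
    return min(common, key=chains[0].index)

concept = aggregate_concept(["cat", "dog", "robin"])  # -> "animal"
```

Selecting the best set of such concepts per topical segment then yields the synthetic document representation the report describes.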
Proceedings of the 17th Annual Conference of the European Association for Machine Translation
Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT)
New perspectives on cohesion and coherence: Implications for translation
The contributions to this volume investigate relations of cohesion and coherence as well as instantiations of discourse phenomena and their interaction with information structure in multilingual contexts. Some contributions concentrate on procedures for analyzing cohesion and coherence from a corpus-linguistic perspective. Others have a particular focus on textual cohesion in parallel corpora that include both originals and translated texts. Additionally, the papers in the volume discuss the nature of cohesion and coherence with implications for human and machine translation. The contributors are experts on discourse phenomena and textuality who address these issues from an empirical perspective. The chapters in this volume are grounded in the latest research, making this book useful both to experts in discourse studies and computational linguistics and to advanced students with an interest in these disciplines. We hope that this volume will serve as a catalyst for other researchers and will facilitate further advances in the development of cost-effective annotation procedures, the application of statistical techniques to the analysis of linguistic phenomena, and the elaboration of new methods for data interpretation in multilingual corpus linguistics and machine translation.