    Coherence in Machine Translation

    Coherence ensures that individual sentences work together to form a meaningful document. When properly translated, a coherent document in one language should result in a coherent document in another language. In machine translation, however, for reasons of modeling and computational complexity, sentences are pieced together from words or phrases based on short context windows, with no access to extra-sentential context. In this thesis I propose ways to automatically assess the coherence of machine translation output. The work is structured around three dimensions: entity-based coherence, coherence as evidenced via syntactic patterns, and coherence as evidenced via discourse relations. For the first time, I evaluate existing monolingual coherence models on this new task, identifying issues and challenges that are specific to the machine translation setting. To address these issues, I adapt a state-of-the-art syntax model, which also improves performance on the monolingual task. The results clearly indicate that the new task is much more difficult than the task of detecting shuffled texts. I propose a new coherence model that explores the crosslingual transfer of discourse relations in machine translation. This model is novel in that it measures the correctness of a discourse relation by comparison to the source text rather than to a reference translation. I identify patterns of incoherence common across different language pairs and create a corpus of machine-translated output annotated with coherence errors for evaluation purposes. I then examine lexical coherence in a multilingual context, as a preliminary study for crosslingual transfer. Finally, I determine how the new and adapted models correlate with human judgements of translation quality and suggest that general evaluation within machine translation would benefit from a coherence component that evaluates the translation output with respect to the source text.

    Discourse Structure in Machine Translation Evaluation

    In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can help improve the correlation of various existing machine translation evaluation metrics with human judgments at both the segment level and the system level. This suggests that discourse information is complementary to the information used by many of the existing evaluation metrics, and thus it could be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTKparty. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular, we show that: (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation RST tree to the reference tree is positively correlated with translation quality. Comment: machine translation, machine translation evaluation, discourse analysis. Computational Linguistics, 201
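
The subtree-kernel comparison and the linear combination described in this abstract can be illustrated with a minimal sketch. It uses a generic Collins-Duffy style count of shared tree fragments, which is an assumption and not necessarily the exact kernel or tree encoding used in the article. The tuple-based tree format, the function names, and the `weight` parameter are all hypothetical.

```python
from itertools import product

# Assumed toy encoding: a tree is a tuple (label, child, child, ...);
# a leaf is a plain string. This is not the article's own RST representation.

def nodes(tree):
    """Yield every internal node (tuple) of the tree."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from nodes(child)

def production(node):
    """Label of the node followed by the labels of its direct children."""
    return (node[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])

def common_subtrees(n1, n2):
    """Count the tree fragments rooted at n1 and n2 that are identical."""
    if production(n1) != production(n2):
        return 0
    count = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            count *= 1 + common_subtrees(c1, c2)
    return count

def tree_kernel(t1, t2):
    """Sum the shared-fragment counts over all pairs of internal nodes."""
    return sum(common_subtrees(a, b) for a, b in product(nodes(t1), nodes(t2)))

def normalized_kernel(t1, t2):
    """Scale the kernel to [0, 1] so that identical trees score 1.0."""
    denom = (tree_kernel(t1, t1) * tree_kernel(t2, t2)) ** 0.5
    return tree_kernel(t1, t2) / denom if denom else 0.0

def combined_score(base_metric, discourse_similarity, weight=0.5):
    """Hypothetical linear combination; in practice the weight would be tuned."""
    return base_metric + weight * discourse_similarity
```

Here `normalized_kernel` would be applied to the discourse trees of a translation and its reference, and `combined_score` would add the result to an existing metric score for the same segment.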

    A Quantitative Analysis of Discourse Phenomena in Machine Translation

    State-of-the-art Machine Translation (MT) systems translate documents by considering isolated sentences, disregarding information beyond the sentence level. As a result, machine-translated documents often contain problems related to discourse coherence and cohesion. Recently, some initiatives in the evaluation and quality estimation of MT outputs have attempted to detect discourse problems in order to assess the quality of these machine translations. However, a quantitative analysis of discourse phenomena in MT outputs is still needed in order to better understand the phenomena and identify possible solutions or ways to improve evaluation. This paper aims to answer the following questions: What is the impact of discourse phenomena on MT quality? Can we capture and quantitatively measure any issues related to discourse in MT outputs? To answer these questions, we present a quantitative analysis of several discourse phenomena and correlate the resulting figures with scores from automatic translation quality evaluation metrics. We show that figures related to discourse phenomena correlate more strongly with quality scores than the baseline counts widely used for quality estimation of MT.
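
The idea of correlating discourse-related figures with metric scores can be sketched as follows. The abstract does not say which phenomena are counted or which correlation coefficient is used, so the connective count and the Pearson coefficient below are assumptions chosen for illustration, and both function names are hypothetical.

```python
import re
from math import sqrt

def count_connectives(text, connectives):
    """Count tokens of `text` that appear in a caller-supplied set of
    lowercase discourse connectives (an assumed, illustrative feature)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in connectives)

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / sqrt(var_x * var_y)
```

One count per document would go in `xs` and the matching automatic metric score in `ys`; the same call with a baseline feature in `xs` gives the comparison the abstract describes.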

    Discourse Cohesion in Chinese-English Statistical Machine Translation

    In discourse, cohesion is a required component of meaningful and well-organised text. It establishes the relationship between different elements in the text using a number of devices such as pronouns, determiners, and conjunctions. In translation, a well-translated document will display the correct cohesion and use of cohesive devices that are pertinent to the language. However, not all languages have the same cohesive devices or use them in the same way. In statistical machine translation this is a particular barrier to generating smooth translations, especially when sentences in parallel corpora are treated in isolation and no extra meaning or cohesive context is provided beyond the sentential level. In this thesis, focussing on Chinese and English as the language pair, we examine discourse cohesion in statistical machine translation, looking at ways that systems can leverage discourse cues and signals in order to produce smoother translations. We also provide a statistical model that improves translation output by adding additional tokens within the text that can be used to leverage extra information. A significant part of this research involved visualising many of the results and system outputs, and so an overview of two important pieces of visualisation software that we developed is also included.

    Detecting translingual plagiarism and the backlash against translation plagiarists

    Plagiarism detection methods have improved significantly over the last decades, and as a result of the advanced research conducted by computational and mostly forensic linguists, simple and sophisticated textual borrowing strategies can now be identified more easily. In particular, simple text comparison algorithms developed by computational linguists allow literal, word-for-word plagiarism (i.e. where identical strings of text are reused across different documents) to be easily detected (semi-)automatically (e.g. Turnitin or SafeAssign), although these methods tend to perform less well when the borrowing is obfuscated by introducing edits to the original text. In this case, more sophisticated linguistic techniques, such as an analysis of lexical overlap (Johnson, 1997), are required to detect the borrowing. However, these have limited applicability in cases of translingual plagiarism, where a text is translated and borrowed without acknowledgment from an original in another language. Considering that (a) traditionally non-professional translation (e.g. literal or free machine translation) is the method used to plagiarise; (b) the plagiarist usually edits the text for grammar and syntax, especially when machine-translated; and (c) lexical items are those that tend to be translated more correctly, and carried over to the derivative text, this paper proposes a method for translingual plagiarism detection that is grounded on translation and interlanguage theories (Selinker, 1972; Bassnett and Lefevere, 1998), as well as on the principle of linguistic uniqueness (Coulthard, 2004). Empirical evidence from the CorRUPT corpus (Corpus of Reused and Plagiarised Texts), a corpus of real academic and non-academic texts that were investigated and accused of plagiarising originals in other languages, is used to illustrate the applicability of the methodology proposed for translingual plagiarism detection. Finally, applications of the method as an investigative tool in forensic contexts are discussed.

    Thematic Annotation: extracting concepts out of documents

    Contrary to standard approaches to topic annotation, the technique used in this work does not centrally rely on some sort of (possibly statistical) keyword extraction. Instead, the proposed annotation algorithm uses a large-scale semantic database, the EDR Electronic Dictionary, that provides a concept hierarchy based on hyponym and hypernym relations. This concept hierarchy is used to generate a synthetic representation of the document by aggregating the words present in topically homogeneous document segments into a set of concepts best preserving the document's content. This new extraction technique uses an unexplored approach to topic selection. Instead of using semantic similarity measures based on a semantic resource, the latter is processed to extract the part of the conceptual hierarchy relevant to the document content. Then this conceptual hierarchy is searched to extract the most relevant set of concepts to represent the topics discussed in the document. Notice that this algorithm is able to extract generic concepts that are not directly present in the document. Comment: Technical report EPFL/LIA. 81 pages, 16 figures
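
A minimal sketch can show how a hypernym hierarchy yields a generic concept that does not itself occur in the text. It is not the report's algorithm: the child-to-parent dictionary is a hypothetical stand-in for the EDR hierarchy, each concept is assumed to have a single parent, the hierarchy is assumed to be acyclic, and a set of words is simply reduced to its most specific shared ancestor.

```python
def ancestors(concept, parent):
    """Path from `concept` up to the root of the hierarchy, inclusive.
    `parent` maps each concept to its single hypernym (assumed acyclic)."""
    path = [concept]
    while concept in parent:
        concept = parent[concept]
        path.append(concept)
    return path

def lowest_common_concept(words, parent):
    """Most specific concept that subsumes every word, or None if there is none."""
    if not words:
        return None
    paths = [ancestors(word, parent) for word in words]
    shared = set(paths[0]).intersection(*paths[1:])
    # paths[0] is ordered from specific to generic, so the first hit is the lowest.
    for concept in paths[0]:
        if concept in shared:
            return concept
    return None
```

With a toy mapping in which `dog` and `cat` both point to `mammal`, the call `lowest_common_concept(["dog", "cat"], parent)` returns `mammal`, a concept absent from the input words.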

    Proceedings of the 17th Annual Conference of the European Association for Machine Translation

    Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT)

    New perspectives on cohesion and coherence: Implications for translation

    The contributions to this volume investigate relations of cohesion and coherence as well as instantiations of discourse phenomena and their interaction with information structure in multilingual contexts. Some contributions concentrate on procedures to analyze cohesion and coherence from a corpus-linguistic perspective. Others have a particular focus on textual cohesion in parallel corpora that include both originals and translated texts. Additionally, the papers in the volume discuss the nature of cohesion and coherence with implications for human and machine translation. The contributors are experts on discourse phenomena and textuality who address these issues from an empirical perspective. The chapters in this volume are grounded in the latest research, making this book useful to experts in discourse studies and computational linguistics, as well as to advanced students with an interest in these disciplines. We hope that this volume will serve as a catalyst to other researchers and will facilitate further advances in the development of cost-effective annotation procedures, the application of statistical techniques for the analysis of linguistic phenomena, and the elaboration of new methods for data interpretation in multilingual corpus linguistics and machine translation.