6 research outputs found

    Discourse Structure in Machine Translation Evaluation

    In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can help improve various existing machine translation evaluation metrics in terms of correlation with human judgments at both the segment and the system level. This suggests that discourse information is complementary to the information used by many of the existing evaluation metrics, and thus it could be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTKparty. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular, we show that: (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation RST tree to the reference tree is positively correlated with translation quality.
    Comment: machine translation, machine translation evaluation, discourse analysis. Computational Linguistics, 201

    Machine Translation with Many Manually Labeled Discourse Connectives

    The paper presents machine translation experiments from English to Czech with a large number of manually annotated discourse connectives. The gold-standard discourse relation annotation improves translation performance by 4–60% for some ambiguous English connectives and helps to find correct syntactic constructions in Czech for less ambiguous connectives. Automatic scoring confirms the stability of the newly built discourse-aware translation systems. Error analysis and human translation evaluation point to the cases where the annotation was most helpful and where it was less so.

    Connective-Lex: A Web-Based Multilingual Lexical Resource for Connectives

    In this paper, we present a tangible outcome of the TextLink network: a joint online database project displaying and linking existing and newly created lexicons of discourse connectives in multiple languages. We discuss the definition and demarcation of the class of connectives that should be included in such a resource, and present the syntactic, semantic/pragmatic, and lexicographic information we collected. Further, the technical implementation of the database and the search functionality are presented. We discuss how the multilingual integration of several connective lexicons provides added value for linguistic researchers and other users interested in connectives, by allowing crosslinguistic comparison and a direct linking between discourse relational devices in different languages. Finally, we provide pointers for possible future extensions both in breadth (i.e., by adding lexicons for additional languages) and depth (by extending the information provided for each connective item and by strengthening the crosslinguistic links).

    Elaboration of a RST Chinese Treebank

    As a subfield of Artificial Intelligence (AI), Natural Language Processing (NLP) aims to process human languages automatically. Studies from many different research fields have contributed productively to NLP. Among these fields, discourse analysis is becoming more and more popular, and discourse information is crucial for NLP studies. As the most spoken language in the world, Chinese occupies a very important position in NLP analysis. Therefore, this work presents a discourse treebank for Chinese whose theoretical framework is Rhetorical Structure Theory (RST) (Mann and Thompson, 1988). The research corpus consists of 50 Chinese texts and can be consulted at three annotation levels: segmentation, central unit (CU), and discourse structure. Finally, we create an open online interface for the Chinese treebank.

    Inducing Discourse Resources Using Annotation Projection

    An important aspect of natural language understanding and generation involves the recognition and processing of discourse relations. Building applications such as text summarization, question answering and natural language generation needs human language technology beyond the level of the sentence. To address this need, large scale discourse annotated corpora such as the Penn Discourse Treebank (PDTB; Prasad et al., 2008a) have been developed. Manually constructing discourse resources (e.g. discourse annotated corpora) is expensive, both in terms of time and expertise. As a consequence, such resources are only available for a few languages. In this thesis, we propose an approach that automatically creates two types of discourse resources from parallel texts: 1) PDTB-style discourse annotated corpora and 2) lexicons of discourse connectives. Our approach is based on annotation projection where linguistic annotations are projected from a source language to a target language in parallel texts. Our work has made several theoretical contributions as well as practical contributions to the field of discourse analysis. From a theoretical perspective, we have proposed a method to refine the naive method of discourse annotation projection by filtering annotations that are not supported by parallel texts. Our approach is based on the intersection between statistical word-alignment models and can automatically identify 65% of unsupported projected annotations. We have also proposed a novel approach for annotation projection that is independent of statistical word-alignment models. This approach is more robust to longer discourse connectives than approaches based on statistical word-alignment models. From a practical perspective, we have automatically created the Europarl ConcoDisco corpora from English-French parallel texts of the Europarl corpus (Koehn, 2009). 
In the Europarl ConcoDisco corpora, around 1 million occurrences of French discourse connectives are automatically aligned to their translation. From the French side of the Europarl ConcoDisco corpora, we have extracted our first significant resource, the FrConcoDisco corpora. To our knowledge, the FrConcoDisco corpora are the first PDTB-style discourse annotated corpora for French, in which French discourse connectives are annotated with the discourse relations that they signal. The FrConcoDisco corpora are significant in size, as they contain more than 25 times more annotations than the PDTB. To evaluate the FrConcoDisco corpora, we showed how they can be used to train a high-performing classifier for the disambiguation of French discourse connectives. The second significant resource that we automatically extracted from parallel texts is ConcoLeDisCo, a lexicon of French discourse connectives mapped to PDTB discourse relations. While ConcoLeDisCo is useful by itself, as we showed in this thesis, it can also be used to improve the coverage of manually constructed lexicons of discourse connectives such as LEXCONN (Roze et al., 2012).

    Discourse-level features for statistical machine translation

    Machine Translation (MT) has progressed tremendously in the past two decades. The rule-based and interlingua approaches have been superseded by statistical models, which learn the most likely translations from large parallel corpora. System design does not amount anymore to crafting syntactical transfer rules, nor does it rely on a semantic representation of the text. Instead, a statistical MT system learns the most likely correspondences and re-ordering of chunks of source words and target words from parallel corpora that have been word-aligned. With this procedure and millions of parallel source and target language sentences, systems can generate translations that are intelligible and require minimal post-editing efforts from the human user. Nevertheless, it has been recognized that the statistical MT paradigm may fall short of modeling a number of linguistic phenomena that are established beyond the phrase level. Research in statistical MT has addressed discourse phenomena explicitly only in the past four years. When it comes to textual coherence structure, cohesive ties relate sentences and entire paragraphs argumentatively to each other. This text structure has to be rendered appropriately in the target text so that it conveys the same meaning as the source text. The lexical and syntactical means through which these cohesive markers are expressed may diverge considerably between languages. Frequently, these markers include discourse connectives, which are function words such as however, instead, since, while, which relate spans of text to each other, e.g. for temporal ordering, contrast or causality. Moreover, to establish the same temporal ordering of events described in a text, the conjugation of verbs has to be coherently translated. The present thesis proposes methods for integrating discourse features into statistical MT. 
We pre-process the source text prior to automatic translation, focusing on two specific discourse phenomena: discourse connectives and verb tenses. Hand-crafted rules are not required in our proposal; instead, machine learning classifiers are implemented that learn to recognize discourse relations and predict translations of verb tenses. Firstly, we have designed new sets of semantically oriented features and classifiers to advance the state of the art in automatic disambiguation of discourse connectives. Here we profited from our multilingual setting and incorporated features that are based on MT and on the insights we gained from contrastive linguistic analysis of parallel corpora. In their best configurations, our classifiers reach high performance (0.7 to 1.0 F1 score) and can therefore reliably be used to automatically annotate the large corpora needed to train SMT systems. Issues of manual annotation and evaluation are discussed as well, and solutions are provided within new annotation and evaluation procedures. As a second contribution, we implemented entire SMT systems that can make use of the (automatically) annotated discourse information. Overall, the thesis confirms that these techniques are a practical solution that leads to global improvements in translation of 0.2 to 0.5 BLEU points. Further evaluation reveals that, in terms of connectives and verb tenses, our statistical MT systems improve the translation of these phenomena by up to 25%, depending on the performance of the automatic classifiers and on the data sets used.