4 research outputs found

    Disambiguating Discourse Connectives for Statistical Machine Translation

    This paper shows that the automatic labeling of discourse connectives with the relations they signal, prior to machine translation (MT), can be used by phrase-based statistical MT systems to improve their translations. This improvement is demonstrated here when translating from English to four target languages - French, German, Italian and Arabic - using several test sets from recent MT evaluation campaigns. Using automatically labeled data for training, tuning and testing MT systems is beneficial provided that the labels are sufficiently accurate, typically above 70%. To reach such an accuracy, a large array of features for discourse connective labeling (morpho-syntactic, semantic and discursive) is extracted using state-of-the-art tools and exploited in factored MT models. The translation of connectives is improved significantly, by between 0.7% and 10% as measured with the dedicated ACT metric. The improvements depend mainly on the level of ambiguity of the connectives in the test sets.

    RST Signalling Corpus: A Corpus of Signals of Coherence Relations

    We present the RST Signalling Corpus (Das et al. 2015, LDC2015T10, https://catalog.ldc.upenn.edu/LDC2015T10), a corpus annotated for signals of coherence relations. The corpus is developed over the RST Discourse Treebank (Carlson et al. 2002, LDC2002T07, https://catalog.ldc.upenn.edu/LDC2002T07), which is annotated for coherence relations. In the RST Signalling Corpus, these relations are further annotated with signalling information. The corpus includes annotation not only for discourse markers, which are considered to be the most typical (or sometimes the only) type of signal in discourse, but also for a wide array of other signals such as reference, lexical, semantic, syntactic, graphical and genre features as potential indicators of coherence relations. We describe the research underlying the development of the corpus and the annotation process, and provide details of the corpus. We also present the results of an inter-annotator agreement study, illustrating the validity and reproducibility of the annotation. The corpus is available through the Linguistic Data Consortium, and can be used to investigate the psycholinguistic mechanisms behind the interpretation of relations through signalling, and also to develop discourse-specific computational systems such as discourse parsing applications.

    An Evaluation of the Influence of a Document's Text-Type on the Use of Discourse Relations

    In this thesis, we discuss the work we have conducted on the relationship between discourse relations in English documents and their associated text-types. Obtaining an understanding of the text-type of a given document is a step towards identifying its larger discourse schema which, in turn, is instrumental in effectively identifying discourse relations. In order to study the relationship between discourse relations, discourse structures and the text-type of a document, we have created a corpus of documents belonging to seven distinct text-types, from which we extracted discourse relation annotations using already existing parsers. Utilizing the data obtained, we have studied various ways in which discourse relations and text-types are linked in an effort to better understand how discourse schemas can be identified and subsequently utilized in the automatic extraction of discourse relations. Our experiments have shown that the classification of documents within our seven text-types is still better performed with a bag-of-words approach, but the results obtained with the automatically extracted discourse relations suggest that there is in fact a link between text-types and the use of specific discourse relations. We also found that the various text-types are identified with varying accuracy, with text-types such as 'explanation' and 'report' being harder to identify, regardless of the methods used. Finally, our results also show that the cue phrases used to identify explicitly stated discourse relations are amongst the more informative features of our better performing bag-of-words model, and can be utilized to reduce the feature space of this particular model.