
    Automatic Identification of AltLexes using Monolingual Parallel Corpora

    The automatic identification of discourse relations is still a challenging task in natural language processing. Discourse connectives, such as "since" or "but", are the most informative cues for identifying explicit relations; however, discourse parsers typically use a closed inventory of such connectives. As a result, discourse relations signaled by markers outside these inventories (i.e. AltLexes) are not detected as effectively. In this paper, we propose a novel method that leverages parallel corpora in text simplification and lexical resources to automatically identify alternative lexicalizations that signal discourse relations. When applied to the Simple Wikipedia and Newsela corpora along with WordNet and the PPDB, the method allowed the automatic discovery of 91 AltLexes. Comment: 6 pages, Proceedings of Recent Advances in Natural Language Processing (RANLP 2017)

    Proceedings

    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 98 pages. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893

    Discovery of Ambiguous and Unambiguous Discourse Connectives via Annotation Projection

    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 83-92. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893

    Coherence in Machine Translation

    Coherence ensures individual sentences work together to form a meaningful document. When properly translated, a coherent document in one language should result in a coherent document in another language. In Machine Translation, however, due to reasons of modeling and computational complexity, sentences are pieced together from words or phrases based on short context windows and with no access to extra-sentential context. In this thesis I propose ways to automatically assess the coherence of machine translation output. The work is structured around three dimensions: entity-based coherence, coherence as evidenced via syntactic patterns, and coherence as evidenced via discourse relations. For the first time, I evaluate existing monolingual coherence models on this new task, identifying issues and challenges that are specific to the machine translation setting. In order to address these issues, I adapted a state-of-the-art syntax model, which also resulted in improved performance for the monolingual task. The results clearly indicate how much more difficult the new task is than the task of detecting shuffled texts. I proposed a new coherence model, exploring the crosslingual transfer of discourse relations in machine translation. This model is novel in that it measures the correctness of the discourse relation by comparison to the source text rather than to a reference translation. I identified patterns of incoherence common across different language pairs, and created a corpus of machine translated output annotated with coherence errors for evaluation purposes. I then examined lexical coherence in a multilingual context, as a preliminary study for crosslingual transfer. 
    Finally, I determine how the new and adapted models correlate with human judgements of translation quality, and suggest that general evaluation within machine translation would benefit from a coherence component that evaluates the translation output with respect to the source text.

    Inducing Discourse Resources Using Annotation Projection

    An important aspect of natural language understanding and generation involves the recognition and processing of discourse relations. Building applications such as text summarization, question answering and natural language generation needs human language technology beyond the level of the sentence. To address this need, large scale discourse annotated corpora such as the Penn Discourse Treebank (PDTB; Prasad et al., 2008a) have been developed. Manually constructing discourse resources (e.g. discourse annotated corpora) is expensive, both in terms of time and expertise. As a consequence, such resources are only available for a few languages. In this thesis, we propose an approach that automatically creates two types of discourse resources from parallel texts: 1) PDTB-style discourse annotated corpora and 2) lexicons of discourse connectives. Our approach is based on annotation projection where linguistic annotations are projected from a source language to a target language in parallel texts. Our work has made several theoretical contributions as well as practical contributions to the field of discourse analysis. From a theoretical perspective, we have proposed a method to refine the naive method of discourse annotation projection by filtering annotations that are not supported by parallel texts. Our approach is based on the intersection between statistical word-alignment models and can automatically identify 65% of unsupported projected annotations. We have also proposed a novel approach for annotation projection that is independent of statistical word-alignment models. This approach is more robust to longer discourse connectives than approaches based on statistical word-alignment models. From a practical perspective, we have automatically created the Europarl ConcoDisco corpora from English-French parallel texts of the Europarl corpus (Koehn, 2009). 
    In the Europarl ConcoDisco corpora, around 1 million occurrences of French discourse connectives are automatically aligned to their translation. From the French side of the Europarl ConcoDisco corpora, we have extracted our first significant resource, the FrConcoDisco corpora. To our knowledge, the FrConcoDisco corpora are the first PDTB-style discourse annotated corpora for French where French discourse connectives are annotated with the discourse relations that they signal. The FrConcoDisco corpora are significant in size, as they contain more than 25 times as many annotations as the PDTB. To evaluate the FrConcoDisco corpora, we showed how they can be used to train a classifier for the disambiguation of French discourse connectives with high performance. The second significant resource that we automatically extracted from parallel texts is ConcoLeDisCo, a lexicon of French discourse connectives mapped to PDTB discourse relations. While ConcoLeDisCo is useful by itself, as we showed in this thesis, it can also be used to improve the coverage of manually constructed lexicons of discourse connectives such as LEXCONN (Roze et al., 2012).