
    Cross-lingual and cross-domain discourse segmentation of entire documents

    Discourse segmentation is a crucial step in building end-to-end discourse parsers. However, discourse segmenters only exist for a few languages and domains. Typically they only detect intra-sentential segment boundaries, assuming gold-standard sentence and token segmentation, and relying on high-quality syntactic parses and rich heuristics that are not generally available across languages and domains. In this paper, we propose statistical discourse segmenters for five languages and three domains that do not rely on gold pre-annotations. We also consider the problem of learning discourse segmenters when no labeled data is available for a language. Our fully supervised system obtains 89.5% F1 for English newswire, with slight drops in performance on other domains, and we report supervised and unsupervised (cross-lingual) results for five languages in total.
    Comment: To appear in Proceedings of ACL 2017.

    Cross-lingual RST Discourse Parsing

    Discourse parsing is an integral part of understanding information flow and argumentative structure in documents. Most previous research has focused on inducing and evaluating models from the English RST Discourse Treebank. However, discourse treebanks for other languages exist, including Spanish, German, Basque, Dutch and Brazilian Portuguese. The treebanks share the same underlying linguistic theory, but differ slightly in the way documents are annotated. In this paper, we present (a) a new discourse parser which is simpler than, yet competitive with, the state of the art for English (significantly better on 2 of 3 metrics), (b) a harmonization of discourse treebanks across languages, enabling us to present (c) what are, to the best of our knowledge, the first experiments on cross-lingual discourse parsing.
    Comment: To be published in EACL 2017, 13 pages.

    Does syntax help discourse segmentation? Not so much

    Discourse segmentation is the first step in building discourse parsers. Most work on discourse segmentation does not scale to real-world discourse parsing across languages, for two reasons: (i) models rely on constituent trees, and (ii) experiments have relied on gold-standard identification of sentence and token boundaries. We therefore investigate to what extent constituents can be replaced with universal dependencies, or left out completely, as well as how state-of-the-art segmenters fare in the absence of sentence boundaries. Our results show that dependency information is less useful than expected, but we provide a fully scalable, robust model that only relies on part-of-speech information, and show that it performs well across languages in the absence of any gold-standard annotation.
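
    The syntax-free setup argued for in the abstract can be pictured as token-level boundary labeling over part-of-speech tags alone. The sketch below is only an illustration under assumed feature templates and a generic classifier; it is not the authors' segmenter.

        # Minimal sketch: discourse segment boundaries predicted from POS context
        # windows only, with no constituent/dependency trees and no gold sentence
        # boundaries. Feature template and classifier are illustrative assumptions.
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.linear_model import LogisticRegression

        def pos_window(pos_tags, i, size=2):
            """POS tags in a window around position i, padded at the edges."""
            feats = {}
            for offset in range(-size, size + 1):
                j = i + offset
                feats[f"pos[{offset}]"] = pos_tags[j] if 0 <= j < len(pos_tags) else "PAD"
            return feats

        # Toy data: one running POS sequence (no sentence splits); 1 = segment-initial token.
        pos_tags = ["DET", "NOUN", "VERB", "SCONJ", "NOUN", "VERB", "PUNCT",
                    "PRON", "VERB", "ADV", "SCONJ", "PRON", "VERB"]
        labels = [1, 0, 0, 1, 0, 0, 0,
                  1, 0, 0, 1, 0, 0]

        X = [pos_window(pos_tags, i) for i in range(len(pos_tags))]
        vec = DictVectorizer()
        clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), labels)
        print(clf.predict(vec.transform(X)))  # predicted boundary flags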

    Implicit Register Marking in German via Metaphor and Metonymy

    In this article, we examined implicit register marking through metaphor and metonymy. Specifically, we intended to analyse the ways in which metaphor and metonymy are used to mark register properties in selected text types. As an empirical basis for this investigation, we annotated a corpus of six text types that exhibit diversity along a number of important register properties (e.g., persuasivity, literality/orality, or hierarchical vs. equal relations between interlocutors). Our results show a strong dependence of metaphor and metonymy on persuasivity, whereas no such dependence was found with respect to literality and orality. Instead, we found a new register property, viz., length restriction, to be strongly correlated with metonymy. Previous results on a correlation between metonymy and interlocutor equality were also confirmed.

    Constructing a Lexicon of Dutch Discourse Connectives

    We present a lexicon of Dutch Discourse Connectives (DisCoDict). Its content was obtained using a two-step process, in which we first exploited a parallel corpus and a German seed lexicon, and then manually evaluated the candidate entries against existing connective resources for Dutch, using these resources to complete our lexicon. We compared connective definitions in the research traditions of the two languages and accommodated the differences in our final lexicon. The DisCoDict lexicon is made publicly available, both human- and machine-readable, and targeted at practical use cases in the domain of automatic discourse parsing. It also supports manual investigations of discourse structure and its lexical signals.
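
    The first, corpus-based step of the two-step process described above can be pictured as projecting a German seed lexicon through word alignments in a parallel corpus and collecting the Dutch tokens it aligns to. The sketch below is a rough illustration under assumed inputs (a toy seed list, pre-computed alignments, an arbitrary frequency cut-off); it is not the actual DisCoDict pipeline.

        # Rough sketch of candidate extraction: collect Dutch translation candidates
        # for German seed connectives from a word-aligned parallel corpus, then keep
        # frequent candidates for manual review (step two of the procedure).
        from collections import Counter, defaultdict

        german_seed = {"weil", "obwohl", "deshalb"}  # toy seed connectives

        # Each item: (German tokens, Dutch tokens, word alignment pairs (de_idx, nl_idx)).
        parallel = [
            (["ich", "blieb", ",", "weil", "es", "regnete"],
             ["ik", "bleef", ",", "omdat", "het", "regende"],
             [(0, 0), (1, 1), (3, 3), (4, 4), (5, 5)]),
            (["weil", "er", "krank", "war"],
             ["omdat", "hij", "ziek", "was"],
             [(0, 0), (1, 1), (2, 2), (3, 3)]),
        ]

        candidates = defaultdict(Counter)
        for de_toks, nl_toks, alignment in parallel:
            for de_i, nl_i in alignment:
                if de_toks[de_i].lower() in german_seed:
                    candidates[de_toks[de_i].lower()][nl_toks[nl_i].lower()] += 1

        # Candidates seen at least twice are passed on to manual evaluation.
        for seed, counts in candidates.items():
            frequent = [nl for nl, n in counts.items() if n >= 2]
            print(seed, "->", frequent)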

    Neue Wege der linguistischen Diskursforschung: computerbasierte Verfahren der Argumentanalyse

    This article describes and discusses a novel combination of quantitative and qualitative methods for analysing big data in linguistic discourse research. The proposed approach combines methods of discourse-linguistic argumentation analysis with methods of linguistic text mining. The goal of this methodological development is a computer-assisted procedure for the semi-automated identification and analysis of arguments in large text corpora. The procedure is tested on a discourse about infrastructure measures. The article presents linguistic devices that co-occur in the corpus and can therefore be regarded as features of argument patterns. Such argument patterns can indicate the presence of arguments and the ways they are used in texts.
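
    The core idea of the semi-automated step, surfacing passages in which several linguistic markers of an argument pattern co-occur so that they can then be inspected qualitatively, can be sketched as follows. The marker lists and the co-occurrence threshold are invented for illustration and are not taken from the study.

        # Illustrative sketch: flag passages where several markers of a hypothetical
        # "necessity" argument pattern co-occur, so they can be passed on for
        # qualitative analysis. Marker lists and threshold are assumptions only.
        import re

        pattern_markers = {
            "causal_connective": re.compile(r"\b(weil|da|denn)\b", re.IGNORECASE),
            "necessity_modal": re.compile(r"\b(muss|müssen|notwendig)\b", re.IGNORECASE),
            "infrastructure_term": re.compile(r"\b(ausbau|infrastruktur|strecke)\b", re.IGNORECASE),
        }

        def score_passage(text, markers, min_hits=2):
            """Return which markers fire; a passage 'matches' if enough co-occur."""
            hits = {name for name, rx in markers.items() if rx.search(text)}
            return hits, len(hits) >= min_hits

        corpus = [
            "Der Ausbau ist notwendig, weil die Strecke überlastet ist.",
            "Das Wetter war gestern sehr schön.",
        ]

        for passage in corpus:
            hits, match = score_passage(passage, pattern_markers)
            print(match, sorted(hits), "-", passage)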

    OB1-reader: A model of word recognition and eye movements in text reading

    Decades of reading research have led to sophisticated accounts of single-word recognition and, in parallel, accounts of eye-movement control in text reading. Although these two endeavors have strongly advanced the field, their relative independence has precluded an integrated account of the reading process. To bridge the gap, we here present a computational model of reading, OB1-reader, which integrates insights from both literatures. Key features of OB1 are as follows: (1) parallel processing of multiple words, modulated by an attentional window of adaptable size; (2) coding of input through a layer of open bigram nodes that represent pairs of letters and their relative position; (3) activation of word representations based on constituent bigram activity, competition with other word representations and contextual predictability; (4) mapping of activated words onto a spatiotopic sentence-level representation to keep track of word order; and (5) saccade planning, with the saccade goal being dependent on the length and activation of surrounding word units, and the saccade onset being influenced by word recognition. A comparison of simulation results with experimental data shows that the model provides a fruitful and parsimonious theoretical framework for understanding reading behavior.
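
    Point (2), the open-bigram input code, is concrete enough to illustrate with a small sketch: a letter string is represented by the set of ordered letter pairs whose positions lie at most a few slots apart, and candidate words can be scored by bigram overlap. The gap limit and the overlap score below are simplifying assumptions, not the activation equations of OB1-reader.

        # Small sketch of open-bigram coding: a word is coded as the set of ordered
        # letter pairs whose positions are at most `max_gap` apart, and lexical
        # candidates are scored by bigram overlap with the visual input. The gap
        # limit and the overlap measure are simplifying assumptions for illustration.

        def open_bigrams(word, max_gap=2):
            """Ordered letter pairs whose positions are at most `max_gap` apart."""
            word = word.lower()
            return {
                word[i] + word[j]
                for i in range(len(word))
                for j in range(i + 1, min(i + 1 + max_gap, len(word)))
            }

        def overlap_score(input_word, lexical_word):
            """Proportion of the lexical word's bigrams present in the input."""
            lex = open_bigrams(lexical_word)
            return len(open_bigrams(input_word) & lex) / len(lex) if lex else 0.0

        lexicon = ["form", "from", "fort", "farm"]
        visual_input = "form"

        # Transposition neighbours such as "from" keep a high score, reflecting the
        # position-invariant flavour of open-bigram coding.
        for candidate in lexicon:
            print(candidate, round(overlap_score(visual_input, candidate), 2))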