143 research outputs found

    Reflections on the Penn Discourse TreeBank, Comparable Corpora and Complementary Annotation

    The Penn Discourse Treebank (PDTB) was released to the public in 2008. It remains the largest manually annotated corpus of discourse relations to date. Its focus on discourse relations that are either lexically grounded in explicit discourse connectives or associated with sentential adjacency has not only facilitated its use in language technology and psycholinguistics but has also spawned the annotation of comparable corpora in other languages and genres. Given this situation, this paper has four aims: (1) to provide a comprehensive introduction to the PDTB for those who are unfamiliar with it; (2) to correct some wrong (or perhaps inadvertent) assumptions about the PDTB and its annotation that may have weakened previous results or the performance of decision procedures induced from the data; (3) to explain variations seen in the annotation of comparable resources in other languages and genres, which should allow developers of future comparable resources to recognize whether the variations are relevant to them; and (4) to enumerate and explain relationships between PDTB annotation and complementary annotation of other linguistic phenomena. The paper draws on work done by ourselves and others since the corpus was released.

    Some considerations on the use of main verbs to express rhetorical relations

    Rhetorical relations are typically expressed by discourse structuring devices that ensure textual cohesion and coherence. Resources such as the PDTB specifically target the annotation of these devices, while describing alternative lexicalizations of such relations (AltLex). Our preparatory work to develop a discourse treebank for Portuguese in the PDTB framework has provided grounds for some considerations regarding the status, in intra-sentential coherence, of main verbs that internally carry a causative meaning. We first focused on the annotation of the rhetorical senses Reason, Result, and Pragmatic_justification as expressed explicitly by discourse structuring devices (conjunctions, adverbs, phrases and prepositions), taken as elements that express a two-place semantic relation filled by propositional arguments. However, these relations are also frequently marked by other devices (AltLex).

    CRPC-DB – A Discourse Bank for Portuguese


    Discourse Relations and Connectives in Higher Text Structure

    The present article investigates the possibilities and limits of local (shallow) analysis of discourse coherence with respect to the phenomena of global coherence and higher composition of texts. We study corpora annotated with local discourse relations in Czech and partly in English to find clues in the local annotation that indicate a higher discourse structure. First, we classify patterns of subsequent or overlapping pairs of local relations, and hierarchies formed by nested local relations. Special attention is then given to relations crossing paragraph boundaries and their semantic types, and to paragraph-initial discourse connectives. In the third part, we examine situations in which annotators tend to mark a large argument (larger than one sentence) of a discourse relation even with a minimality principle annotation rule in place. Our analyses bring (i) new linguistic insights regarding coherence signals in local and higher contexts, e.g. detection and description of hierarchies of local discourse relations up to 5 levels in Czech and English, description of distribution differences in semantic types in cross-paragraph and other settings, identification of Czech connectives typical only of higher structures, or the detection of a prevalence of large left-sided arguments in locally annotated data; and (ii) as another type of contribution, some new reflections on the methodologies of the approaches under scrutiny.

    Discourse relations and conjoined VPs: automated sense recognition

    Sense classification of discourse relations is a sub-task of shallow discourse parsing. Discourse relations can occur both across sentences (inter-sentential) and within sentences (intra-sentential), and more than one discourse relation can hold between the same units. Using a newly available corpus of discourse-annotated intra-sentential conjoined verb phrases, we demonstrate a sequential classification system for their multi-label sense classification. We assess the importance of each feature used in the classification, the feature scope, and what is lost in moving from gold standard manual parses to the output of an off-the-shelf parser.

    Incorporating Annotator Uncertainty into Representations of Discourse Relations

    Annotation of discourse relations is known to be a difficult task, especially for non-expert annotators. In this paper, we investigate novice annotators' uncertainty in the annotation of discourse relations on spoken conversational data. We find that dialogue context (single turn, pair of turns within speaker, and pair of turns across speakers) is a significant predictor of confidence scores. We compute distributed representations of discourse relations from co-occurrence statistics that incorporate information about confidence scores and dialogue context. We perform a hierarchical clustering analysis using these representations and show that weighting discourse relation representations with information about confidence and dialogue context coherently models our annotators' uncertainty about discourse relation labels.

    What's Hard in English RST Parsing? Predictive Models for Error Analysis

    Despite recent advances in Natural Language Processing (NLP), hierarchical discourse parsing in the framework of Rhetorical Structure Theory remains challenging, and our understanding of the reasons for this is as yet limited. In this paper, we examine and model some of the factors associated with parsing difficulties in previous work: the existence of implicit discourse relations, challenges in identifying long-distance relations, out-of-vocabulary items, and more. In order to assess the relative importance of these variables, we also release two annotated English test-sets with explicit correct and distracting discourse markers associated with gold standard RST relations. Our results show that, as in shallow discourse parsing, the explicit/implicit distinction plays a role, but that long-distance dependencies are the main challenge, while lack of lexical overlap is less of a problem, at least for in-domain parsing. Our final model is able to predict where errors will occur with an accuracy of 76.3% for the bottom-up parser and 76.6% for the top-down parser. Comment: SIGDIAL 2023 camera-ready; 12 pages