Reflections on the Penn Discourse TreeBank, Comparable Corpora and Complementary Annotation
The Penn Discourse Treebank (PDTB) was released to the public in 2008. It remains the largest manually annotated corpus of discourse relations to date. Its focus on discourse relations that are either lexically grounded in explicit discourse connectives or associated with sentential adjacency has not only facilitated its use in language technology and psycholinguistics but has also spawned the annotation of comparable corpora in other languages and genres. Given this situation, this paper has four aims: (1) to provide a comprehensive introduction to the PDTB for those who are unfamiliar with it; (2) to correct some wrong (or perhaps inadvertent) assumptions about the PDTB and its annotation that may have weakened previous results or the performance of decision procedures induced from the data; (3) to explain variations seen in the annotation of comparable resources in other languages and genres, which should allow developers of future comparable resources to recognize whether the variations are relevant to them; and (4) to enumerate and explain relationships between PDTB annotation and complementary annotation of other linguistic phenomena. The paper draws on work done by ourselves and others since the corpus was released.
Some considerations on the use of main verbs to express rhetorical relations
Rhetorical relations are typically expressed by discourse structuring devices that ensure
textual cohesion and coherence. Resources such as the PDTB target specifically the
annotation of these devices, while describing alternative lexicalizations of such relations
(AltLex).
Our preparatory work to develop a discourse treebank for Portuguese in the PDTB framework
has provided ground for some considerations regarding the status, in intra-sentential coherence,
of main verbs that internally carry a causative meaning. We have first focused on the annotation of
the rhetorical senses Reason, Result, Pragmatic_justification as expressed explicitly by discourse
structuring devices (conjunctions, adverbs, phrases and prepositions), taken as elements that
express a two-place semantic relation filled by propositional arguments. However, these
relations are also frequently marked by other devices (AltLex).
CRPC-DB – A Discourse Bank for Portuguese
Discourse Relations and Connectives in Higher Text Structure
The present article investigates possibilities and limits of local (shallow) analysis of discourse coherence with respect to the phenomena of global coherence and higher composition of texts. We study corpora annotated with local discourse relations in Czech and partly in English to try to find clues in the local annotation indicating a higher discourse structure. First, we classify patterns of subsequent or overlapping pairs of local relations, and hierarchies formed by nested local relations. Special attention is then given to relations crossing paragraph boundaries and their semantic types, and to paragraph-initial discourse connectives. In the third part, we examine situations in which annotators tend to mark a large argument (larger than one sentence) of a discourse relation even with a minimality principle annotation rule in place. Our analyses bring (i) new linguistic insights regarding coherence signals in local and higher contexts, e.g. detection and description of hierarchies of local discourse relations up to 5 levels in Czech and English, description of distribution differences in semantic types in cross-paragraph and other settings, identification of Czech connectives typical only of higher structures, or the detection of prevalence of large left-sided arguments in locally annotated data; (ii) as another type of contribution, some new reflections on methodologies of the approaches under scrutiny.
Discourse relations and conjoined VPs: automated sense recognition
Sense classification of discourse relations is a sub-task of shallow discourse parsing. Discourse relations can occur both across sentences (inter-sentential) and within sentences (intra-sentential), and more than one discourse relation can hold between the same units. Using a newly available corpus of discourse-annotated intra-sentential conjoined verb phrases, we demonstrate a sequential classification system for their multi-label sense classification. We assess the importance of each feature used in the classification, the feature scope, and what is lost in moving from gold standard manual parses to the output of an off-the-shelf parser.
Incorporating Annotator Uncertainty into Representations of Discourse Relations
Annotation of discourse relations is a known difficult task, especially for
non-expert annotators. In this paper, we investigate novice annotators'
uncertainty on the annotation of discourse relations on spoken conversational
data. We find that dialogue context (single turn, pair of turns within speaker,
and pair of turns across speakers) is a significant predictor of confidence
scores. We compute distributed representations of discourse relations from
co-occurrence statistics that incorporate information about confidence scores
and dialogue context. We perform a hierarchical clustering analysis using these
representations and show that weighting discourse relation representations with
information about confidence and dialogue context coherently models our
annotators' uncertainty about discourse relation labels.
What's Hard in English RST Parsing? Predictive Models for Error Analysis
Despite recent advances in Natural Language Processing (NLP), hierarchical
discourse parsing in the framework of Rhetorical Structure Theory remains
challenging, and our understanding of the reasons for this is as yet limited.
In this paper, we examine and model some of the factors associated with parsing
difficulties in previous work: the existence of implicit discourse relations,
challenges in identifying long-distance relations, out-of-vocabulary items, and
more. In order to assess the relative importance of these variables, we also
release two annotated English test-sets with explicit correct and distracting
discourse markers associated with gold standard RST relations. Our results show
that as in shallow discourse parsing, the explicit/implicit distinction plays a
role, but that long-distance dependencies are the main challenge, while lack of
lexical overlap is less of a problem, at least for in-domain parsing. Our final
model is able to predict where errors will occur with an accuracy of 76.3% for
the bottom-up parser and 76.6% for the top-down parser.
Comment: SIGDIAL 2023 camera-ready; 12 page