40,749 research outputs found

    ProppML: A Complete Annotation Scheme for Proppian Morphologies

    We give a preliminary description of ProppML, an annotation scheme designed to capture all the components of a Proppian-style morphological analysis of narratives. This work represents the first fully complete annotation scheme for Proppian morphologies, going beyond previous annotation schemes such as PftML, ProppOnto, Bod et al., and our own prior work. Using ProppML we have annotated Propp's morphology on fifteen tales (18,862 words) drawn from his original corpus of Russian folktales. This is a significantly larger set of data than annotated in previous studies. This pilot corpus was constructed via double annotation by two highly trained annotators, whose annotations were then combined after discussion with a third highly trained adjudicator, resulting in gold standard data which is appropriate for training machine learning algorithms. Agreement measures calculated between both annotators show very good agreement (F_1 > 0.75, kappa > 0.9 for functions; F_1 > 0.6 for moves; and F_1 > 0.8, kappa > 0.6 for dramatis personae). This is the first robust demonstration of reliable annotation of Propp's system.
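The agreement figures quoted in this abstract (F_1 and kappa between two annotators) can be illustrated with a small sketch. It is not the authors' evaluation code: the function names `cohen_kappa` and `pairwise_f1`, the exact-match span criterion, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


def pairwise_f1(spans_a, spans_b):
    """F1 between two annotators' span sets, counting exact matches only."""
    set_a, set_b = set(spans_a), set(spans_b)
    if not set_a and not set_b:
        return 1.0  # nothing annotated by either: treated as full agreement
    matched = len(set_a & set_b)
    return 2 * matched / (len(set_a) + len(set_b))


# Toy data (hypothetical, not taken from the corpus).
annotator_1 = ["villainy", "villainy", "departure", "departure"]
annotator_2 = ["villainy", "villainy", "departure", "villainy"]
kappa = cohen_kappa(annotator_1, annotator_2)

spans_1 = [(0, 5, "hero"), (6, 9, "villain")]
spans_2 = [(0, 5, "hero"), (10, 12, "donor")]
f1 = pairwise_f1(spans_1, spans_2)
```

Kappa corrects raw agreement for chance, which is why a scheme can show high observed agreement and still a moderate kappa when one label dominates.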

    Towards a Corpus of Historical German Plays with Emotion Annotations

    In this paper, we present first work-in-progress annotation results of a project investigating computational methods of emotion analysis for historical German plays around 1800. We report on the development of an annotation scheme focussing on the annotation of emotions that are important from a literary studies perspective for this time span as well as on the annotation process we have developed. We annotate emotions expressed or attributed by characters of the plays in the written texts. The scheme consists of 13 hierarchically structured emotion concepts as well as the source (who experiences or attributes the emotion) and target (who or what is the emotion directed towards). We have conducted the annotation of five example plays of our corpus with two annotators per play and report on annotation distributions and agreement statistics. We were able to collect over 6,500 emotion annotations and identified a fair agreement for most concepts around a κ-value of 0.4. We discuss how we plan to improve annotator consistency and continue our work. The results also have implications for similar projects in the context of Digital Humanities.

    Situation entity annotation

    This paper presents an annotation scheme for a new semantic annotation task with relevance for analysis and computation at both the clause level and the discourse level. More specifically, we label the finite clauses of texts with the type of situation entity (e.g., eventualities, statements about kinds, or statements of belief) they introduce to the discourse, following and extending work by Smith (2003). We take a feature-driven approach to annotation, with the result that each clause is also annotated with fundamental aspectual class, whether the main NP referent is specific or generic, and whether the situation evoked is episodic or habitual. This annotation is performed (so far) on three sections of the MASC corpus, with each clause labeled by at least two annotators. In this paper we present the annotation scheme, statistics of the corpus in its current version, and analyses of both inter-annotator agreement and intra-annotator consistency

    The good, the bad and the implicit: a comprehensive approach to annotating explicit and implicit sentiment

    We present a fine-grained scheme for the annotation of polar sentiment in text that accounts for explicit sentiment (so-called private states) as well as implicit expressions of sentiment (polar facts). Polar expressions are annotated below sentence level and classified according to their subjectivity status. Additionally, they are linked to one or more targets with a specific polar orientation and intensity. Other components of the annotation scheme include source attribution and the identification and classification of expressions that modify polarity. In previous research, little attention has been given to implicit sentiment, which represents a substantial amount of the polar expressions encountered in our data. An English and Dutch corpus of financial newswire, consisting of over 45,000 words each, was annotated using our scheme. A subset of this corpus was used to conduct an inter-annotator agreement study, which demonstrated that the proposed scheme can be used to reliably annotate explicit and implicit sentiment in real-world textual data, making the created corpora a useful resource for sentiment analysis.

    TWITTIRÒ: an Italian Twitter Corpus with a Multi-layered Annotation for Irony

    Given the difficulties that still affect the correct identification of irony in Sentiment Analysis tasks, in this paper we describe the main issues that emerged during the development of a novel resource for Italian annotated for irony. The project mainly consists of the application to the Twitter corpus TWITTIRÒ of a multi-layered scheme for the fine-grained annotation of irony, as proposed in a multilingual setting and previously applied to French and English datasets (Karoui et al. 2017). In applying the annotation to this corpus, we outline and discuss the issues and peculiarities that emerged in using the semantic scheme for Twitter messages in Italian, thus shedding some light on the future directions that can be followed in the multilingual and cross-language perspective too. We present, in particular, an analysis of the annotation process and of the distribution of the labels of each layer involved in the scheme. This is supported by a discussion of the outcome of the annotation carried out by native Italian speakers in the development of the corpus. In particular, an in-depth discussion of the inter-annotator agreement and of the sources of disagreement is included. The result is a novel gold standard corpus for irony detection in Italian, which enriches the set of multilingual datasets available for this challenging task and is ready to be used as a benchmark in automatic irony detection experiments and evaluation campaigns.

    Linking discourse modes and situation entity types in a cross-linguistic corpus study

    The main contribution of this paper is a cross-linguistic empirical analysis of two interacting levels of linguistic analysis of written text: situation entity (SE) types, the semantic types of situations evoked by clauses of text, and discourse modes (DMs), a characterization of passages at the sub-document level. We adapt an existing annotation scheme for SEs in English to be used for German data, with a detailed discussion of the most important differences. We create the first parallel corpus annotated for SEs, and the first DM-annotated corpus. We find that: (a) the adapted scheme is supported by evidence from a large-scale experimental study; (b) SEs mainly correspond to each other in parallel text, and a large part of the mismatches are systematic; (c) the DM annotation task can be performed intuitively with reasonable agreement; and (d) the annotated DMs show the predicted differences in the distributions of SE types

    Iarg-AnCora: Spanish corpus annotated with implicit arguments

    This article presents the Spanish Iarg-AnCora corpus (400 k-words, 13,883 sentences) annotated with the implicit arguments of deverbal nominalizations (18,397 occurrences). We describe the methodology used to create it, focusing on the annotation scheme and criteria adopted. The corpus was manually annotated and an interannotator agreement test was conducted (81 % observed agreement) in order to ensure the reliability of the final resource. The annotation of implicit arguments results in an important gain in argument and thematic role coverage (128 % on average). It is the first corpus annotated with implicit arguments for the Spanish language with a wide coverage that is freely available. This corpus can subsequently be used by machine learning-based semantic role labeling systems, and for the linguistic analysis of implicit arguments grounded on real data. Semantic analyzers are essential components of current language technology applications, which need to obtain a deeper understanding of the text in order to make inferences at the highest level to obtain qualitative improvements in the results

    The SOFC-Exp Corpus and Neural Approaches to Information Extraction in the Materials Science Domain

    This paper presents a new challenging information extraction task in the domain of materials science. We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications, such as involved materials and measurement conditions. With this paper, we publish our annotation guidelines, as well as our SOFC-Exp corpus consisting of 45 open-access scholarly articles annotated by domain experts. A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition and slot filling tasks as well as high annotation quality. We also present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set. On all tasks, using BERT embeddings leads to large performance gains, but with increasing task complexity, adding a recurrent neural network on top seems beneficial. Our models will serve as competitive baselines in future work, and analysis of their performance highlights difficult cases when modeling the data and suggests promising research directions. Comment: Accepted for publication at ACL 202

    Annotated Corpus for Citation Context Analysis

    In this paper, we present a corpus composed of 85 scientific articles annotated with 2092 citations analyzed using context analysis. We obtained high inter-annotator agreement, which supports the reliability and reproducibility of the annotation performed independently by three coders. We applied this corpus to classify citations according to qualitative criteria using a medium-granularity categorization scheme enriched by annotated keywords and labels to obtain high granularity. The annotation schema handles three dimensions: purpose, polarity, and aspects. Citation purpose defines a functional classification: use, critique, comparison and background, with more specific classes established using keywords: Based on, Supply; Useful; Contrast; Acknowledge, Corroboration, Debate; Weakness and Hedges. Citation aspects complement the citation characterization: concept, method, data, tool, task, among others. Polarity has three levels: Positive, Negative and Neutral. We developed the schema and annotated the corpus focusing on applications for citation influence assessment, but we suggest that applications such as summary generation and information retrieval could also use this annotated corpus because of the organization of the scheme in clearly defined general dimensions.
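The three dimensions described in this abstract could be represented as a small record type. This is only a sketch under assumptions: the class and field names (`CitationAnnotation`, `citation_id`, `aspects`, `keywords`) are hypothetical and not taken from the corpus release. The enum values follow the labels listed in the abstract, and aspects are kept as free strings because the abstract's list ends with "among others".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Purpose(Enum):
    # The four purpose classes named in the abstract.
    USE = "use"
    CRITIQUE = "critique"
    COMPARISON = "comparison"
    BACKGROUND = "background"


class Polarity(Enum):
    # The three polarity levels named in the abstract.
    POSITIVE = "positive"
    NEGATIVE = "negative"
    NEUTRAL = "neutral"


@dataclass(frozen=True)
class CitationAnnotation:
    """One annotated citation (hypothetical layout, for illustration)."""
    citation_id: str
    purpose: Purpose
    polarity: Polarity
    aspects: Tuple[str, ...] = ()   # e.g. "concept", "method", "data"
    keywords: Tuple[str, ...] = ()  # e.g. "Based on", "Contrast"


# Toy example (invented, not drawn from the corpus).
example = CitationAnnotation(
    citation_id="c1",
    purpose=Purpose.USE,
    polarity=Polarity.POSITIVE,
    aspects=("method",),
    keywords=("Based on",),
)
```

Keeping purpose and polarity as closed enumerations while leaving aspects and keywords open mirrors the abstract's split between fixed classes and refinements by keyword.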