26 research outputs found

    PoCoS – Potsdam Coreference Scheme

    Inter-Coder Agreement for Computational Linguistics

    This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff's alpha as well as Scott's pi and Cohen's kappa; discusses the use of coefficients in several annotation tasks; and argues that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in computational linguistics, may be more appropriate for many corpus annotation tasks, but that their use makes the interpretation of the value of the coefficient even harder.
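
As an illustration of two of the coefficients named in this abstract, the sketch below computes Cohen's kappa and Scott's pi for two annotators from their standard textbook definitions. It is not taken from the article; the function names and the toy labels are assumptions made for the example. Both coefficients share the form (A_o - A_e) / (1 - A_e) and differ only in how the expected agreement A_e is estimated.

```python
from collections import Counter


def observed_agreement(a, b):
    """Fraction of items on which the two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: expected agreement uses each annotator's own label distribution."""
    n = len(a)
    p_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum((count_a[k] / n) * (count_b[k] / n) for k in set(a) | set(b))
    # Undefined (division by zero) when p_e == 1, i.e. both use a single identical label.
    return (p_o - p_e) / (1 - p_e)


def scott_pi(a, b):
    """Scott's pi: expected agreement uses one label distribution pooled over both annotators."""
    n = len(a)
    p_o = observed_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    p_e = sum((c / (2 * n)) ** 2 for c in pooled.values())
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations: 1 = "coreferent", 0 = "not coreferent".
coder_a = [1, 1, 0, 0]
coder_b = [1, 0, 0, 0]
print(observed_agreement(coder_a, coder_b))  # 0.75
print(cohen_kappa(coder_a, coder_b))         # 0.5
print(scott_pi(coder_a, coder_b))            # about 0.467
```

On this toy data the two coefficients differ even though the observed agreement is the same, because the annotators' individual label distributions are not identical.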

    Topic-Continuity and Topic-Shift Effects in Spanish Discourse: A Comparative Analysis of Referring Expressions

    Differences in use among referring expressions are usually explained on the basis of the cognitive accessibility of their antecedents, where antecedent accessibility has been operationalized differently in the literature; i.e. as a grammatical role, as syntactic prominence or as antecedent distance. On these grounds, it has been proposed that personal pronouns prefer topical antecedents whereas demonstratives prefer non-topical antecedents. This paper investigates the referring properties of Spanish demonstratives and direct object personal pronouns with the aim of unveiling their differences and similarities. My analysis shows that these two expressions are very similar referentially when a narrow view of discourse context is considered. However, important differences show up when a broader notion of context is taken into account; i.e. contexts that extend beyond the immediately preceding sentence and beyond the immediate local topic of discourse. Based on my corpus evidence and on previous research on the pragmatic interpretation of referring expressions, I claim that direct object personal pronouns and demonstrative noun phrases crucially differ in the way they contribute to discourse coherence; the former playing the role of topic continuity markers and the latter focalising referents that reintroduce suspended or declining topics and marking (sub)-topic shifts in the discourse.

    Towards interoperable discourse annotation: discourse features in the Ontologies of Linguistic Annotation

    This paper describes the extension of the Ontologies of Linguistic Annotation (OLiA) with respect to discourse features. The OLiA ontologies provide a terminology repository that can be employed to facilitate the conceptual (semantic) interoperability of annotations of discourse phenomena as found in the most important corpora available to the community, including OntoNotes, the RST Discourse Treebank and the Penn Discourse Treebank. Along with selected schemes for information structure and coreference, discourse relations are discussed with special emphasis on the Penn Discourse Treebank and the RST Discourse Treebank. For an example contained in the intersection of both corpora, I show how ontologies can be employed to generalize over divergent annotation schemes.

    Review of coreference resolution in English and Persian

    Coreference resolution (CR) is one of the most challenging areas of natural language processing. This task seeks to identify all textual references to the same real-world entity. Research in this field is divided into coreference resolution and anaphora resolution. Due to its application in textual comprehension and its utility in other tasks such as information extraction systems, document summarization, and machine translation, this field has attracted considerable interest. Consequently, it has a significant effect on the quality of these systems. This article reviews the existing corpora and evaluation metrics in this field. Then, an overview of the coreference algorithms, from rule-based methods to the latest deep learning techniques, is provided. Finally, coreference resolution and pronoun resolution systems in Persian are investigated. Comment: 44 pages, 11 figures, 5 tables.

    Iarg-AnCora: Spanish corpus annotated with implicit arguments

    This article presents the Spanish Iarg-AnCora corpus (400k words, 13,883 sentences) annotated with the implicit arguments of deverbal nominalizations (18,397 occurrences). We describe the methodology used to create it, focusing on the annotation scheme and criteria adopted. The corpus was manually annotated and an inter-annotator agreement test was conducted (81% observed agreement) in order to ensure the reliability of the final resource. The annotation of implicit arguments results in an important gain in argument and thematic role coverage (128% on average). It is the first freely available, wide-coverage corpus annotated with implicit arguments for the Spanish language. This corpus can subsequently be used by machine learning-based semantic role labeling systems, and for the linguistic analysis of implicit arguments grounded on real data. Semantic analyzers are essential components of current language technology applications, which need to obtain a deeper understanding of the text in order to make inferences at the highest level to obtain qualitative improvements in the results.

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyzed five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missing in the published version due to the publication policies. Please contact Prof. Erik Cambria for details.

    Anaphora resolution for Arabic machine translation: a case study of nafs

    PhD Thesis
    In the age of the internet, email, and social media there is an increasing need for processing online information, for example, to support education and business. This has led to the rapid development of natural language processing technologies such as computational linguistics, information retrieval, and data mining. As a branch of computational linguistics, anaphora resolution has attracted much interest. This is reflected in the large number of papers on the topic published in journals such as Computational Linguistics. Mitkov (2002) and Ji et al. (2005) have argued that the overall quality of anaphora resolution systems remains low, despite practical advances in the area, and that major challenges include dealing with real-world knowledge and accurate parsing. This thesis investigates the following research question: can an algorithm be found for the resolution of the anaphor nafs in Arabic text which is accurate to at least 90%, scales linearly with text size, and requires a minimum of knowledge resources? A resolution algorithm intended to satisfy these criteria is proposed. Testing on a corpus of contemporary Arabic shows that it does indeed satisfy the criteria.
    Egyptian Government

    Harnessing Collective Intelligence on Social Networks

    Crowdsourcing is an approach to replace the work traditionally done by a single person with the collective action of a group of people via the Internet. It has established itself in the mainstream of research methodology in recent years using a variety of approaches to engage humans in solving problems that computers, as yet, cannot solve. Several common approaches to crowdsourcing have been successful, including peer production (in which the participants are inherently interested in contributing), microworking (in which participants are paid small amounts of money per task) and games or gamification (in which the participants are entertained as they complete the tasks). An alternative approach to crowdsourcing using social networks is proposed here. Social networks offer access to large user communities through integrated software applications and, as they mature, are utilised in different ways, with decentralised and unevenly distributed organisation of content. This research investigates whether collective intelligence systems are facilitated better on social networks and how the contributed human effort can be optimised. These questions are investigated using two case studies of problem solving: anaphoric coreference in text documents and classifying images in the marine biology domain. Social networks themselves can be considered inherent, self-organised problem solving systems, an approach defined here as "groupsourcing", sharing common features with other crowdsourcing approaches; however, the benefits are tempered with the many challenges this approach presents. In comparison to other methods of crowdsourcing, harnessing collective intelligence on social networks offers a high-accuracy, data-driven and low-cost approach.