87 research outputs found

    Improved Coreference Resolution Using Cognitive Insights

    Coreference resolution is the task of extracting referential expressions, or mentions, in text and clustering these by the entity or concept they refer to. The sustained research interest in the task reflects the richness of referring expression usage in natural language and the difficulty of encoding insights from linguistic and cognitive theories effectively. In this thesis, we design and implement LIMERIC, a state-of-the-art coreference resolution engine. LIMERIC naturally incorporates both non-local decoding and entity-level modelling to achieve highly competitive performance of 64.22% and 59.99% on the CoNLL-2012 benchmark with a simple model and a baseline feature set. As well as strong performance, a key contribution of this work is a reconceptualisation of the coreference task. We draw an analogy between shift-reduce parsing and coreference resolution to develop an algorithm which naturally mimics cognitive models of human discourse processing. In our feature development work, we leverage insights from cognitive theories to improve our modelling. Each contribution achieves a statistically significant improvement, and together they sum to gains of 1.65% and 1.66% on the CoNLL-2012 benchmark, yielding performance values of 65.76% and 61.27%. For each novel feature we propose, we contribute an accompanying analysis so as to better understand how cognitive theories apply to real language data. LIMERIC is at once a platform for exploring cognitive insights into coreference and a viable alternative to current systems. We are excited by the promise of incorporating our own and further cognitive insights into more complex frameworks, since this has the potential to improve both the performance of computational models and our understanding of the mechanisms underpinning human reference resolution.

    Anaphora resolution for Arabic machine translation: a case study of nafs

    PhD Thesis
    In the age of the internet, email, and social media, there is an increasing need for processing online information, for example, to support education and business. This has led to the rapid development of natural language processing technologies such as computational linguistics, information retrieval, and data mining. As a branch of computational linguistics, anaphora resolution has attracted much interest. This is reflected in the large number of papers on the topic published in journals such as Computational Linguistics. Mitkov (2002) and Ji et al. (2005) have argued that the overall quality of anaphora resolution systems remains low, despite practical advances in the area, and that major challenges include dealing with real-world knowledge and accurate parsing. This thesis investigates the following research question: can an algorithm be found for the resolution of the anaphor nafs in Arabic text which is accurate to at least 90%, scales linearly with text size, and requires a minimum of knowledge resources? A resolution algorithm intended to satisfy these criteria is proposed. Testing on a corpus of contemporary Arabic shows that it does indeed satisfy the criteria.
    Egyptian Government

    Towards Entity Status

    Discourse entities are an important construct in computational linguistics. They introduce an additional level of representation between referring expressions and that which they refer to: the level of mental representation. In this thesis, I first explore some semiotic and communication-theoretic aspects of discourse entities. Then, I develop the concept of "entity status". Entity status is a meta-variable that combines two dimensions of information about a discourse entity: structural information about the role that an entity plays in a discourse, and management information about how the entity is created, accessed, and updated. Finally, the concept is applied to two case studies: the first one focusses on the choice of referring expressions in radio news, while the second looks at the conditions under which a discourse entity can be mentioned as a pronoun.

    Towards Data-Driven Style Checking: An Example for Law Texts

    We present a novel approach to detecting syntactic structures that are inadequate for their domain context. We define writing style in terms of the choices between alternatives and conduct an experiment in the legislative domain on the syntactic choice of nominalization in German, i.e. complex noun phrase vs. relative clause. In order to infer the stylistic choices that are conventional in the domain, we capture the contexts that affect the syntactic choice. Our results show that a data-driven binary classifier can be a viable method for modelling syntactic choices in a style-checking tool.

    Projection in discourse: A data-driven formal semantic analysis

    A sentence like "Bertrand, a famous linguist, wrote a book" contains different contributions: there is a person named "Bertrand", he is a famous linguist, and he wrote a book. These contributions convey different types of information; while the existence of Bertrand is presented as given information (it is presupposed), the other contributions signal new information. Moreover, the contributions are affected differently by linguistic constructions. The inference that Bertrand wrote a book disappears when the sentence is negated or turned into interrogative form, while the other contributions survive; this is called 'projection'. In this thesis, I investigate the relation between different types of contributions in a sentence from a theoretical and empirical perspective. I focus on projection phenomena, which include presuppositions ('Bertrand exists' in the aforementioned example) and conventional implicatures ('Bertrand is a famous linguist'). I argue that the differences between the contributions can be explained in terms of information status, which describes how content relates to the unfolding discourse context. Based on this analysis, I extend the widely used formal representational system Discourse Representation Theory (DRT) with an explicit representation of the different contributions made by projection phenomena; this extension is called 'Projective Discourse Representation Theory' (PDRT). I present a data-driven computational analysis based on data from the Groningen Meaning Bank, a corpus of semantically annotated texts. This analysis shows how PDRT can be used to learn more about different kinds of projection behaviour. These results can be used to improve linguistically oriented computational applications such as automatic translation systems.