
    A Corpus-Based Investigation of Definite Description Use

    We present the results of a study of definite description use in written texts aimed at assessing the feasibility of annotating corpora with information about definite description interpretation. We ran two experiments, in which subjects were asked to classify the uses of definite descriptions in a corpus of 33 newspaper articles, containing a total of 1412 definite descriptions. We measured the agreement among annotators about the classes assigned to definite descriptions, as well as the agreement about the antecedent assigned to those definites that the annotators classified as being related to an antecedent in the text. The most interesting result of this study from a corpus annotation perspective was the rather low agreement (K=0.63) that we obtained using versions of Hawkins' and Prince's classification schemes; better results (K=0.76) were obtained using the simplified scheme proposed by Fraurud that includes only two classes, first-mention and subsequent-mention. The agreement about antecedents was also not complete. These findings raise questions concerning the strategy of evaluating systems for definite description interpretation by comparing their results with a standardized annotation. From a linguistic point of view, the most interesting observations were the great number of discourse-new definites in our corpus (in one of our experiments, about 50% of the definites in the collection were classified as discourse-new, 30% as anaphoric, and 18% as associative/bridging) and the presence of definites which did not seem to require a complete disambiguation. Comment: 47 pages, uses fullname.sty and palatino.st
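
    The agreement figures above (K=0.63, K=0.76) are chance-corrected agreement scores. As a rough illustration of how such a score is computed, the sketch below implements the two-annotator form, Cohen's kappa. The abstract does not say which variant of the statistic the authors used, so this choice is an assumption, and the function name and the toy label lists are hypothetical, not data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items that received the same label twice.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy annotations using a two-class scheme (assumed labels).
annotator_1 = ["first", "first", "subsequent", "first", "subsequent",
               "subsequent", "first", "first", "subsequent", "first"]
annotator_2 = ["first", "first", "subsequent", "subsequent", "subsequent",
               "subsequent", "first", "first", "first", "first"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

    In this toy case the annotators agree on 8 of 10 items, but because half of that agreement is expected by chance, kappa comes out well below the raw 80% figure. This is why a value such as 0.63 can count as "rather low" even when annotators agree on most items.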

    Alts, Abbreviations, and AKAs: historical onomastic variation and automated named entity recognition

    The accurate automated identification of named places is a major concern for scholars in the digital humanities, and especially for those engaged in research that depends upon the gazetteer-led recognition of specific aspects. The field of onomastics examines the linguistic roots and historical development of names, which have for the most part only standardised into single officially recognised forms since the late nineteenth century. Even slight spelling variations can introduce errors in geotagging techniques, and these differences in place-name spellings are thus vital considerations when seeking high rates of correct geospatial identification in historical texts. This article offers an overview of typical name-based variation that can cause issues in the accurate geotagging of any historical resource. The article argues that the careful study and documentation of these variations can assist in the development of more complete onymic records, which in turn may inform geotaggers through a cycle of variational recognition. It demonstrates how patterns in regional naming variation and development, across both specific and generic name elements, can be identified through the historical records of each known location. The article uses examples taken from a digitised corpus of writing about the English Lake District, a collection of 80 texts that date from between 1622 and 1900. Four of the more complex spelling-based problems encountered during the creation of a manual gazetteer for this corpus are examined. Specifically, the article demonstrates how and why such variation must be expected, particularly in the years preceding the standardisation of place-name spellings. It suggests how procedural developments may be undertaken to account for such georeferential issues in the Named Entity Recognition strategies employed by future projects. 
Similarly, the benefits of such multi-genre corpora in helping to complete onomastic records are also shown through examples of new name forms discovered for prominent sites in the Lake District. This focus is accompanied by a discussion of the influence of literary works on place-name standardisation (an aspect not typically accounted for in traditional onomastic study) to illustrate the extent to which authorial interests in regional toponymic histories can influence linguistic development.

    Under-explicit and minimally explicit reference: Evidence from a longitudinal case study

    This chapter reports on a 2½-year longitudinal case study of one Korean speaker of English, focusing on the development of her command of accessibility marking in referring to persons. The data are derived from informal, open interviews spanning the entire length of the participant’s enrolment in a Bachelor of Nursing programme in New Zealand. These interviews occurred every few weeks during semester (17 in total), and were typically between 45 minutes and one hour in length. The participant reported that she used these interviews as “a kind of reflective journal”, in which she discussed her classes, interactions with classmates, tutors and others, her assignments, and other experiences in New Zealand. The events she reported are rich in references to individuals. Using a previously reported coding scheme (Ryan, 2015), these data were analysed in relation to pragmatic felicity, particularly concerning the felicity of accessibility marking for referents of varying cognitive status in contexts of topic or focus continuity or shift. These data [yet to be analysed] provide evidence of the developmental progression of the participant’s command of reference in English. This chapter contributes substantially to the literature in several ways. In general, there has been a lack of longitudinal case studies of pragmatic development in any domain, including few – if any – previous longitudinal studies focusing on reference; the present analysis is therefore expected to reveal previously unreported details of the trajectory of pragmatic development in reference. The present study is also one of the few working with oral data that were generated in ways other than an elicited communication task. Finally, the study contributes to the still somewhat contentious issue of the extent to which mainstream study in an English-speaking context leads to genuine language gains.

    Qualities, objects, sorts, and other treasures : gold digging in English and Arabic

    In the present monograph, we will deal with questions of lexical typology in the nominal domain. By the term "lexical typology in the nominal domain", we refer to crosslinguistic regularities in the interaction between (a) those areas of the lexicon whose elements are capable of being used in the construction of "referring phrases" or "terms" and (b) the grammatical patterns in which these elements are involved. In the traditional analyses of a language such as English, such phrases are called "nominal phrases". In the study of the lexical aspects of the relevant domain, however, we will not confine ourselves to the investigation of "nouns" and "pronouns" but intend to take into consideration all those parts of speech which systematically alternate with nouns, either as heads or as modifiers of nominal phrases. In particular, this holds true for adjectives both in English and in other Standard European Languages. It is well known that adjectives are often difficult to distinguish from nouns, or that elements with an overt adjectival marker are used interchangeably with nouns, especially in particular semantic fields such as those denoting MATERIALS or NATIONALITIES. That is, throughout this work the expression "lexical typology in the nominal domain" should not be interpreted as "a typology of nouns", but, rather, as the cross-linguistic investigation of lexical areas constitutive for "referring phrases" irrespective of how the parts-of-speech system in a specific language is defined.

    Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

    This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past decade or so, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them. Comment: Published in Journal of AI Research (JAIR), volume 61, pp 75-170. 118 pages, 8 figures, 1 tabl

    PoCoS – Potsdam Coreference Scheme


    Improved Coreference Resolution Using Cognitive Insights

    Coreference resolution is the task of extracting referential expressions, or mentions, in text and clustering these by the entity or concept they refer to. The sustained research interest in the task reflects the richness of reference expression usage in natural language and the difficulty in encoding insights from linguistic and cognitive theories effectively. In this thesis, we design and implement LIMERIC, a state-of-the-art coreference resolution engine. LIMERIC naturally incorporates both non-local decoding and entity-level modelling to achieve the highly competitive performance of 64.22% and 59.99% on the CoNLL-2012 benchmark with a simple model and a baseline feature set. As well as strong performance, a key contribution of this work is a reconceptualisation of the coreference task. We draw an analogy between shift-reduce parsing and coreference resolution to develop an algorithm which naturally mimics cognitive models of human discourse processing. In our feature development work, we leverage insights from cognitive theories to improve our modelling. Each contribution achieves a statistically significant improvement, and together they sum to gains of 1.65% and 1.66% on the CoNLL-2012 benchmark, yielding performance values of 65.76% and 61.27%. For each novel feature we propose, we contribute an accompanying analysis so as to better understand how cognitive theories apply to real language data. LIMERIC is at once a platform for exploring cognitive insights into coreference and a viable alternative to current systems. We are excited by the promise of incorporating our and further cognitive insights into more complex frameworks, since this has the potential to improve both the performance of computational models and our understanding of the mechanisms underpinning human reference resolution.

    Acquiring Word-Meaning Mappings for Natural Language Interfaces

    This paper focuses on a system, WOLFIE (WOrd Learning From Interpreted Examples), that acquires a semantic lexicon from a corpus of sentences paired with semantic representations. The lexicon learned consists of phrases paired with meaning representations. WOLFIE is part of an integrated system that learns to transform sentences into representations such as logical database queries. Experimental results are presented demonstrating WOLFIE's ability to learn useful lexicons for a database interface in four different natural languages. The usefulness of the lexicons learned by WOLFIE is compared to that of lexicons acquired by a similar system, with results favorable to WOLFIE. A second set of experiments demonstrates WOLFIE's ability to scale to larger and more difficult, albeit artificially generated, corpora. In natural language acquisition, it is difficult to gather the annotated data needed for supervised learning; however, unannotated data is fairly plentiful. Active learning methods attempt to select for annotation and training only the most informative examples, and therefore are potentially very useful in natural language applications. However, most results to date for active learning have only considered standard classification tasks. To reduce annotation effort while maintaining accuracy, we apply active learning to semantic lexicons. We show that active learning can significantly reduce the number of annotated examples required to achieve a given level of performance.

    Syntactic and Semantic Patterns of Domain-specific Multiword Units in Marine Accident Investigation Reports

    The present study is a systematic corpus-based investigation of the domain-specific multiword units (henceforth MWUs) in marine accident investigation reports (henceforth MAIR), with a view to characterizing their most prominent syntactic, semantic and functional features. To achieve these objectives, the target MWUs were first identified by applying a new approach, which incorporates the notion of ‘meaning’ into statistically based measures. This method keeps the extraction as domain-specific as possible and provides valid data for the subsequent analysis. Using a three-dimensional analytical framework, this study obtained the following findings. First, the domain-specific MWUs are largely composed of two-word sequences, while 4- and 5-word MWUs are relatively rare. Among all the target MWUs, only 1.10% of the expressions occur very commonly within the genre (˚1,000 times). By contrast, the majority of the expressions (70.97%) occur fewer than 100 times. The skewed distribution indicates that the MAIR genre tends to employ a wide variety of domain-specific MWUs rather than repeating a small number of common expressions. Second, in terms of syntactic features, the NP structure is the most commonly employed grammatical type. The abundant use of this structure implies that the domain-specific meaning of the MAIR genre is largely carried by the nominal group. Apart from the NP structure, there is also a marked prevalence of VP structures among the domain-specific MWUs in the MAIR genre, and these MWUs show structural variation. Of all the VP-based patterns, the ‘verb phrase with active verb’ pattern stands out, since it incorporates a large number of action verbs, which are used to describe actions performed by people. 
The wide use of these phrases implies that the MAIR genre tends to highlight people’s roles during the accidents, with particular attention to what or who caused or performed the activity. Similarly, PP structures are also frequent among the domain-specific MWUs, especially the pattern beginning with the preposition of. This pattern is mostly used to specify possession. It can thus be inferred that the information provided in the MAIR genre tends to be concrete and specific. Third, a functional analysis of the target MWUs found that the primary function of the domain-specific MWUs is to express referential meanings and contribute to thematic development. Furthermore, due to their multifunctional nature, some referential MWUs also perform stance and discourse-organizing functions. When expressing stance, most MWUs express impersonal epistemic stance, with the purpose of minimizing the imposition of the reporters’ opinions. Other word sequences appear to be deontic in nature, as they are mainly realized by MWUs incorporating require or modal verbs. The primary function of these MWUs is to set out obligations and issue suggestions for the agents according to certain norms and regulations. When functioning as discourse organizers, the domain-specific MWUs usually adopt the pattern of ‘that-clause controlled by main verbs in active voice’ to introduce topics. By contrast, when used to elaborate topics, they tend to clarify logical relationships, especially the causative-resultative relation, rather than provide additional information. Fourth, the distinctive semantic features of the domain-specific MWUs are best reflected when these MWUs perform the functions of activity identification and specification. For instance, most domain-specific MWUs used for describing activities are general in nature, but they convey specialized meanings in the MAIR genre. 
Similarly, when domain-specific MWUs are used to provide tangible or intangible frames for specifying certain attributes, their use in the MAIR genre deviates significantly from their use in the general English register. In all, by offering insights into the salient features of the domain-specific MWUs in the MAIR genre, the present study may have implications for the following areas: the construction of extraction methods for domain-specific MWUs, the compilation of a maritime-specific MWU list, the teaching and learning of maritime English, especially the maritime-specific MWUs, and the provision of reference material for MAIR writers from non-native English-speaking countries.