A Corpus-Based Investigation of Definite Description Use
We present the results of a study of definite description use in written
texts aimed at assessing the feasibility of annotating corpora with information
about definite description interpretation. We ran two experiments, in which
subjects were asked to classify the uses of definite descriptions in a corpus
of 33 newspaper articles, containing a total of 1412 definite descriptions. We
measured the agreement among annotators about the classes assigned to definite
descriptions, as well as the agreement about the antecedent assigned to those
definites that the annotators classified as being related to an antecedent in
the text. The most interesting result of this study from a corpus annotation
perspective was the rather low agreement (K=0.63) that we obtained using
versions of Hawkins' and Prince's classification schemes; better results
(K=0.76) were obtained using the simplified scheme proposed by Fraurud that
includes only two classes, first-mention and subsequent-mention. The agreement
about antecedents was also not complete. These findings raise questions
concerning the strategy of evaluating systems for definite description
interpretation by comparing their results with a standardized annotation. From
a linguistic point of view, the most interesting observations were the great
number of discourse-new definites in our corpus (in one of our experiments,
about 50% of the definites in the collection were classified as discourse-new,
30% as anaphoric, and 18% as associative/bridging) and the presence of
definites which did not seem to require a complete disambiguation.
Comment: 47 pages, uses fullname.sty and palatino.st
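The agreement figures quoted above (K=0.63, K=0.76) are chance-corrected agreement coefficients. As a rough illustration of how such a coefficient behaves, the following is a minimal sketch of Cohen's kappa for two annotators. This is an assumption made for illustration: the study may have used a different multi-annotator formulation, and the label sequences below are invented, not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Proportion of items on which the two annotators chose the same class.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's class frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations using a two-class scheme (first vs. subsequent mention).
annotator_1 = ["first", "first", "subsequent", "subsequent",
               "first", "subsequent", "first", "subsequent"]
annotator_2 = ["first", "first", "subsequent", "first",
               "first", "subsequent", "subsequent", "subsequent"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy case the annotators agree on 6 of 8 items (0.75), but half of that agreement is expected by chance, so the coefficient is noticeably lower than the raw agreement rate.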
Alts, Abbreviations, and AKAs: historical onomastic variation and automated named entity recognition
The accurate automated identification of named places is a major concern for scholars in the digital humanities, and especially for those engaged in research that depends upon the gazetteer-led recognition of specific aspects. The field of onomastics examines the linguistic roots and historical development of names, which have for the most part only standardised into single officially recognised forms since the late nineteenth century. Even slight spelling variations can introduce errors in geotagging techniques, and these differences in place-name spellings are thus vital considerations when seeking high rates of correct geospatial identification in historical texts. This article offers an overview of typical name-based variation that can cause issues in the accurate geotagging of any historical resource. The article argues that the careful study and documentation of these variations can assist in the development of more complete onymic records, which in turn may inform geotaggers through a cycle of variational recognition. It demonstrates how patterns in regional naming variation and development, across both specific and generic name elements, can be identified through the historical records of each known location. The article uses examples taken from a digitised corpus of writing about the English Lake District, a collection of 80 texts that date from between 1622 and 1900. Four of the more complex spelling-based problems encountered during the creation of a manual gazetteer for this corpus are examined. Specifically, the article demonstrates how and why such variation must be expected, particularly in the years preceding the standardisation of place-name spellings. It suggests how procedural developments may be undertaken to account for such georeferential issues in the Named Entity Recognition strategies employed by future projects. 
Similarly, the benefits of such multi-genre corpora in helping to complete onomastic records are also shown through examples of new name forms discovered for prominent sites in the Lake District. This focus is accompanied by a discussion of the influence of literary works on place-name standardisation – an aspect not typically accounted for in traditional onomastic study – to illustrate the extent to which authorial interests in regional toponymic histories can influence linguistic development.
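One simple way to make a gazetteer lookup tolerate spelling variation is approximate string matching. The sketch below uses Python's standard `difflib` module and is only an illustration, not the procedure the article proposes. The gazetteer entries, the variant spellings, the `match_place` helper, and the 0.75 cutoff are all assumptions chosen for the example.

```python
import difflib

# Hypothetical modern gazetteer forms; a real gazetteer would be far larger.
GAZETTEER = ["Windermere", "Derwentwater", "Keswick", "Ullswater"]


def match_place(token, gazetteer=GAZETTEER, cutoff=0.75):
    """Return the gazetteer form closest to `token`, or None if nothing is close."""
    # Compare case-insensitively but return the canonical spelling.
    lowered = {name.lower(): name for name in gazetteer}
    hits = difflib.get_close_matches(token.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


# Illustrative variant spellings, as they might appear in a historical text.
for variant in ["Winandermere", "Ulswater", "London"]:
    print(variant, "->", match_place(variant))
```

A fixed similarity cutoff is a blunt instrument: set too low it links unrelated names, set too high it misses genuine historical forms, which is one reason documented variant lists remain valuable.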
Under-explicit and minimally explicit reference: Evidence from a longitudinal case study
This chapter reports on a 2½-year longitudinal case study of one Korean speaker of English, focusing on the development of her command of accessibility marking in referring to persons. The data are derived from informal, open interviews spanning the entire length of the participant’s enrolment in a Bachelor of Nursing programme in New Zealand. These interviews occurred every few weeks during semester (17 in total), and were typically between 45 minutes and one hour in length. The participant reported that she used these interviews as “a kind of reflective journal”, in which she discussed her classes, interactions with classmates, tutors and others, her assignments, and other experiences in New Zealand. The events she reported are rich in references to individuals.
Using a previously reported coding scheme (Ryan, 2015), these data were analysed in relation to pragmatic felicity, particularly concerning the felicity of accessibility marking for referents of varying cognitive status in contexts of topic or focus continuity or shift. These data [yet to be analysed] provide evidence of the developmental progression of the participant’s command of reference in English.
This chapter contributes substantially to the literature in several ways. In general, there has been a lack of longitudinal case studies of pragmatic development in any domain, with few – if any – previous longitudinal studies focusing on reference; the present analysis is therefore expected to reveal previously unreported details of the trajectory of pragmatic development in reference. The present study is also one of the few working with oral data that were generated in ways other than an elicited communication task. Finally, the study contributes to the still somewhat contentious issue of the extent to which mainstream study in an English-speaking context leads to genuine language gains.
Qualities, objects, sorts, and other treasures: gold digging in English and Arabic
In the present monograph, we will deal with questions of lexical typology in the nominal domain. By the term "lexical typology in the nominal domain", we refer to crosslinguistic regularities in the interaction between (a) those areas of the lexicon whose elements are capable of being used in the construction of "referring phrases" or "terms" and (b) the grammatical patterns in which these elements are involved. In the traditional analyses of a language such as English, such phrases are called "nominal phrases". In the study of the lexical aspects of the relevant domain, however, we will not confine ourselves to the investigation of "nouns" and "pronouns" but intend to take into consideration all those parts of speech which systematically alternate with nouns, either as heads or as modifiers of nominal phrases. In particular, this holds true for adjectives both in English and in other Standard European Languages. It is well known that adjectives are often difficult to distinguish from nouns, or that elements with an overt adjectival marker are used interchangeably with nouns, especially in particular semantic fields such as those denoting MATERIALS or NATIONALITIES. That is, throughout this work the expression "lexical typology in the nominal domain" should not be interpreted as "a typology of nouns", but, rather, as the cross-linguistic investigation of lexical areas constitutive for "referring phrases" irrespective of how the parts-of-speech system in a specific language is defined.
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
This paper surveys the current state of the art in Natural Language
Generation (NLG), defined as the task of generating text or speech from
non-linguistic input. A survey of NLG is timely in view of the changes that the
field has undergone over the past decade or so, especially in relation to new
(usually data-driven) methods, as well as new applications of NLG technology.
This survey therefore aims to (a) give an up-to-date synthesis of research on
the core tasks in NLG and the architectures adopted in which such tasks are
organised; (b) highlight a number of relatively recent research topics that
have arisen partly as a result of growing synergies between NLG and other areas
of artificial intelligence; (c) draw attention to the challenges in NLG
evaluation, relating them to similar challenges faced in other areas of Natural
Language Processing, with an emphasis on different evaluation methods and the
relationships between them.
Comment: Published in Journal of AI Research (JAIR), volume 61, pp 75-170. 118 pages, 8 figures, 1 tabl
Improved Coreference Resolution Using Cognitive Insights
Coreference resolution is the task of extracting referential expressions, or mentions, in text and clustering these by the entity or concept they refer to. The sustained research interest in the task reflects the richness of referring expression usage in natural language and the difficulty of encoding insights from linguistic and cognitive theories effectively. In this thesis, we design and implement LIMERIC, a state-of-the-art coreference resolution engine. LIMERIC naturally incorporates both non-local decoding and entity-level modelling to achieve the highly competitive benchmark performance of 64.22% and 59.99% on the CoNLL-2012 benchmark with a simple model and a baseline feature set. As well as strong performance, a key contribution of this work is a reconceptualisation of the coreference task. We draw an analogy between shift-reduce parsing and coreference resolution to develop an algorithm which naturally mimics cognitive models of human discourse processing. In our feature development work, we leverage insights from cognitive theories to improve our modelling. Each contribution achieves statistically significant improvements, and together they sum to gains of 1.65% and 1.66% on the CoNLL-2012 benchmark, yielding performance values of 65.76% and 61.27%. For each novel feature we propose, we contribute an accompanying analysis so as to better understand how cognitive theories apply to real language data. LIMERIC is at once a platform for exploring cognitive insights into coreference and a viable alternative to current systems. We are excited by the promise of incorporating our own and further cognitive insights into more complex frameworks, since this has the potential to improve both the performance of computational models and our understanding of the mechanisms underpinning human reference resolution.
Acquiring Word-Meaning Mappings for Natural Language Interfaces
This paper focuses on a system, WOLFIE (WOrd Learning From Interpreted
Examples), that acquires a semantic lexicon from a corpus of sentences paired
with semantic representations. The lexicon learned consists of phrases paired
with meaning representations. WOLFIE is part of an integrated system that
learns to transform sentences into representations such as logical database
queries. Experimental results are presented demonstrating WOLFIE's ability to
learn useful lexicons for a database interface in four different natural
languages. The usefulness of the lexicons learned by WOLFIE is compared to
that of lexicons acquired by a similar system, with results favorable to WOLFIE. A second
set of experiments demonstrates WOLFIE's ability to scale to larger and more
difficult, albeit artificially generated, corpora. In natural language
acquisition, it is difficult to gather the annotated data needed for supervised
learning; however, unannotated data is fairly plentiful. Active learning
methods attempt to select for annotation and training only the most informative
examples, and therefore are potentially very useful in natural language
applications. However, most results to date for active learning have only
considered standard classification tasks. To reduce annotation effort while
maintaining accuracy, we apply active learning to semantic lexicons. We show
that active learning can significantly reduce the number of annotated examples
required to achieve a given level of performance.
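The idea of selecting only the most informative examples for annotation can be illustrated with least-confidence sampling, a common pool-based active learning heuristic. The sketch below is generic and is not the selection criterion used by WOLFIE. The function name, the toy example identifiers, and the probability scores are all hypothetical.

```python
def select_for_annotation(pool, predict_proba, batch_size):
    """Pick the `batch_size` unlabeled examples the current model is least sure about.

    `predict_proba` maps an example to a list of class probabilities; an example
    whose highest probability is low is treated as more informative to annotate.
    """
    def confidence(example):
        return max(predict_proba(example))

    # Sort by ascending confidence and keep the least confident examples.
    return sorted(pool, key=confidence)[:batch_size]


# Toy stand-in for a trained model: fixed probabilities per unlabeled sentence id.
toy_scores = {
    "s1": [0.95, 0.05],
    "s2": [0.55, 0.45],
    "s3": [0.70, 0.30],
    "s4": [0.51, 0.49],
}
chosen = select_for_annotation(list(toy_scores), toy_scores.__getitem__, 2)
print(chosen)
```

In a full loop, the chosen examples would be annotated, added to the training set, and the model retrained before the next selection round.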
Syntactic and Semantic Patterns of Domain-specific Multiword Units in Marine Accident Investigation Reports
The present study is a systematic corpus-based investigation of the domain-specific multiword units
(henceforth MWUs) in marine accident investigation reports (henceforth MAIR), with a view to
characterizing their most prominent syntactic, semantic and functional features.
To achieve these principal objectives, the target MWUs were first identified by applying a new
approach, which incorporates the notion of ‘meaning’ into statistically based measures. This method
extracts domain-specific MWUs as fully as possible and provides valid data for the
subsequent analysis. By applying a three-dimensional analytical framework, this study
obtained the following findings:
First, the domain-specific MWUs are largely composed of two-word sequences, while the occurrences
of 4- and 5-word MWUs are relatively rare. Among all the target MWUs, only 1.10% of the expressions
occur very commonly within the genre (over 1,000 times). By contrast, the majority of the expressions
(70.97%) occur with a frequency of less than 100 times. This skewed distribution indicates that the MAIR
genre tends to employ a wide variety of domain-specific MWUs rather than repeating a
small number of common expressions.
Second, in terms of the syntactic features of the domain-specific MWUs, the NP structure is the most commonly employed grammatical type. The abundant use of this structure implies that the domain-specific meaning of the MAIR genre is largely carried in the nominal group. Apart from the NP structure, there is also a marked prevalence of VP structures among the domain-specific MWUs in the MAIR genre, and these MWUs show structural variation. Of all the VP-based patterns, the ‘verb phrase with active verb’ pattern stands out, since it incorporates a large number of action verbs, which are used to describe actions performed by people. The wide use of these phrases implies that the MAIR genre tends to highlight people’s roles during the accidents, with particular attention to who or what caused or performed the activity. Similarly, PP structures are also frequent among the domain-specific MWUs, especially the pattern beginning with the preposition ‘of’. This pattern is mostly used to specify possession. It can thus be inferred that the information provided in the MAIR genre tends to be concrete and specific.
Third, a functional analysis of the target MWUs shows that the primary function of the domain-specific MWUs is to express referential meanings and contribute to thematic development. Furthermore, owing to their multifunctional nature, some referential MWUs also express stance and organize discourse. When expressing stance, most MWUs express impersonal epistemic stance, with the purpose of minimizing the imposition of the reporters’ opinions. Other word sequences appear to be deontic in nature, as they are mainly realized by MWUs incorporating ‘require’ or modal verbs. The primary function of these MWUs is to set out obligations and issue suggestions for the agents according to certain norms and regulations. When functioning as discourse organizers, the domain-specific MWUs usually adopt the pattern of ‘that-clause controlled by main verbs in active voice’ to introduce topics. By contrast, when used to elaborate on topics, they tend to clarify logical relationships, especially the causative-resultative relation, rather than provide additional information.
Fourth, the distinctive semantic features of the domain-specific MWUs are best reflected when these MWUs perform the functions of activity identification and specification. For instance, most domain-specific MWUs used for describing activities are general in nature, but they convey specialized meaning in the MAIR genre. Similarly, when domain-specific MWUs are used to provide tangible or intangible frames for specifying certain attributes, their use in the MAIR genre deviates significantly from their use in the general English register.
In all, by offering insights into the salient features of the domain-specific MWUs in the MAIR genre, the present study may contribute to the following areas: the construction of an extraction method for domain-specific MWUs, the compilation of a maritime-specific MWU list, the teaching and learning of maritime English, especially maritime-specific MWUs, and the provision of a reference for experts from non-native English-speaking countries who write MAIRs.
Abstract
List of Tables
List of Figures
Chapter 1 Introduction
1.1. Background of this study
1.2. Objectives of this study
1.3. Significance of this study
1.4. Terminological issues
1.5. Organization of this dissertation
Chapter 2 Theoretical background
2.1. Understanding the notions of phraseology
2.1.1. An overview of influential notions of phraseology
2.1.2. Parameters of defining MWUs
2.1.3. Operational definition of MWUs
2.1.4. An overview of influential taxonomy of phraseology
2.2. Theoretical discussion of MWUs
2.2.1. Theoretical framework of this study
2.2.2. Nature of multiword units
2.2.3. Previous studies of phraseology
Chapter 3 Analytical framework and research design
3.1. Analytical framework
3.1.1. Analytical framework for syntactic features of domain-specific MWUs
3.1.2. Analytical framework for semantic features of domain-specific MWUs
3.1.3. Analytical framework for functional features of domain-specific MWUs
3.2. Research questions
3.3. Corpora used in this study
3.3.1. Corpus of Marine Accident Investigation Reports (COMAIR)
3.3.2. British National Corpus Baby (BNC Baby)
3.4. Tools and procedures for data analysis
3.4.1. Tools for data processing
3.4.2. Procedures for data analysis
3.4.3. Inter-rater reliability
3.5. Summary
Chapter 4 Identification of domain-specific MWUs in the COMAIR
4.1. Current approaches to MWU extraction
4.2. My proposed approach to domain-specific MWU extraction
4.3. The detailed process of domain-specific MWU extraction
4.3.1. Step 1: N-gram retrieval
4.3.2. Step 2: Keyword-gram extraction
4.3.3. Step 3: Measuring the association strength of keyword-grams
4.3.4. Step 4: Filtering out process
4.3.5. Step 5: Domain-specific MWU identification
Chapter 5 Frequency distributions and syntactic features of domain-specific MWUs
5.1. Frequency distributions of domain-specific MWUs
5.1.1. Frequency distributions of domain-specific MWUs in various lengths
5.1.2. Overall frequency distribution across different frequency bands
5.2. Syntactic features of domain-specific MWUs
Chapter 6 Functional and semantic features of domain-specific MWUs
6.1. Distributions across primary discourse functions
6.2. Multiple functioning
6.3. Stance MWUs
6.3.1. Notion of stance MWUs
6.3.2. Stance MWUs in COMAIR
6.4. Discourse organizing MWUs
6.4.1. Notion of discourse organizing MWUs
6.4.2. Discourse organizing MWUs in COMAIR
6.5. Referential MWUs
6.5.1. Notion of referential MWUs
6.5.2. Referential MWUs in COMAIR
6.6. Summary
Chapter 7 Conclusions and implications
7.1. Summary of the major findings
7.2. Implications of this study
7.3. Limitations of this study
References
Appendix
Docto