685 research outputs found

    A robust unification-based parser for Chinese natural language processing.

    Get PDF
    Chan Shuen-ti Roy.Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.Includes bibliographical references (leaves 168-175).Abstracts in English and Chinese.Chapter 1. --- Introduction --- p.12Chapter 1.1. --- The nature of natural language processing --- p.12Chapter 1.2. --- Applications of natural language processing --- p.14Chapter 1.3. --- Purpose of study --- p.17Chapter 1.4. --- Organization of this thesis --- p.18Chapter 2. --- Organization and methods in natural language processing --- p.20Chapter 2.1. --- Organization of natural language processing system --- p.20Chapter 2.2. --- Methods employed --- p.22Chapter 2.3. --- Unification-based grammar processing --- p.22Chapter 2.3.1. --- Generalized Phase Structure Grammar (GPSG) --- p.27Chapter 2.3.2. --- Head-driven Phrase Structure Grammar (HPSG) --- p.31Chapter 2.3.3. --- Common drawbacks of UBGs --- p.33Chapter 2.4. --- Corpus-based processing --- p.34Chapter 2.4.1. --- Drawback of corpus-based processing --- p.35Chapter 3. --- Difficulties in Chinese language processing and its related works --- p.37Chapter 3.1. --- A glance at the history --- p.37Chapter 3.2. --- Difficulties in syntactic analysis of Chinese --- p.37Chapter 3.2.1. --- Writing system of Chinese causes segmentation problem --- p.38Chapter 3.2.2. --- Words serving multiple grammatical functions without inflection --- p.40Chapter 3.2.3. --- Word order of Chinese --- p.42Chapter 3.2.4. --- The Chinese grammatical word --- p.43Chapter 3.3. --- Related works --- p.45Chapter 3.3.1. --- Unification grammar processing approach --- p.45Chapter 3.3.2. --- Corpus-based processing approach --- p.48Chapter 3.4. --- Restatement of goal --- p.50Chapter 4. --- SERUP: Statistical-Enhanced Robust Unification Parser --- p.54Chapter 5. --- Step One: automatic preprocessing --- p.57Chapter 5.1. --- Segmentation of lexical tokens --- p.57Chapter 5.2. --- "Conversion of date, time and numerals" --- p.61Chapter 5.3. --- Identification of new words --- p.62Chapter 5.3.1. --- Proper nouns ´ؤ Chinese names --- p.63Chapter 5.3.2. --- Other proper nouns and multi-syllabic words --- p.67Chapter 5.4. --- Defining smallest parsing unit --- p.82Chapter 5.4.1. --- The Chinese sentence --- p.82Chapter 5.4.2. --- Breaking down the paragraphs --- p.84Chapter 5.4.3. --- Implementation --- p.87Chapter 6. --- Step Two: grammar construction --- p.91Chapter 6.1. --- Criteria in choosing a UBG model --- p.91Chapter 6.2. --- The grammar in details --- p.92Chapter 6.2.1. --- The PHON feature --- p.93Chapter 6.2.2. --- The SYN feature --- p.94Chapter 6.2.3. --- The SEM feature --- p.98Chapter 6.2.4. --- Grammar rules and features principles --- p.99Chapter 6.2.5. --- Verb phrases --- p.101Chapter 6.2.6. --- Noun phrases --- p.104Chapter 6.2.7. --- Prepositional phrases --- p.113Chapter 6.2.8. --- """Ba2"" and ""Bei4"" constructions" --- p.115Chapter 6.2.9. --- The terminal node S --- p.119Chapter 6.2.10. --- Summary of phrasal rules --- p.121Chapter 6.2.11. --- Morphological rules --- p.122Chapter 7. --- Step Three: resolving structural ambiguities --- p.128Chapter 7.1. --- Sources of ambiguities --- p.128Chapter 7.2. --- The traditional practices: an illustration --- p.132Chapter 7.3. --- Deficiency of current practices --- p.134Chapter 7.4. --- A new point of view: Wu (1999) --- p.140Chapter 7.5. --- Improvement over Wu (1999) --- p.142Chapter 7.6. --- Conclusion on semantic features --- p.146Chapter 8. --- "Implementation, performance and evaluation" --- p.148Chapter 8.1. --- Implementation --- p.148Chapter 8.2. --- Performance and evaluation --- p.150Chapter 8.2.1. --- The test set --- p.150Chapter 8.2.2. --- Segmentation of lexical tokens --- p.150Chapter 8.2.3. --- New word identification --- p.152Chapter 8.2.4. --- Parsing unit segmentation --- p.156Chapter 8.2.5. --- The grammar --- p.158Chapter 8.3. --- Overall performance of SERUP --- p.162Chapter 9. --- Conclusion --- p.164Chapter 9.1. --- Summary of this thesis --- p.164Chapter 9.2. --- Contribution of this thesis --- p.165Chapter 9.3. --- Future work --- p.166References --- p.168Appendix I --- p.176Appendix II --- p.181Appendix III --- p.18

    Compounding in Namagowab and English: (exploring meaning creation in compounds)

    Get PDF
    This essay investigates compounding in Namagowab and English, which belong to two widely divergent groups of languages, the Khoesan and Indo-European, respectively. The first motive is to investigate how and why new words are created from existing ones. The reading and data interpretation seeks an understanding of word formation and an overview of semantic compositionality, structure and productivity, within the broad context of cognitive, lexicalist and distributed morphology paradigms. This coupled with history reading about the languages and its people, is used to speculate about why compounds feature in lexical creation. Compounding is prevalent in both languages and their distance in terms of phylogenetic relationships should allow limited generalizing about these processes of formation. Word lists taken from dictionaries in both languages were analyzed by entering the words in Excel spreadsheets so that various attributes of these words, such as word type, compound class (Noun, Verb, Preposition, Adjective and Adverb) and constituent class could be counted, and described with formulae, and compound and constituent meaning analyzed. The conclusion was that socio historical factors such as language contact, and aspects of cognition such as memory and transparency, account for compounding in a language in addition to typology

    Representation and parsing of multiword expressions

    Get PDF
    This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs such as verbal, adverbial and nominal MWEs, various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages including English, French, Modern Greek, Hebrew, Norwegian), and various applications (namely MWE detection, parsing, automatic translation) using both symbolic and statistical approaches

    Current trends

    Get PDF
    Deep parsing is the fundamental process aiming at the representation of the syntactic structure of phrases and sentences. In the traditional methodology this process is based on lexicons and grammars representing roughly properties of words and interactions of words and structures in sentences. Several linguistic frameworks, such as Headdriven Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), etc., offer different structures and combining operations for building grammar rules. These already contain mechanisms for expressing properties of Multiword Expressions (MWE), which, however, need improvement in how they account for idiosyncrasies of MWEs on the one hand and their similarities to regular structures on the other hand. This collaborative book constitutes a survey on various attempts at representing and parsing MWEs in the context of linguistic theories and applications

    Spanish generation in the NLP system 'LOLITA'

    Get PDF
    The aim of this research has been to modify the NLG module in the NLP system LOLITA to enable it to produce Spanish utterances. Natural Language Generation (NLG) is the production of text in a surface language by the computer in order to meet communicative goals. The NLG module of LOLITA is currently able to generate English utterances. It provides the generation capabilities required for the prototype applications built onto LOLITA. The module also aids in the development and debugging of the system as NL utterances are easier to understand than the semantic network representation. The LOLITA generator receives as in put the whole LOLITA semantic network, 'SemNet',(the system knowledge base) and adopts the traditional two components architecture. However, the distribution of task between the planner and plan-realiser (planner and realiser in other systems) differs from that in traditional systems as the plan-realiser can perform tasks such as the selection of content traditionally performed by a planner. The Spanish generator is based upon the same theoretical principles as the current English generator. SemNet forms the input of the generator and has been expanded for this purpose by the addition of Spanish lexical entries and information associated with them. The existing planner module has been used while the plan-realiser has been modified by developing new solutions where the existing ones were not adequate for producing correct Spanish utterances. The generator has been implemented in the pure functional language Haskell, taking advantage of several features of this language and, like LOLITA, it has been built following Natural Language Engineering principles. These two aspects influencing the research are also described

    Gradient Metaphoricity of the Preposition in: A Corpus-based Approach to Chinese Academic Writing in English

    Get PDF
    In Cognitive Linguistics, a conceptual metaphor is a systematic set of correspondences between two domains of experience (Kövecses 2020: 2). In order to have an extensive understanding of metaphors, metaphoricity (Müller and Tag 2010; Dunn 2011; Jensen and Cuffari 2014; Nacey and Jensen 2017) has been emphasized to address one of the properties of metaphors in language usage: gradience (Hanks 2006; Dunn 2011, 2014), which indicates that metaphorical expressions can be measured. Despite many noteworthy contributions, studies of metaphoricity are often accused of subjectivity (Müller 2008; Jensen and Cuffari 2014; Jensen 2017), this is why this study uses a big corpus as a database. Therefore, the main aim of this dissertation is to measure the gradient senses of the preposition in in an objective way, thus mapping the highly systematic semantic extension. Based on these gradient senses, the semantic and syntactic features of the preposition in produced by advanced Chinese English-major learners are investigated, combining quantitative and qualitative research methods. A quantitative analysis of the literal and other ten metaphorical senses of the preposition in is made at first. In accounting for the five factors influencing image schemata of each sense: “scale of Landmark”, “visibility”, “path”, “inclusion” and “boundary”, the formula of measuring the gradability of metaphorical degree is deduced: Metaphoricity=[[#Visibility] +[#Path] +[#Inclusion] +[#Boundary]]*[#Scale of Landmark]. The result is that the primary sense has the highest value:12, and all other extended senses have values down to zero. The more shared features with proto-scene, the higher the value of the metaphorical sense, and the less metaphorical the sense. EVENT and PERSON are the “least metaphoric” (value = 9-11); SITUATION, NUMBER, CONTENT and FIELD are “weak metaphoric” (value = 6-8); Also included are SEGMENTATION, TIME and MANNER (value = 3-5), and they are “strong metaphoric”; PURPOSE shares the least feature with proto-scene, and it has the lowest value, so it is “most metaphoric” (value = 0-2). Then, a corpus-based approach is employed, which offers a model for employing a corpus-based approach in Cognitive Linguistics. It compares two compiled sub-corpora: Chinese Master Academic Writing Corpus and Chinese Doctorate Academic Writing Corpus. The findings show that, on the semantic level, Chinese English-major students overuse in with a low level of metaphoricity, even advanced learners use the most metaphorical in rarely. In terms of syntactic behaviours, the most frequent nouns in [in+noun] construction are weakly metaphoric, whilst the nouns in the construction [in the noun of] are EVENT sense, which is least metaphorical. Moreover, action verbs tend to be used in the construction [verb+in] and [in doing sth.] in both master and doctorate groups. In the qualitative study, the divergent usages of the preposition in are explored. The preposition in is often substituted with other prepositions, such as on and at. The fundamental reason for the Chinese learners’ weakness is the negative transfer from their mother tongue (Wang 2001; Gong 2007; Zhang 2010). Although in and its Chinese equivalence zai...li (在...里) share the same proto-scene, there are discrepancies: the metaphorical senses of the preposition in are TIME, PURPOSE, NUMBER, CONTENT, FIELD, EVENT, SITUATION, SEGMENTATION, MANNER, PERSON, while those of zai...li (在...里) are only five: TIME, CONTENT, EVENT, SITUATION and PERSON. Thus the image schemata of each sense cannot be correspondingly mapped onto each other in different languages. This study also provides evidence for the universality and variation of spatial metaphors on the ground of cultural models. Philosophically, it supports the standpoint of Embodiment philosophy that abstract concepts are constructed on the basis of spatial metaphors that are grounded in the physical and cultural experience

    A corpus-based error analysis of Korean learner English: from a cognitive linguistic perspective to the L2 mental lexicon

    Get PDF
    This thesis investigates lexical errors in a learner corpus that consists of essays written by Korean learners of English. Taking a cognitive linguistic perspective, it first presents the L2 lexical development model as a conceptual framework. Then, based on this framework, it proposes a new error taxonomy in which errors from the four lexical domains in the L2 mental lexicon can be categorised as either interlingual or intralingual errors. In order to identify whether the taxonomy is well-grounded, this thesis selects four error features, one from each of the four lexical domains: collocational errors of dimensional adjectives in the semantic domain; over-passivisation errors of non-alternating unaccusative verbs in the syntactic domain; derivational morphological errors in the morphological domain; spelling errors in the phonological/orthographic domain. The results, obtained through corpus-based error analysis, suggest that there is evidence of both interlingual and intralingual influences on the four error features from the four lexical domains. Based on the findings, this study provides pedagogical implications for English classrooms in Korea and recommends that the findings should be used to improve teaching materials and to raise awareness of the influences on the lexical errors

    Verbs in the written English of Chinese learners : a corpus-based comparison between non-native speakers and native speakers

    Get PDF
    This thesis consists of ten chapters and its research methodology is a combination of quantitative and qualitative. Chapter One introduces the theme of the thesis, a demonstration of a corpus-based comparative approach in detecting the needs of the learners by looking for the similarities and disparities between the learner English (the COLEC corpus) and the NS English (the LOCNESS corpus). Chapter Two reviews the literature in relevant learner language studies and indicates the tasks of the research. The data and technology are introduced in Chapter Three. Chapter Four shows how two verb lemma lists can be made by using the Wordsmith Tools supported by other corpus and IT tools. How to make sense of the verb lemma lists is the focus of the second part of this chapter. Chapter Five deals with the individual forms of verbs and the findings suggest that there is less homogeneity in the learner English than the NS English. Chapter Six extends the research to verb–noun relationships in the learner English and the NS English and the result shows that the learners prioritise verbs over nouns. Chapter Seven studies the learners’ preferences in using the patterns of KEEP compared with those of the NSs, and finds that the learners have various problems in using this simple verb. In this chapter, too, my reservations about the traditional use of ‘overuse’ and ‘underuse’ are expressed and a finer classification system is suggested. Chapter Eight compares another frequently-occurring verb, TAKE, in the aspect of collocates and yields similar findings that the learners have problems even with such simple vocabulary. In Chapter Nine, the research findings from Chapter Four to Chapter Eight are revisited and discussed in relation to the theme of the thesis. The concluding chapter, Chapter Ten, summarises the previous chapters and envisages how learner language studies will develop in the coming few years
    • …
    corecore