8 research outputs found

    Parsing early and late modern English corpora

    We describe, evaluate, and improve the automatic annotation of diachronic corpora at the levels of word class, lemma, chunks, and dependency syntax. As corpora we use the ARCHER corpus (texts from 1600 to 2000) and the ZEN corpus (texts from 1660 to 1800). Performance on Modern English is considerably lower than on Present Day English (PDE). We present several methods that improve performance. First, we use the spelling normalization tool VARD to map spelling variants to their PDE equivalents, which improves tagging. We investigate the tagging changes that are due to the normalization and observe improvements, deterioration, and missing mappings. We then implement an optimized version, using VARD rules and preprocessing steps to improve normalization. We evaluate the improvement on parsing performance, comparing original text, standard VARD, and our optimized version. Over 90% of the normalization changes lead to improved parsing, and 17.3% of all 422 manually annotated sentences get a net improved parse. As a next step, we adapt the parser's grammar and add a semantic expectation model and a model for prepositional phrase (PP) attachment interaction to the parser. These extensions improve parser performance, marginally on PDE and more considerably on earlier texts: by 2–5% on PP-attachment relations (e.g. from 63.6 to 68.4% and from 70 to 72.9% on 17th century texts). Finally, we briefly outline linguistic applications and give two examples, gerundials and auxiliary verbs in the ZEN corpus, showing that despite high noise levels linguistic signals clearly emerge, opening new possibilities for large-scale research of gradient phenomena in language change.

    Challenges in Categorization : Corpus-based Studies of Adjectival Premodifiers in English

    This thesis draws together a series of articles on premodifying -ing participles and adjectives in English (e.g. "interesting", "advancing"). The studies are intended to contribute to our understanding of a variety of topics, including the meaning and function of participles and other adjectival premodifiers, their use in different registers, and their change over time. The overarching topic that connects all the articles thematically is linguistic categorization, which is here understood as a process of abstraction through which language users group linguistic elements together according to their form, meaning, function and patterns of use. Some of the articles discuss categories and categorization in terms of word classes (adjectives/verbs), while the focus of others is on semantic categorization (subjective/objective premodifiers) or the categorization of linguistic registers based on the distribution of premodified noun phrases. On the one hand, then, this thesis bears on the general discussion of the nature of linguistic categorization and category change. On the other hand, it continues a series of descriptions and analyses of adjectival premodifiers in contemporary research and the large reference grammars of Present-day English. One of the main findings of this thesis concerns the tendency of subjective adjectives, adjective phrases and nouns to be used with indefinite determination and in a complement role in discourse. This tendency is explained by a preferential mapping between subjectivity and new information, and the correlation is shown to have interesting uses in more practical tasks, such as semantic disambiguation, corpus annotation and the study of semantic change. Another important result is the tendency of degree modifiers to be used proportionally more often in predication than in attribution. 
These kinds of results support a usage-based approach to word classes, where categories like Verb or Adjective are regarded as emergent schemas that arise from actual patterns of use. The thesis also includes a wide-ranging survey of the relevant philosophical and linguistic literature on categorization. In my dissertation I examine various adjectival premodifiers in English from both a synchronic and a diachronic perspective. The main focus is on the categorization, meaning and use of different participial premodifiers, especially -ing participles (e.g. "interesting", "advancing"), but the component studies also address the use of -ed participles (e.g. "scared") and ordinary adjectives. The most important themes of the work are the subjectivity and subjectification of the meaning of adjectival words, the gradience of word classes, and the gradual category change of words (e.g. the gradual change of a verb-like -ing participle into an adjective). The research is based on English corpus data and covers the period from Early Modern English to Present-day English. The dissertation is strongly empirical, and one of its most important general results is the observed correlation between subjective meanings and certain kinds of constructions. Using corpus data, I have shown, among other things, that strongly subjective meanings are typically expressed in indefinite constructions in English. For example, "a much better result" is considerably more common in the data than "the much better result". Likewise, degree adverbs such as "very" and "extremely" occur significantly more often in predication than in attribution (e.g. "this is very nice" is more common than "a very nice idea").
I argue that such observations are relevant both for explaining language change and for the way word classes should be understood in linguistic theory: in my research, word classes are understood as abstractions (schemas) based on language users' experience, which are dynamic and can change over both long and shorter time spans. This idea is especially important for construction grammar, the framework I apply in the final component study of the dissertation.

    Measuring Syntactic Development in L2 Writing: Fine Grained Indices of Syntactic Complexity and Usage-Based Indices of Syntactic Sophistication

    Syntactic complexity has been an area of significant interest in L2 writing development studies over the past 45 years. Despite the regularity with which syntactic complexity measures have been employed, the construct is still relatively under-developed, and, as a result, the cumulative results of syntactic complexity studies can appear opaque. At least three reasons exist for the current state of affairs: the lack of consistency and clarity with which indices of syntactic complexity have been described, the overly broad nature of the indices that have been regularly employed, and the omission of indices that focus on usage-based perspectives. This study seeks to address these three gaps through the development and validation of the Tool for the Automatic Assessment of Syntactic Sophistication and Complexity (TAASSC). TAASSC measures large- and fine-grained clausal and phrasal indices of syntactic complexity and usage-based frequency/contingency indices of syntactic sophistication. Using TAASSC, this study will address L2 writing development in two main ways: through the examination of syntactic development longitudinally and through the examination of human judgments of writing proficiency (e.g., expert ratings of TOEFL essays). This study will have important implications for second language acquisition, second language writing, and language assessment.

    The BNC parsed with RASP4UIMA

    We have integrated the RASP system with the UIMA framework (RASP4UIMA) and used this to parse the XML-encoded version of the British National Corpus (BNC). All original annotation is preserved, and parsing information, mainly in the form of grammatical relations, is added in an XML format. A few specific adaptations of the system to give better results with the BNC are discussed briefly. The RASP4UIMA system is publicly available and can be used to parse other corpora or document collections, and the final parsed version of the BNC will be deposited with the Oxford Text Archive.