4,068 research outputs found

    Parsing early and late modern English corpora

    We describe, evaluate, and improve the automatic annotation of diachronic corpora at the levels of word-class, lemma, chunks, and dependency syntax. As corpora we use the ARCHER corpus (texts from 1600 to 2000) and the ZEN corpus (texts from 1660 to 1800). Performance on Modern English is considerably lower than on Present Day English (PDE). We present several methods that improve performance. First we use the spelling normalization tool VARD to map spelling variants to their PDE equivalent, which improves tagging. We investigate the tagging changes that are due to the normalization and observe improvements, deterioration, and missing mappings. We then implement an optimized version, using VARD rules and preprocessing steps to improve normalization. We evaluate the improvement on parsing performance, comparing original text, standard VARD, and our optimized version. Over 90% of the normalization changes lead to improved parsing, and 17.3% of all 422 manually annotated sentences get a net improved parse. As a next step, we adapt the parser's grammar, add a semantic expectation model and a model for prepositional phrase (PP) attachment interaction to the parser. These extensions improve parser performance, marginally on PDE, more considerably on earlier texts: 2-5% on PP-attachment relations (e.g. from 63.6 to 68.4% and from 70 to 72.9% on 17th century texts). Finally, we briefly outline linguistic applications and give two examples: gerundials and auxiliary verbs in the ZEN corpus, showing that despite high noise levels linguistic signals clearly emerge, opening new possibilities for large-scale research of gradient phenomena in language change.

    An automatic part-of-speech tagger for Middle Low German

    Syntactically annotated corpora are highly important for enabling large-scale diachronic and diatopic language research. Such corpora have recently been developed for a variety of historical languages, or are still under development. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which is aimed at facilitating research into the highly under-researched diachronic syntax of Low German. The present paper reports on a crucial step in creating the corpus, viz. the creation of a part-of-speech tagger for Middle Low German (MLG). Having been transmitted in several non-standardised written varieties, MLG poses a challenge to standard POS taggers, which usually rely on normalized spelling. We outline the major issues faced in the creation of the tagger and present our solutions to them.

    Clitic placement in old and modern Spanish: a dynamic account

    Tracing Syntactic Change in the Scientific Genre: Two Universal Dependency-parsed Diachronic Corpora of Scientific English and German

    We present two comparable diachronic corpora of scientific English and German from the Late Modern Period (17th c.–19th c.) annotated with Universal Dependencies. We describe several steps of data pre-processing and evaluate the resulting parsing accuracy showing how our pre-processing steps significantly improve output quality. As a sanity check for the representativity of our data, we conduct a case study comparing previously gained insights on grammatical change in the scientific genre with our data. Our results reflect the often reported trend of English scientific discourse towards heavy noun phrases and a simplification of the sentence structure (Halliday, 1988; Halliday and Martin, 1993; Biber and Gray, 2011; Biber and Gray, 2016). We also show that this trend applies to German scientific discourse as well. The presented corpora are valuable resources suitable for the contrastive analysis of syntactic diachronic change in the scientific genre between 1650 and 1900. The presented pre-processing procedures and their evaluations are applicable to other languages and can be useful for a variety of Natural Language Processing tasks such as syntactic parsing. This work is supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 232722074 – SFB 1102.

    Towards a grapho-phonologically parsed corpus of medieval Scots: Database design and technical solutions

    This paper presents a newly constructed corpus of sound-to-spelling mappings in medieval Scots, which stems from the work of the From Inglis to Scots (FITS) project. We have developed a systematic approach to the relationships between individual spellings and proposed sound values, and recorded these mutual links in a relational database. In this paper, we introduce the theoretical underpinnings of sound-to-spelling and spelling-to-sound mappings, and show how a Scots root morpheme undergoes grapho-phonological parsing, the analytical procedure that is employed to break down spelling sequences into sound units. We explain the data collection and annotation for the FITS Corpus (Alcorn et al., forthcoming), drawing attention to the extensive meta-data which accompany each analysed unit of spelling and sound. The database records grammatical and lexical information about the root, the positional arrangement of segments within the root, labels for the nuclei, vowels and consonants, the morphological context, and extra-linguistic detail of the text a given root was taken from (date, place and text type). With this wealth of information, the FITS corpus is capable of answering complex queries about the sound and spelling systems of medieval Scots. We also suggest how our methodology can be transferred to other non-standardised spelling systems.

    Joseph Wright's English Dialect Dictionary (1898-1905) Computerised: architecture and retrieval routine

    The Innsbruck government-funded project SPEED (Spoken English in Early Dialects), scheduled for 2006 to 2009, has the aim of digitising and evaluating the famous English Dialect Dictionary by Joseph Wright (1898-1906). This paper topicalises the value of the electronic version of the dictionary and problems of its complex architecture, as well as the retrieval routine aimed at. The paper is an elaborated version of the Powerpoint presentation delivered at the conference. First of all, I try to prove the great value of Wright's dictionary from the point of view of English studies. On the other hand, given the mixed nature of the participants of the Dagstuhl conference, the paper tackles interface problems typically arising when printed texts are computerised, problems ranging from "normalisation" to aspects of parsing and of the design of the query mask.

    Lemmatization and lexicalized statistical parsing of morphologically rich languages: the case of French

    This paper shows that training a lexicalized parser on a lemmatized morphologically-rich treebank such as the French Treebank slightly improves parsing results. We also show that lemmatizing a similarly sized subset of the English Penn Treebank has almost no effect on parsing performance with gold lemmas and leads to a small drop in performance when automatically assigned lemmas and POS tags are used. This highlights two facts: (i) lemmatization helps to reduce lexicon data-sparseness issues for French, (ii) it also makes the parsing process sensitive to correct assignment of POS tags to unknown words.

    Split coordination in English: Why we need parsed corpora

    In this article we provide a practical demonstration of how syntactically annotated corpora can be used to investigate research questions with a diachronic depth and synchronic breadth that would not otherwise be possible. The phenomenon under investigation, split coordination, affects every type of coordinated constituent (subject/object DPs, predicate and attributive ADJPs, ADVPs, PPs, and DP objects of P) in Old English; and it, or a superficially similar construction, occurs continuously throughout the attested period from approx. 800 to the present day. We bring to bear different types of evidence to argue that split coordination in fact represents two different constructions, one of which remains stable over time while the other is lost in the post-Middle English period.