
    Modeling Global Syntactic Variation in English Using Dialect Classification

    This paper evaluates global-scale dialect identification for 14 national varieties of English as a means for studying syntactic variation. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set of syntactic features using grammar induction rather than focusing on a few hand-selected features such as function words; and (iii) comparing models across both web corpora and social media corpora in order to measure the robustness of syntactic variation across registers.

    Frequency vs. Association for Constraint Selection in Usage-Based Construction Grammar

    A usage-based Construction Grammar (CxG) posits that slot-constraints generalize from common exemplar constructions. But what is the best model of constraint generalization? This paper evaluates competing frequency-based and association-based models across eight languages using a metric derived from the Minimum Description Length paradigm. The experiments show that association-based models produce better generalizations across all languages by a significant margin.

    Synthesis and reactivity of silylmethylcyclopropanes

    PhD thesis. Substituted tetrahydrofurans (THFs) are common structural motifs found in natural products. The biological activity and structural complexity of these compounds make their efficient construction with controlled regio- and stereochemistry a significant challenge in organic synthesis. This thesis investigates the use of silylmethylcyclopropanes as precursors for the efficient and practical synthesis of tetrahydrofurans. The first chapter reviews the relevant literature in four sections. The first section briefly reviews the current methods for the synthesis of tetrahydrofurans and discusses their advantages and disadvantages. Next, the concept of donor-acceptor cyclopropanes is introduced and examples of how they have been employed in tetrahydrofuran synthesis are given. The third section outlines the uses of silicon in organic synthesis, with particular reference to the physical and electronic influences of silicon on organic molecules. Finally, the chapter concludes with an overview of the application of Lewis acid promoted cycloaddition reactions of allylsilanes and silylmethylcyclopropanes to the preparation of tetrahydrofurans. The second chapter discusses the preparation and purification of unsubstituted silylmethylcyclopropanes, outlining the various conditions tried and the array of different substituents that may be attached to the silicon. The successful Lewis acid promoted [3+2] cycloaddition reaction of various silylmethylcyclopropanes with -keto-aldehydes is presented, together with a detailed account of the screening studies of different Lewis acids and aldehydes, and optimisation of reaction conditions. The advantages of having a ketone functionality in the final compound are practically demonstrated by way of several synthetic modifications to produce a range of chemically diverse compounds containing the tetrahydrofuran substructure.
The third chapter presents the synthesis of substituted silylmethylcyclopropanes and their attempted cyclisations using the conditions previously developed for unsubstituted silylmethylcyclopropanes. Following attempts to use Lewis acid-activated aldehydes in [3+2] cycloaddition reactions, and the consequent disadvantage of randomly trialling Lewis acids, chapter four presents our investigations into the use of NMR spectroscopy as a probe to establish a relative quantitative scale of carbonyl activation with different Lewis acids. Our studies into this method are presented along with the NMR data of several carbonyl-based Lewis bases complexed to the Lewis acids that proved successful in the cycloaddition reactions. Chapter five provides detailed experimental procedures and characterisation data for the compounds described within this thesis. Engineering and Physical Sciences Research Council

    Everyone You’ll Never Meet

    Everyone You’ll Never Meet is a multi-perspective mystery set in the fictional southern town of Ransom, South Carolina. It follows a young woman whose boyfriend disappears, a failed megachurch pastor at personal and professional crossroads, and a young father coming to grips with the shape of his life in light of a chance encounter with a murder victim.

    Modeling the Complexity and Descriptive Adequacy of Construction Grammars

    This paper uses the Minimum Description Length paradigm to model the complexity of CxGs (operationalized as the encoding size of a grammar) alongside their descriptive adequacy (operationalized as the encoding size of a corpus given a grammar). These two quantities are combined to measure the quality of potential CxGs against unannotated corpora, supporting discovery-device CxGs for English, Spanish, French, German, and Italian. The results show (i) that these grammars provide significant generalizations as measured using compression and (ii) that more complex CxGs with access to multiple levels of representation provide greater generalizations than single-representation CxGs.

    Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

    This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that there is a consistent agreement between these sources using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.

    Exposure and Emergence in Usage-Based Grammar: Computational Experiments in 35 Languages

    This paper uses computational experiments to explore the role of exposure in the emergence of construction grammars. While usage-based grammars are hypothesized to depend on a learner's exposure to actual language use, the mechanisms of such exposure have only been studied in a few constructions in isolation. This paper experiments with (i) the growth rate of the constructicon, (ii) the convergence rate of grammars exposed to independent registers, and (iii) the rate at which constructions are forgotten when they have not been recently observed. These experiments show that the lexicon grows more quickly than the grammar and that the growth rate of the grammar is not dependent on the growth rate of the lexicon. At the same time, register-specific grammars converge onto more similar constructions as the amount of exposure increases. This means that the influence of specific registers becomes less important as exposure increases. Finally, the rate at which constructions are forgotten when they have not been recently observed mirrors the growth rate of the constructicon. This paper thus presents a computational model of usage-based grammar that includes both the emergence and the unentrenchment of constructions.

    Validating and Exploring Large Geographic Corpora

    This paper investigates the impact of corpus creation decisions on large multi-lingual geographic web corpora. Beginning with a 427 billion word corpus derived from the Common Crawl, three methods are used to improve the quality of sub-corpora representing specific language-country pairs like New Zealand English: (i) the agreement of independent language identification systems, (ii) hash-based deduplication, and (iii) location-specific outlier detection. The impact of each of these steps is then evaluated at the language level and the country level by using corpus similarity measures to compare each resulting corpus with baseline data sets. The goal is to understand the impact of upstream data cleaning decisions on downstream corpora with a specific focus on under-represented languages and populations. The evaluation shows that the validity of sub-corpora is improved with each stage of cleaning but that this improvement is unevenly distributed across languages and populations. This result shows how standard corpus creation techniques can accidentally exclude under-represented populations.