    Inadequacy of the chi-squared test to examine vocabulary differences between corpora

    Pearson's chi-squared test is probably the most popular statistical test used in corpus linguistics, particularly for studying linguistic variation between corpora. Oakes and Farrow (Literary and Linguistic Computing, 2007, 22, 85-99) proposed various adaptations of this test in order to allow for the simultaneous comparison of more than two corpora, while also yielding an almost correct Type I error rate (i.e. claiming that a word is most frequently found in a variety of English, when in actuality this is not the case). By means of resampling procedures, the present study shows that when used in this context, the chi-squared test produces far too many significant results, even in its modified version. Several potential approaches to circumventing this problem are discussed in the conclusion.
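
    The contrast drawn in this abstract can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the function names, the toy data, and the choice of a permutation test that reshuffles whole texts between corpora are assumptions made for illustration. The sketch computes Pearson's chi-squared statistic for one word across corpora, then estimates a p-value by resampling, treating texts rather than individual tokens as the units of observation.

```python
import random


def chi_squared(word_counts, corpus_sizes):
    """Pearson's chi-squared statistic for one word across k corpora (2 x k table)."""
    total_word = sum(word_counts)
    total_tokens = sum(corpus_sizes)
    if total_word == 0 or total_word == total_tokens:
        return 0.0  # the word is absent (or is every token): nothing to compare
    stat = 0.0
    for observed, size in zip(word_counts, corpus_sizes):
        # Expected counts under the null hypothesis of equal relative frequency.
        exp_word = total_word * size / total_tokens
        exp_other = (total_tokens - total_word) * size / total_tokens
        stat += (observed - exp_word) ** 2 / exp_word
        stat += ((size - observed) - exp_other) ** 2 / exp_other
    return stat


def permutation_p_value(texts, labels, n_resamples=2000, seed=0):
    """Estimate a p-value by shuffling corpus labels over whole texts.

    texts  : list of (word_count, text_length) pairs, one per text
    labels : corpus identifier for each text, in the same order
    """
    rng = random.Random(seed)
    corpora = sorted(set(labels))

    def statistic(current_labels):
        counts = [sum(w for (w, _), lab in zip(texts, current_labels) if lab == c)
                  for c in corpora]
        sizes = [sum(n for (_, n), lab in zip(texts, current_labels) if lab == c)
                 for c in corpora]
        return chi_squared(counts, sizes)

    observed = statistic(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical toy data: all 30 occurrences of the word come from a single text.
texts = [(30, 1000), (0, 1000), (0, 1000), (0, 1000), (0, 1000), (0, 1000)]
labels = ["A", "A", "A", "B", "B", "B"]

stat = chi_squared([30, 0], [3000, 3000])
p = permutation_p_value(texts, labels)
```

    In this toy case the token-level statistic is far above 3.84, the 5% critical value for one degree of freedom, so the chi-squared test would report a significant difference. The text-level permutation test does not, because every reshuffling simply moves the one word-rich text into one corpus or the other.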

    Lexical frequency and vocabulary sequencing in Spanish graded readers

    This article examines the distribution of words and collocations in a corpus of Spanish graded readers across different levels of proficiency. The main aim is to verify whether there is any relation between lexical frequency as registered in a general corpus of Spanish and the distribution of vocabulary items (single words and collocations) in texts of different levels, such that the number of infrequent items increases as the proficiency level rises. Such a relation cannot be taken for granted in the case of Spanish graded readers, since a review of the literature suggests that factors other than vocabulary selection (namely, grammatical features) have been given priority in creating texts for a given proficiency level.
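
    The kind of comparison described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rank cutoff, the frequency ranks, and the two tiny "readers" are hypothetical, and the article's actual measure is not specified in the abstract. The sketch computes, for each level, the share of tokens that fall outside the most frequent words of a general reference corpus.

```python
def infrequent_share(tokens, frequency_rank, rank_cutoff=2000):
    """Share of tokens ranked beyond `rank_cutoff` in a reference frequency list.

    Words missing from the list are treated as infrequent (an assumption).
    """
    if not tokens:
        return 0.0
    rare = sum(
        1 for t in tokens
        if frequency_rank.get(t.lower(), rank_cutoff + 1) > rank_cutoff
    )
    return rare / len(tokens)


# Hypothetical frequency ranks from a general corpus of Spanish (1 = most frequent).
frequency_rank = {"la": 1, "casa": 150, "comer": 300, "alféizar": 25000}

# Hypothetical token lists standing in for graded readers at two levels.
readers = {
    "level_1": "la casa comer la casa".split(),
    "level_3": "la alféizar comer quimera casa".split(),
}

shares = {level: infrequent_share(toks, frequency_rank)
          for level, toks in readers.items()}
```

    If vocabulary is sequenced by frequency, this share would be expected to rise with the level; a flat or erratic profile would be consistent with the priority given to grammatical features.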

    Inductive category formation in content analysis: combining automatic and manual procedures

    The core of any content analysis is a category system, which is often developed inductively and qualitatively from a small sample of texts. Text mining methods now make it possible to explore a virtually unlimited number of texts efficiently, quickly and transparently. This article proposes a procedure in which such methods are used to derive categories for a content analysis inductively from a large text corpus. These methods are combined with a qualitative, manual content analysis. The combination works as follows: thematic main categories are first extracted from an existing text corpus by means of text mining, then validated manually, and finally extended with subcategories in a qualitative content analysis. The procedure is illustrated using a codebook that was developed and applied in evaluating the German federal government's citizens' dialogue ("Bürgerdialog") "Gut leben in Deutschland" on the topic of quality of life.

    Examining sociolinguistic variation in English word-formation by means of historical corpus research

    This dissertation studies how the productivity of word-formation varies across social groups in the history of the English language. Previous research into variation and change within the morphological productivity of derivational affixes has been hampered by the lack of suitable methods for comparing productivity measures across subcorpora. A further problem has been how to assess the statistical significance of the differences observed. The latter issue is also present in comparisons of word frequencies in diachronic corpus linguistics: previous work has tended to use tests which make the invalid assumption that words occur randomly in texts. Moreover, the question often arises whether the change observed is linguistic, stylistic or an artefact of the corpus. The present work explores sociolinguistic variation and change in the morphological productivity of the nominal suffixes -ness and -ity from Early Modern English to Present-day English, using materials such as the Corpora of Early English Correspondence and the British National Corpus. To do this, it employs robust methods to compare item frequencies over time and across social categories. Developed in collaboration with computer scientists, the methods include non-parametric measures of statistical significance as well as visualisations revealing variability within (sub)corpora and facilitating exploration. In addition to research into individual linguistic features, the methods can be used to compare corpora and study genre continuity at the levels of vocabulary and parts of speech. Besides corpus-linguistic methodology, the work contributes to the theory and description of derivational productivity. Firstly, it shows that each of the social categories studied - gender, social rank, and register in terms of participant relations - may have an influence on productivity, gender being the most consistent factor in the case of -ity. 
Furthermore, it shows that while productivity measures based on the frequency of hapax legomena, or words occurring only once in the corpus, are unusable in small corpora, they do function as expected in large corpora and remain theoretically valid. These findings should be taken into account in future research, and it is to be hoped that future studies will be significantly facilitated by the methodological contributions presented in this dissertation.
Using large electronic text corpora, this dissertation studies how the productivity of word-formation, that is, the probability of producing new words, varies across social groups in the history of the English language. It examines how much variation and what kinds of change occur in the productivity of the English nominal suffixes -ness and -ity from the 17th century to the present day. The results show that the productivity of the suffixes may be influenced by all of the social categories studied: the gender and social class of the language users and the relations between the participants in the communicative situation. For the suffix -ity, borrowed into English from French and Latin, the most significant category is gender, as use of the suffix is male-dominated in every period covered by the materials. This may be explained by gendered writing styles. The productivity of the native suffix -ness varies less. Variation and change in productivity have previously been difficult to study by corpus-linguistic means, because there have been no suitable methods for comparing measurements. Determining the statistical significance of the variation has also been problematic. The same problem has troubled the study of language change more generally. Moreover, it is often unclear whether an observed change relates to change in the language or in writing style, or whether it results from unevenness in the material. This study uses new methods, developed in collaboration with computer scientists, that enable reliable comparison of the frequency of linguistic features across time periods and social groups. In addition to measures of statistical significance, it presents visualisation methods that can be used to examine variation within the materials and to find new objects of study. The methods can also be used to compare corpora and to study change in genres.
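
    To make the notion of a hapax-based productivity measure concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the dissertation's method: the function name and toy tokens are hypothetical, suffixed words are found by naive string matching, and the ratio shown (hapax legomena with the suffix divided by all tokens with the suffix) is one common formulation of such a measure, not necessarily the one used in the work.

```python
from collections import Counter


def suffix_productivity(tokens, suffix):
    """Count tokens, types and hapax legomena among words ending in `suffix`.

    Matching with str.endswith is a simplification: a real study would need to
    exclude words that merely happen to end in the same letters.
    """
    counts = Counter(t.lower() for t in tokens if t.lower().endswith(suffix))
    n_tokens = sum(counts.values())
    n_hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": n_tokens,
        "types": len(counts),
        "hapaxes": n_hapaxes,
        # Hapaxes with the suffix divided by all tokens with the suffix.
        "hapax_ratio": n_hapaxes / n_tokens if n_tokens else 0.0,
    }


# Hypothetical toy sample; real subcorpora would be far larger.
sample = ["kindness", "kindness", "darkness", "happiness", "ability"]
ness = suffix_productivity(sample, "ness")
ity = suffix_productivity(sample, "ity")
```

    With so few tokens the ratio swings widely when a single word is added or removed, which is in line with the abstract's remark that hapax-based measures are unusable in small corpora.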

    The transformation of China’s legal profession and its representation: a critical discourse analysis

    This thesis argues that hegemonic struggles between the forces of colonization and adaptation, in the process of naturalizing the implanted Western legal profession in China, set a fundamental context without which the transformation of the Chinese legal profession cannot be fully understood. The thesis also argues that LegalTech, which is embedded in the digital transformation of nearly everything in today's society, has enabled various social groups (that were once excluded from the legal industry by various professional monopoly mechanisms) to enter the Chinese legal field successfully. Different groups of field players compete to construct discourses of professionalism to legitimate their ways of producing legal services and organizing the producers. This research conducted a corpus-assisted critical discourse analysis, coupled with framing analysis, to uncover the frames that some British and Chinese newspapers had used to advocate different versions of professionalism in their competitive framing of the same series of lawyer detention events that happened in China between 2015 and 2018. It employed the same methodology to find the frames that various kinds of publications had deployed to organize ideas around LegalTech, especially the discourses on the implications of the rise of LegalTech for legal services production, access to justice, and the existential state of legal professionals. British newspapers developed a "war on law" frame to cover the series of lawyer detention events in China. Chinese newspapers constructed a counter frame of "law and order for the lawyers" to organize the news on the same events. This research identified an "access to justice" frame, which argues that LegalTech can improve the efficiency and effectiveness of legal service production and widen people's access to justice. There is also a "disruptive innovation" frame that focuses on the disruptive effects that LegalTech brings to the old ways of legal services production and to the existential state of traditional legal professionals.