2 research outputs found

Collocations on the way : How words come together in Russian

Author: Kormacheva Daria
Publication venue: 'University of Helsinki Libraries'
Publication date: 15/06/2020

This thesis addresses the topic of collocations and their behavior based on Russian language data. In the course of four articles, I develop a better understanding of collocations that is based on a corpus-driven approach. Collocations are defined as statistically significant co-occurrences of tokens or lexemes within a syntactic phrase that are extracted by statistics-based automatic analysis tools and are restricted to various extents: from semantically non-idiomatic expressions to full idioms. In the article “Evaluation of collocation extraction methods for the Russian language” (2017), my co-authors and I discuss the methods used to extract statistical collocations and compare five metrics for extracting statistics-based collocations with raw frequency. This research has demonstrated, first, that the results of the discussed metrics are often correlated and, second, that the degree of idiomaticity of the extracted units varies significantly. In “What do we get from extracting collocations? Linguistic analysis of automatically obtained Russian MWEs” (2015), I compare the empirical and phraseological perspectives on collocations and introduce research in which I attempt to position empirical collocations within the scope of a phraseological theory. This research demonstrates that empirical collocations have different tendencies to form idiomatic lexical units, and I reveal the shortcomings of describing the idiomaticity of expressions in terms of strict classes. In “Choosing between lexeme vs. token in Russian collocations” (2019), I examine grammatical profiling as a method used to define the optimal level of representation for collocations. I have demonstrated that collocations have different distributional preferences across the corpus. 
I have also analyzed the relationship between token and lexeme collocations based on the degree to which their grammatical profiles resemble the grammatical profiles of their headwords (although the border between the two types is not clear-cut). I also offered a plausible method of differentiating between these two collocation types. Finally, in “Constructional generalization over Russian collocations” (2016), my co-authors and I present the main concepts of Construction Grammar and introduce research in which a substantial number of automatically extracted collocations were shown to form clusters of words that belong to the same semantic class, even when they are not idiomatic. Such constructional generalizations have shown that there is a more abstract level on which collocations can be stable as a class rather than on the level of single collocations.

In my dissertation, I examine collocations and their behavior in Russian-language research data. The dissertation consists of four articles in which I set out my view of collocations, grounded in corpus-driven research. A collocation is defined here as a statistically significant co-occurrence of tokens or lexemes within a phrase that can be extracted from running text with a statistics-based automatic analysis tool and that shows certain restrictions on its use. These restrictions range from non-idiomatic semantic constraints to fully idiomatized expressions. In the article “Evaluation of collocation extraction methods for the Russian language” (2017), my co-authors and I discuss different ways of evaluating automatically extracted collocations, and I compare the results of five collocation extraction tools based on statistical analysis with the frequencies found in raw Russian-language data. The study shows that the results of the different tools correlate with one another and that the degree of idiomaticity of the extracted units varies significantly. In the article “What do we get from extracting collocations? Linguistic analysis of automatically obtained Russian MWEs” (2015), I compare how collocation is understood in empirical corpus research and in phraseological theories, and I seek to determine the position of empirically extracted collocations within the theoretical framework offered by phraseological research. My research shows that empirically extracted collocations differ in their tendency to form idiomatic lexical units, which makes dividing idiomatic expressions into clearly distinct classes problematic. In my article “Choosing between lexeme vs. token in Russian collocations” (2019), I aim to show that grammatical profiling of expressions is a means of defining the optimal representation of collocations. The results show that collocations differ in their distributional preferences in the corpus data and that the relationship between token and lexeme collocations can be clarified by comparing the grammatical profiles of a collocation and its headword. Drawing the line between the two collocation types involves certain problems, and in the article I offer a usable and workable method for resolving them. In the article “Constructional generalization over Russian collocations” (2016), collocations are examined within the framework of Construction Grammar, and it is shown that a large share of automatically extracted collocations form clusters of words belonging to the same semantic field, even when the words are not necessarily idiomatic. Such constructional generalizations have shown that there is a level of abstraction above individual collocations at which collocations can be perceived as classes.

Helsingin yliopiston digitaalinen arkisto

The Prime Machine: a user-friendly corpus tool for English language teaching and self-tutoring based on the Lexical Priming theory of language

Author: Jeaco Stephen Mark

This thesis presents the design and evaluation of a new concordancer called The Prime Machine which has been developed as an English language learning and teaching tool. The software has been designed to provide learners with a multitude of examples from corpus texts and additional information about the contextual environment in which words and combinations of words tend to occur. The prevailing view of how language operates has been that grammar and lexis are separate systems and sentences can be constructed merely by choosing any syntactic structure and slotting in vocabulary. Over the last few decades, however, corpus linguistics has presented challenges to this view of language, drawing on evidence which can be found in the patterning of language choices in texts. Nevertheless, despite some reports of success from researchers in this area, only a limited number of teachers and learners of a second language seem to make direct use of corpus software tools. The desire to develop a new corpus tool grew out of professional experience as an English language teacher and manager in China. This thesis begins by introducing some background information about the role of English in international higher education and the language learning context in China, and then goes on to describe the software architecture and the process by which corpus texts are transformed from their raw state into rows of data in a sophisticated database to be accessed by the concordancer. It then introduces innovations including several aspects of the search screen interface, the concordance line display and the use of collocation data. The software provides a rich learning platform for language learners to independently look up and compare similar words, different word forms, different collocations and the same words across two corpora. Underpinning the design is a view of language which draws on Michael Hoey's theory of Lexical Priming. 
The software is designed to make it possible to see tendencies of words and phrases which are not usually apparent in either dictionary examples or the output from other concordancing software. The design features are considered from a pedagogical perspective, focusing on English for Academic Purposes and including important software design principles from Computer Aided Language Learning. Through a small evaluation involving undergraduate students, the software has been shown to have great potential as a tool for the writing process. It is believed that The Prime Machine will be a very useful corpus tool which, while simple to operate, provides a wealth of information for English language teaching and self-tutoring

University of Liverpool Repository