266 research outputs found

    Collocations on the way : How words come together in Russian

    This thesis addresses collocations and their behavior on the basis of Russian language data. Over the course of four articles, I develop a better understanding of collocations grounded in a corpus-driven approach. Collocations are defined as statistically significant co-occurrences of tokens or lexemes within a syntactic phrase that are extracted by statistics-based automatic analysis tools and are restricted to various extents, from semantically non-idiomatic expressions to full idioms. In the article “Evaluation of collocation extraction methods for the Russian language” (2017), my co-authors and I discuss the methods used to extract statistical collocations and compare five metrics for extracting statistics-based collocations, as well as raw frequency. This research demonstrates, first, that the results of the discussed metrics are often correlated and, second, that the degree of idiomaticity of the extracted units varies significantly. In “What do we get from extracting collocations? Linguistic analysis of automatically obtained Russian MWEs” (2015), I compare the empirical and phraseological perspectives on collocations and attempt to position empirical collocations within the scope of a phraseological theory. This research demonstrates that empirical collocations have different tendencies to form idiomatic lexical units, and it reveals the shortcomings of describing the idiomaticity of expressions in terms of strict classes. In “Choosing between lexeme vs. token in Russian collocations” (2019), I examine grammatical profiling as a method for defining the optimal level of representation for collocations. I demonstrate that collocations have different distributional preferences across the corpus. I also analyze the relationship between token and lexeme collocations based on the degree to which their grammatical profiles resemble the grammatical profiles of their headwords (although the border between the two types is not clear-cut), and I offer a plausible method of differentiating between these two collocation types. Finally, in “Constructional generalization over Russian collocations” (2016), my co-authors and I present the main concepts of Construction Grammar and show that a substantial number of automatically extracted collocations form clusters of words that belong to the same semantic class, even when they are not idiomatic. Such constructional generalizations show that there is a more abstract level on which collocations can be stable as a class rather than on the level of single collocations.

    Constructional generalization over Russian collocations

    The CoCoCo project aims to model multi-word expressions (MWEs) of diverse natures in a unified fashion. The algorithm predicts the most stable features in an n-gram: morphological, lexical, or constructional. In this article, we focus more on lexical compatibility of extracted collocations. At one extreme are lexically stable idioms, where no generalization is possible, e.g., lo and behold. Other collocations appear to be stable on a more abstract level of generalization. They are constructions where lexical items are replaceable but belong to the same semantic class, e.g., sleight of [hand/mouth/mind]. In this case, prediction of the entire semantic class is possible. To confirm this idea, we present a qualitative analysis of automatically extracted Russian MWEs. We then use distributional semantics methods to find semantic classes automatically and demonstrate that these correspond with manually annotated classes. This implies that the semantic classes can be used in the collocation detection algorithm. Peer reviewed.

    Computational approaches to semantic change

    Semantic change (how the meanings of words change over time) has preoccupied scholars since well before modern linguistics emerged in the late 19th and early 20th century, ushering in a new methodological turn in the study of language change. Compared to changes in sound and grammar, semantic change is the least understood. Ever since, the study of semantic change has progressed steadily, accumulating a vast store of knowledge for over a century, encompassing many languages and language families. Historical linguists also early on realized the potential of computers as research tools, with papers at the very first international conferences in computational linguistics in the 1960s. Such computational studies still tended to be small-scale, method-oriented, and qualitative. However, recent years have witnessed a sea-change in this regard. Big-data empirical quantitative investigations are now coming to the forefront, enabled by enormous advances in storage capability and processing power. Diachronic corpora have grown beyond imagination, defying exploration by traditional manual qualitative methods, and language technology has become increasingly data-driven and semantics-oriented. These developments present a golden opportunity for the empirical study of semantic change over both long and short time spans. A major challenge presently is to integrate the hard-earned knowledge and expertise of traditional historical linguistics with cutting-edge methodology explored primarily in computational linguistics. The idea for the present volume came out of a concrete response to this challenge. The 1st International Workshop on Computational Approaches to Historical Language Change (LChange'19), at ACL 2019, brought together scholars from both fields.
This volume offers a survey of this exciting new direction in the study of semantic change, a discussion of the many remaining challenges that we face in pursuing it, and considerably updated and extended versions of a selection of the contributions to the LChange'19 workshop, addressing both more theoretical problems, e.g., discovery of "laws of semantic change", and practical applications, such as information retrieval in longitudinal text archives.
