4 research outputs found

    Collocation ranking: frequency vs semantics

    Collocations play a very important role in language description, especially in identifying the meanings of words. In modern lexicography, lists of collocates ranked by some statistical measure are an indispensable part of deducing meaning. In the paper, we present a comparison between two approaches to the ranking of collocates: (a) the logDice method, which is the dominant approach and is frequency-based, and (b) the fastText word embeddings method, which is new and semantics-based. The comparison was made on two Slovene datasets, one representing general language headwords and their collocates, and the other representing headwords and their collocates extracted from a language for special purposes corpus. In the experiment, two methods were used: for the quantitative part of the evaluation, we used supervised machine learning with a support-vector machine (SVM) algorithm, scored by the area under the ROC curve (AUC), and in the qualitative part the ranking results of the two methods were evaluated by lexicographers. The results were somewhat inconsistent: while the quantitative evaluation confirmed that the machine-learning-based approach produced better collocate ranking results than the frequency-based one, lexicographers in most cases considered the listings of collocates of both methods very similar.
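For orientation, the frequency-based side of the comparison can be sketched in a few lines. logDice is commonly defined as 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency and f_x, f_y are the frequencies of headword and collocate. The function names and the counts in the usage note are hypothetical and are not taken from the paper; this is a minimal sketch, not the authors' pipeline.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice score: 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of headword and collocate
    f_x:  frequency of the headword
    f_y:  frequency of the collocate
    """
    if f_xy <= 0:
        raise ValueError("co-occurrence frequency must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_collocates(headword_freq, collocates):
    """Rank collocates of one headword by descending logDice.

    collocates maps each collocate to a pair
    (co-occurrence frequency, collocate frequency).
    """
    scores = {
        word: log_dice(f_xy, headword_freq, f_y)
        for word, (f_xy, f_y) in collocates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

With made-up counts, `rank_collocates(1000, {"strong": (50, 200), "the": (50, 100000)})` places `"strong"` first: both candidates co-occur equally often with the headword, but the very frequent word is penalised by the denominator.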

    Semantics-driven recognition of collocations using word embeddings

    L2 learners often produce “ungrammatical” word combinations such as *give a suggestion or *make a walk. This is because of the “collocationality” of one of their items (the base) that limits the acceptance of collocates to express a specific meaning (‘perform’ above). We propose an algorithm that delivers, for a given base and the intended meaning of a collocate, the actual collocate lexeme(s) (make / take above). The algorithm exploits the linear mapping between bases and collocates from examples and generates a collocation transformation matrix which is then applied to novel unseen cases. The evaluation shows a promising line of research in collocation discovery. The present work has been partially funded by the Spanish Ministry of Economy and Competitiveness (MINECO), through a predoctoral grant (BES-2012-057036) in the framework of the project HARenES (FFI2011-30219-C02-02), and by the European Commission under the grant number H2020–645012–RIA. We also acknowledge support from the Maria de Maeztu Excellence Program (MDM-2015-0502).
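The idea of a transformation matrix learned from example pairs can be illustrated as follows. This is a minimal sketch that assumes the matrix is fitted by ordinary least squares and that candidates are retrieved by cosine similarity; the abstract does not say how the authors estimate or apply the matrix, and all function and parameter names here are hypothetical.

```python
import numpy as np


def learn_transformation(base_vecs, collocate_vecs):
    """Fit T minimising ||base_vecs @ T - collocate_vecs|| (least squares).

    base_vecs:      (n_examples, dim) embeddings of the bases
    collocate_vecs: (n_examples, dim) embeddings of their collocates
    """
    T, *_ = np.linalg.lstsq(base_vecs, collocate_vecs, rcond=None)
    return T


def predict_collocates(base_vec, T, vocab_vecs, vocab_words, k=1):
    """Map an unseen base through T and return the k nearest vocabulary words."""
    target = base_vec @ T
    # Cosine similarity between the mapped vector and every vocabulary vector.
    sims = (vocab_vecs @ target) / (
        np.linalg.norm(vocab_vecs, axis=1) * np.linalg.norm(target)
    )
    top = np.argsort(-sims)[:k]
    return [vocab_words[i] for i in top]
```

Under these assumptions, one matrix would be trained per intended meaning (for example ‘perform’), using pairs such as the embeddings of a base and of the collocate that expresses that meaning with it.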

    Meaning in poetry: semantic annotation of verse with the Historical Thesaurus of English

    This thesis addresses the current gap in semantic annotation of poetry by presenting the first semantic tagging system specifically designed to disambiguate senses in a diachronic corpus of poetry. The ‘ambiguity tagger’ developed for this purpose utilises the hierarchical taxonomy of the Historical Thesaurus of English (HTE; Kay 2011: 42) to assign conceptual ‘tags’ to lexical items that denote the meaning of the word in context, with multiple meanings assigned to ambiguous words. The ambiguity tagger encompasses a configurable pipeline for semantic annotation, thus presenting a more flexible alternative to existing applications (Piao et al. 2005a; Rayson 2009a; Piao et al. 2017). To train the tagger, a corpus was curated from the Oxford Book of English Verse, containing poetry from the early 16th to the late 19th century (OBEV; Quiller-Couch 1919/1999). As the ambiguity tagger allows multiple meanings to be assigned to individual words in the corpus, without restricting the maximum number of senses, the semantic metadata produced by the tagger is unique in its breadth. Correspondingly, the analysis sections of the thesis look at different techniques for interpreting the data, using case studies from the OBEV corpus. Both macro- and micro-level approaches to analysing the data are explored, highlighting the benefits of the ambiguity tagger at different levels of critical analysis. To further explore the capabilities of semantic annotation with HTE data, this research extends the interpretative analysis of the semantic metadata gained through the ambiguity tagger by presenting a systematic approach for analysing the significant co-occurrence of concepts in the text. This process borrows the framework for identifying significantly co-occurring words (collocates) and extends this into a measure of ‘semantic collocation’, thus significantly expanding on existing research in this field (Alexander et al. 2015a; Archer & Malory 2015; 2017). By shifting the focus from lexical collocation to the significant co-occurrence of ‘meaning’ in texts, this approach reveals a pattern of previously inaccessible textual data for analysis and marks a further methodological contribution of this research.
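The move from word collocation to concept co-occurrence can be sketched by applying an association measure to concept tags instead of word forms. The abstract does not name the statistic used in the thesis; the sketch below substitutes pointwise mutual information (PMI) as an assumption, treats each text unit (for example a line of verse) as a set of tags, and uses invented tag labels that are not actual HTE categories.

```python
import math
from collections import Counter
from itertools import combinations


def concept_pmi(units):
    """PMI for every pair of concept tags that co-occur in at least one unit.

    units: iterable of tag collections, one per text unit.
    Returns {(tag_a, tag_b): pmi} with tag_a < tag_b alphabetically.
    """
    units = [set(tags) for tags in units]
    n = len(units)
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in units:
        tag_counts.update(tags)
        # Sorting gives each unordered pair one canonical key.
        pair_counts.update(combinations(sorted(tags), 2))
    return {
        (a, b): math.log2(count * n / (tag_counts[a] * tag_counts[b]))
        for (a, b), count in pair_counts.items()
    }
```

Because an ambiguous word may carry several tags, a unit can contribute many pairs; whether tags from the same word should be paired with each other is a design decision this sketch leaves open.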