173 research outputs found

    Discovering multiword expressions

    In this paper, we provide an overview of research on multiword expressions (MWEs) from a natural language processing perspective. We examine methods developed for modelling MWEs that capture some of their linguistic properties, discussing their use for MWE discovery and for idiomaticity detection. We concentrate on their collocational and contextual preferences, along with their fixedness in terms of canonical forms and their lack of word-for-word translatability. We also discuss a sample of the MWE resources that have been used in intrinsic evaluation setups for these methods.
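One common way to quantify the collocational preferences mentioned above is an association measure such as pointwise mutual information (PMI). The sketch below is only an illustration under stated assumptions: the abstract does not say which measures the paper covers, and the function name `bigram_pmi`, the `min_count` threshold, and the toy sentence are all hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Return {(w1, w2): PMI in bits} for adjacent pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy corpus (hypothetical); a real study would use a large corpus.
tokens = ("he kicked the bucket and she kicked the bucket "
          "while the dog saw the cat near the door").split()
scores = bigram_pmi(tokens)
for pair, pmi in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(pmi, 2))
```

Pairs with high PMI co-occur more often than their individual frequencies would predict, which makes them candidates for MWE discovery; the frequency cutoff matters because PMI overrates rare pairs.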

    Geometry of compositionality

    Word embedding is a popular representation of words in vector space, and its geometry reveals lexical semantics. This thesis further explores the geometric properties of word embeddings and examines their interaction with the context representation. We propose a method to detect whether a given word or phrase is used literally in a specific context. This work focuses on three specific applications in natural language processing: idiomaticity, sarcasm and metaphor detection. Extensive experiments have shown that this embedding-based method achieves good performance in multiple languages.

    Exploiting multilingual lexical resources to predict MWE compositionality

    Semantic idiomaticity is the extent to which the meaning of a multiword expression (MWE) cannot be predicted from the meanings of its component words. Much work in natural language processing on semantic idiomaticity has focused on compositionality prediction, wherein a binary or continuous-valued compositionality score is predicted for an MWE as a whole, or its individual component words. One source of information for making compositionality predictions is the translation of an MWE into other languages. This chapter extends two previously presented studies – Salehi & Cook (2013) and Salehi et al. (2014) – that propose methods for predicting compositionality that exploit translation information provided by multilingual lexical resources, and that are applicable to many kinds of MWEs in a wide range of languages. These methods make use of distributional similarity of an MWE and its component words under translation into many languages, as well as string similarity measures applied to definitions of translations of an MWE and its component words. We evaluate these methods over English noun compounds, English verb-particle constructions, and German noun compounds. We show that the estimation of compositionality is improved when using translations into multiple languages, as compared to simply using distributional similarity in the source language. We further find that string similarity complements distributional similarity.
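The string-similarity idea described above can be sketched as follows. This is an assumption-laden illustration, not the chapter's method: it uses Python's `difflib.SequenceMatcher` as a stand-in similarity measure, a single target language (French) instead of many, and hand-picked translations. The function names `string_sim` and `compositionality_score` are hypothetical.

```python
from difflib import SequenceMatcher

def string_sim(a, b):
    """Similarity in [0, 1] between two strings (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def compositionality_score(mwe_translation, component_translations):
    """Mean string similarity between an MWE's translation and the
    translations of its component words."""
    sims = [string_sim(mwe_translation, c) for c in component_translations]
    return sum(sims) / len(sims)

# Hand-picked French translations, for illustration only.
credit_card = compositionality_score("carte de crédit", ["crédit", "carte"])
red_tape = compositionality_score("paperasserie", ["rouge", "ruban"])

print(f"credit card: {credit_card:.2f}")
print(f"red tape:    {red_tape:.2f}")
```

The intuition is that a compositional expression tends to be translated with the translations of its parts, so the strings overlap, whereas an idiomatic expression usually is not. Averaging such scores over several target languages would follow the multilingual setup the abstract describes.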

    The automatic processing of multiword expressions in Irish

    It is well documented that Multiword Expressions (MWEs) pose a unique challenge to a variety of NLP tasks such as machine translation, parsing, information retrieval, and more. For low-resource languages such as Irish, these challenges can be exacerbated by the scarcity of data and a lack of research on this topic. In order to improve handling of MWEs in various NLP tasks for Irish, this thesis will address the lack of resources specifically targeting MWEs in Irish, and examine how these resources can be applied to said NLP tasks. We report on the creation and analysis of a number of lexical resources as part of this PhD research. Ilfhocail, a lexicon of Irish MWEs, is created through extracting MWEs from other lexical resources such as dictionaries. A corpus annotated with verbal MWEs in Irish is created for the inclusion of Irish in the PARSEME Shared Task 1.2. Additionally, MWEs were tagged in a bilingual EN-GA corpus for inclusion in experiments in machine translation. For the purposes of annotation, a categorisation scheme for nine categories of MWEs in Irish is created, based on combining linguistic analysis of these types of constructions and cross-lingual frameworks for defining MWEs. A case study in applying MWEs to NLP tasks is undertaken, exploring the incorporation of MWE information while training Neural Machine Translation systems. Finally, the topic of automatic identification of Irish MWEs is explored, documenting the training of a system capable of automatically identifying Irish MWEs from a variety of categories, and the challenges associated with developing such a system. This research contributes towards a greater understanding of Irish MWEs and their applications in NLP, and provides a foundation for future work in exploring other methods for the automatic discovery and identification of Irish MWEs, and further developing the MWE resources described above.

    Multiword expressions

    Multiword expressions (MWEs) are a challenge for both natural language applications and linguistic theory because they often defy the application of the machinery developed for free combinations, where the default is that the meaning of an utterance can be predicted from its structure. There is a rich body of primarily descriptive work on MWEs for many European languages, but little comparative work. The volume brings together MWE experts to explore the benefits of a multilingual perspective on MWEs. The ten contributions in this volume look at MWEs in Bulgarian, English, French, German, Maori, Modern Greek, Romanian, Serbian, and Spanish. They discuss prominent issues in MWE research such as the classification of MWEs, their formal grammatical modeling, and the description of individual MWE types from the point of view of different theoretical frameworks, such as Dependency Grammar, Generative Grammar, Head-driven Phrase Structure Grammar, Lexical Functional Grammar, and Lexicon Grammar.

    Lexikální idiomy v angličtině (Lexical idioms in English)

    According to the standard definition, phraseology deals with multi-word lexical units, i.e. word combinations. Voices claiming that even complex words composed of two or more meaningful units may qualify for the status of (lexical) phrasemes/idioms, especially when their meaning is non-compositional, are still very isolated, in spite of the fact that linguistic literature is teeming with references to idiomatic compounds and derivatives (Chap. 3). In fact, the only systematic treatment of lexical idioms seems to be that offered by Čermák (2007), who focuses primarily on lexical idioms in Czech. The aim of the thesis is therefore to explore the situation in English, to attempt to develop a useful definition of lexical idioms and especially criteria for distinguishing them from other complex lexemes, and to outline the main types of lexical idioms found in English. After an introduction (Chap. 1) and the presentation of state-of-the-art approaches to phraseology and the relevant information about phraseological units and their features (Chap. 2), the thesis reviews Čermák's theory of lexical idioms and the quantitative study it inspired (Chap. 4). The core part is the analysis of two samples. The first one, gathered from the BNC, includes a random selection of 1000 single-word lemmas. It served as a test sample not only for distinguishing simplex lemmas from complex ones but, above all, for identifying potential...
    Department of the English Language and ELT Methodology, Faculty of Arts

    Current trends

    Deep parsing is the fundamental process aiming at the representation of the syntactic structure of phrases and sentences. In the traditional methodology, this process is based on lexicons and grammars that roughly represent the properties of words and the interactions of words and structures in sentences. Several linguistic frameworks, such as Head-driven Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining Grammar (TAG), Combinatory Categorial Grammar (CCG), etc., offer different structures and combining operations for building grammar rules. These already contain mechanisms for expressing properties of Multiword Expressions (MWEs), which, however, need improvement in how they account for the idiosyncrasies of MWEs on the one hand and their similarities to regular structures on the other. This collaborative book constitutes a survey of various attempts at representing and parsing MWEs in the context of linguistic theories and applications.

    Representation and parsing of multiword expressions

    This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs (such as verbal, adverbial and nominal MWEs), various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew, and Norwegian), and various applications (namely MWE detection, parsing, and automatic translation), using both symbolic and statistical approaches.

    Unsupervised compositionality prediction of nominal compounds

    Nominal compounds such as red wine and nut case display a continuum of compositionality, with varying contributions from the components of the compound to its semantics. This article proposes a framework for compound compositionality prediction using distributional semantic models, evaluating to what extent they capture idiomaticity compared to human judgments. For evaluation, we introduce data sets containing human judgments in three languages: English, French, and Portuguese. The results obtained reveal a high agreement between the models' predictions and human judgments, suggesting that the models are able to incorporate information about idiomaticity. We also present an in-depth evaluation of various factors that can affect prediction, such as model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size on the ability of the model to predict compositionality, and show that a uniform combination of the components gives the best results.
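A minimal sketch of this kind of prediction, under stated assumptions: the compound's own vector is compared, by cosine similarity, with a weighted sum of its component vectors. The abstract does not specify the exact composition operations used, so the additive form, the `beta` parameter (0.5 gives the uniform combination), and the three-dimensional toy vectors are all hypothetical; real vectors would come from a distributional model trained on a corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def compositionality(compound_vec, head_vec, modifier_vec, beta=0.5):
    """Similarity between the compound's vector and a weighted sum of its
    component vectors; beta=0.5 weights head and modifier equally."""
    composed = [beta * h + (1 - beta) * m for h, m in zip(head_vec, modifier_vec)]
    return cosine(compound_vec, composed)

# Toy vectors (hypothetical values, chosen only to illustrate the contrast).
red, wine, red_wine = [1.0, 0.0, 0.2], [0.2, 1.0, 0.0], [0.6, 0.9, 0.1]
nut, case, nut_case = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.05, 0.1, 1.0]

score_red_wine = compositionality(red_wine, wine, red)
score_nut_case = compositionality(nut_case, case, nut)
print(f"red wine: {score_red_wine:.2f}")
print(f"nut case: {score_nut_case:.2f}")
```

A score near 1 suggests the compound behaves distributionally like the combination of its parts (compositional), while a low score suggests idiomaticity. Such scores would then be correlated with human judgments for evaluation.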