Discovering multiword expressions
In this paper, we provide an overview of research on multiword expressions (MWEs) from a natural language processing perspective. We examine methods developed for modelling MWEs that capture some of their linguistic properties, discussing their use for MWE discovery and for idiomaticity detection. We concentrate on their collocational and contextual preferences, along with their fixedness in terms of canonical forms and their lack of word-for-word translatability. We also discuss a sample of the MWE resources that have been used in intrinsic evaluation setups for these methods.
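As a rough illustration of what a "collocational preference" can look like in practice, the sketch below scores adjacent word pairs with pointwise mutual information (PMI), one widely used association measure. It is not taken from the paper: the function name `bigram_pmi` and the toy sentence are assumptions made only for this example.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return the PMI (log base 2) of every adjacent word pair in `tokens`.

    Hypothetical helper for illustration; probabilities are plain
    relative frequencies with no smoothing.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy corpus (made up for this sketch, far too small for real statistics).
tokens = "he kicked the bucket and she kicked the ball near the bucket".split()
scores = bigram_pmi(tokens)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(value, 3))
```

On a realistic corpus, candidate MWEs would typically be ranked by such a score and then filtered, for example by a frequency threshold, since PMI is known to overrate rare pairs.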
Geometry of compositionality
Word embeddings are a popular representation of words in a vector space, and their geometry reveals lexical semantics. This thesis further explores the geometric properties of word embeddings and examines their interaction with context representations. We propose a method to detect whether a given word or phrase is used literally in a specific context. This work focuses on three applications in natural language processing: idiomaticity, sarcasm and metaphor detection. Extensive experiments show that this embedding-based method achieves good performance in multiple languages.
Exploiting multilingual lexical resources to predict MWE compositionality
Semantic idiomaticity is the extent to which the meaning of a multiword expression (MWE) cannot be predicted from the meanings of its component words. Much
work in natural language processing on semantic idiomaticity has focused on compositionality prediction, wherein a binary or continuous-valued compositionality
score is predicted for an MWE as a whole, or its individual component words. One
source of information for making compositionality predictions is the translation
of an MWE into other languages. This chapter extends two previously-presented
studies – Salehi & Cook (2013) and Salehi et al. (2014) – that propose methods for
predicting compositionality that exploit translation information provided by multilingual lexical resources, and that are applicable to many kinds of MWEs in a
wide range of languages. These methods make use of distributional similarity of
an MWE and its component words under translation into many languages, as well
as string similarity measures applied to definitions of translations of an MWE and
its component words. We evaluate these methods over English noun compounds,
English verb-particle constructions, and German noun compounds. We show that
the estimation of compositionality is improved when using translations into multiple languages, as compared to simply using distributional similarity in the source
language. We further find that string similarity complements distributional similarity.
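The translation-based idea described above can be sketched as follows: if the translation of an MWE looks like the translations of its component words, the MWE is likely to be compositional. This is only a minimal sketch under stated assumptions. The string measure used here (`difflib.SequenceMatcher` from the Python standard library) is a stand-in, not necessarily one of the measures used in the chapter, and the function names and the tiny translation tables are illustrative assumptions rather than data from a real lexical resource.

```python
from difflib import SequenceMatcher

def string_sim(a, b):
    """Similarity in [0, 1] between two strings (stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()

def compositionality_score(mwe_translations, component_translations):
    """Average string similarity between an MWE's translation and the
    translations of its components, averaged again over languages.

    Both arguments are dicts keyed by language code (hypothetical layout):
    mwe_translations[lang] is a string, component_translations[lang] is a
    list with one translation per component word.
    """
    per_language = []
    for lang, mwe_trans in mwe_translations.items():
        sims = [string_sim(mwe_trans, comp)
                for comp in component_translations[lang]]
        per_language.append(sum(sims) / len(sims))
    return sum(per_language) / len(per_language)

# Illustrative entries only, not drawn from the resources used in the chapter.
public_service = compositionality_score(
    {"fr": "service public", "it": "servizio pubblico"},
    {"fr": ["public", "service"], "it": ["pubblico", "servizio"]},
)
hot_dog = compositionality_score(
    {"fr": "hot-dog"},
    {"fr": ["chaud", "chien"]},
)
print(public_service > hot_dog)
```

Averaging over several languages is the point of the design: a single language may happen to translate an idiom word for word, but this is less likely to hold across many languages at once.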
The automatic processing of multiword expressions in Irish
It is well-documented that Multiword Expressions (MWEs) pose a unique challenge
to a variety of NLP tasks such as machine translation, parsing, information retrieval,
and more. For low-resource languages such as Irish, these challenges can be exacerbated by the scarcity of data, and a lack of research in this topic. In order to
improve handling of MWEs in various NLP tasks for Irish, this thesis will address
both the lack of resources specifically targeting MWEs in Irish, and examine how
these resources can be applied to said NLP tasks.
We report on the creation and analysis of a number of lexical resources as part
of this PhD research. Ilfhocail, a lexicon of Irish MWEs, is created through extracting
MWEs from other lexical resources such as dictionaries. A corpus annotated
with verbal MWEs in Irish is created for the inclusion of Irish in the PARSEME
Shared Task 1.2. Additionally, MWEs were tagged in a bilingual EN-GA corpus
for inclusion in experiments in machine translation. For the purposes of annotation, a categorisation scheme for nine categories of MWEs in Irish is created, based
on combining linguistic analysis on these types of constructions and cross-lingual
frameworks for defining MWEs.
A case study in applying MWEs to NLP tasks is undertaken, with the exploration of incorporating MWE information while training Neural Machine Translation
systems. Finally, the topic of automatic identification of Irish MWEs is explored,
documenting the training of a system capable of automatically identifying Irish
MWEs from a variety of categories, and the challenges associated with developing
such a system.
This research contributes towards a greater understanding of Irish MWEs and
their applications in NLP, and provides a foundation for future work in exploring
other methods for the automatic discovery and identification of Irish MWEs, and
further developing the MWE resources described above.
Multiword expressions
Multiword expressions (MWEs) are a challenge for both natural language applications and linguistic theory because they often defy the machinery developed for free combinations, where the default is that the meaning of an utterance can be predicted from its structure. There is a rich body of primarily descriptive work on MWEs for many European languages, but comparative work is scarce. The volume brings together MWE experts to explore the benefits of a multilingual perspective on MWEs. The ten contributions in this volume look at MWEs in Bulgarian, English, French, German, Maori, Modern Greek, Romanian, Serbian, and Spanish. They discuss prominent issues in MWE research such as the classification of MWEs, their formal grammatical modeling, and the description of individual MWE types from the point of view of different theoretical frameworks, such as Dependency Grammar, Generative Grammar, Head-driven Phrase Structure Grammar, Lexical Functional Grammar, and Lexicon Grammar.
Lexical idioms in English (Lexikální idiomy v angličtině)
According to the standard definition, phraseology deals with multi-word lexical units, i.e. word combinations. Voices claiming that even complex words composed of two or more meaningful units may qualify for the status of (lexical) phrasemes/idioms, especially when their meaning is non-compositional, are still very isolated, even though the linguistic literature is teeming with references to idiomatic compounds and derivatives (Chap. 3). In fact, the only systematic treatment of lexical idioms seems to be that offered by Čermák (2007), who focuses primarily on lexical idioms in Czech.
The aim of the thesis is therefore to explore the situation in English, to attempt to develop a useful definition of lexical idioms and, especially, criteria for distinguishing them from other complex lexemes, and to outline the main types of lexical idioms found in English. After an introduction (Chap. 1) and the presentation of state-of-the-art approaches to phraseology and the relevant information about phraseological units and their features (Chap. 2), the thesis reviews Čermák's theory of lexical idioms and the quantitative study in Czech that it inspired (Chap. 4). The core part is the analysis of two samples. The first one, gathered from the BNC, is a random selection of 1000 single-word lemmas. It served as a test sample not only for distinguishing simplex lemmas from complex ones but, above all, for identifying potential...
Department of the English Language and ELT Methodology, Faculty of Arts
Current trends
Deep parsing is the fundamental process aiming at the representation of the syntactic
structure of phrases and sentences. In the traditional methodology this process is
based on lexicons and grammars representing roughly properties of words and interactions
of words and structures in sentences. Several linguistic frameworks, such as Head-driven
Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining
Grammar (TAG), Combinatory Categorial Grammar (CCG), etc., offer different
structures and combining operations for building grammar rules. These already contain
mechanisms for expressing properties of Multiword Expressions (MWE), which, however,
need improvement in how they account for idiosyncrasies of MWEs on the one
hand and their similarities to regular structures on the other hand. This collaborative
book constitutes a survey on various attempts at representing and parsing MWEs in the
context of linguistic theories and applications.
Representation and parsing of multiword expressions
This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs such as verbal, adverbial and nominal MWEs, various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew and Norwegian), and various applications (namely MWE detection, parsing and automatic translation) using both symbolic and statistical approaches.
Unsupervised compositionality prediction of nominal compounds
Nominal compounds such as red wine and nut case display a continuum of compositionality, with varying contributions from the components of the compound to its semantics. This article proposes a framework for compound compositionality prediction using distributional semantic models, evaluating to what extent they capture idiomaticity compared to human judgments. For evaluation, we introduce data sets containing human judgments in three languages: English, French, and Portuguese. The results obtained reveal a high agreement between the models and human predictions, suggesting that they are able to incorporate information about idiomaticity. We also present an in-depth evaluation of various factors that can affect prediction, such as model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size in the ability of the model to predict compositionality, and of a uniform combination of the components for best results.
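One common way to turn the idea in this abstract into a score is to compare the vector of the whole compound with a combination of the vectors of its components, using cosine similarity. The sketch below assumes a weighted additive combination, with `beta=0.5` corresponding to a uniform combination; the function names and the three-dimensional vectors are invented for illustration and are not real embeddings or the article's exact setup.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def compositionality(compound_vec, head_vec, modifier_vec, beta=0.5):
    """Cosine between the compound vector and a weighted sum of its parts.

    beta weights the head; beta=0.5 is a uniform combination.
    Hypothetical helper, written only for this sketch.
    """
    combined = [beta * h + (1 - beta) * m
                for h, m in zip(head_vec, modifier_vec)]
    return cosine(compound_vec, combined)

# Made-up toy vectors: "red wine" lies close to its parts, "nut case" does not.
red_wine = compositionality([0.9, 0.8, 0.1],
                            head_vec=[0.8, 0.9, 0.0],      # wine
                            modifier_vec=[0.7, 0.2, 0.1])  # red
nut_case = compositionality([0.0, 0.1, 0.9],
                            head_vec=[0.5, 0.6, 0.2],      # case
                            modifier_vec=[0.8, 0.3, 0.1])  # nut
print(red_wine > nut_case)
```

With real distributional models, the predicted scores would be correlated against the human judgments, for example with a rank correlation, rather than inspected pair by pair.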