
    Unsupervised compositionality prediction of nominal compounds

    Nominal compounds such as red wine and nut case display a continuum of compositionality, with varying contributions from the components of the compound to its semantics. This article proposes a framework for compound compositionality prediction using distributional semantic models, evaluating to what extent they capture idiomaticity compared to human judgments. For evaluation, we introduce data sets containing human judgments in three languages: English, French, and Portuguese. The results reveal high agreement between the model predictions and human judgments, suggesting that the models are able to incorporate information about idiomaticity. We also present an in-depth evaluation of various factors that can affect prediction, such as model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size on the ability of the model to predict compositionality, and show that a uniform combination of the components gives the best results.
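
    The idea in the abstract above can be illustrated with a minimal sketch. It assumes one common reading of "compositionality prediction": the score is the cosine similarity between the vector of the whole compound and a uniform (equally weighted) combination of its normalized component vectors. The function names, the `beta` parameter, and the three-dimensional toy vectors are hypothetical and chosen only for illustration; they are not taken from the article, and real vectors would come from a trained distributional model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def normalize(v):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm else list(v)


def compositionality_score(compound_vec, first_vec, second_vec, beta=0.5):
    """Compare the compound vector with a weighted sum of its component vectors.

    beta=0.5 weights both components equally (the "uniform combination"
    assumed in this sketch). Higher scores suggest a more compositional compound.
    """
    first = normalize(first_vec)
    second = normalize(second_vec)
    composed = [beta * a + (1.0 - beta) * b for a, b in zip(first, second)]
    return cosine(compound_vec, composed)


# Invented toy vectors, for illustration only.
toy_vectors = {
    "red_wine": [0.9, 0.8, 0.1],
    "red": [1.0, 0.2, 0.0],
    "wine": [0.3, 1.0, 0.1],
    "nut_case": [0.0, 0.1, 1.0],
    "nut": [1.0, 0.1, 0.0],
    "case": [0.1, 1.0, 0.1],
}

red_wine_score = compositionality_score(
    toy_vectors["red_wine"], toy_vectors["red"], toy_vectors["wine"]
)
nut_case_score = compositionality_score(
    toy_vectors["nut_case"], toy_vectors["nut"], toy_vectors["case"]
)
print(f"red wine: {red_wine_score:.3f}")
print(f"nut case: {nut_case_score:.3f}")
```

    With these toy vectors the compositional compound (red wine) scores higher than the idiomatic one (nut case). In an evaluation like the one the abstract describes, such scores would be compared against human compositionality judgments, for example with a rank correlation.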

    Discovering multiword expressions

    In this paper, we provide an overview of research on multiword expressions (MWEs) from a natural language processing perspective. We examine methods developed for modelling MWEs that capture some of their linguistic properties, discussing their use for MWE discovery and for idiomaticity detection. We concentrate on their collocational and contextual preferences, along with their fixedness in terms of canonical forms and their lack of word-for-word translatability. We also discuss a sample of the MWE resources that have been used in intrinsic evaluation setups for these methods.

    Transfer and Multi-Task Learning for Noun-Noun Compound Interpretation

    In this paper, we empirically evaluate the utility of transfer and multi-task learning on a challenging semantic classification task: semantic interpretation of noun-noun compounds. Through a comprehensive series of experiments and in-depth error analysis, we show that transfer learning via parameter initialization and multi-task learning via parameter sharing can help a neural classification model generalize over a highly skewed distribution of relations. Further, we demonstrate how dual annotation with two distinct sets of relations over the same set of compounds can be exploited to improve the overall accuracy of a neural classifier and its F1 scores on the less frequent, but more difficult relations. Comment: EMNLP 2018: Conference on Empirical Methods in Natural Language Processing (EMNLP

    Automatic detection of Estonian particle verbs with linguistic and statistical methods (Eesti keele ühendverbide automaattuvastus lingvistiliste ja statistiliste meetoditega)

    Nowadays, applications that process human languages (including Estonian) are part of everyday life. However, computers are not yet able to understand every nuance of language. Machine translation is probably the best-known application of natural language processing. Occasionally, the worst failures of machine translation systems (e.g. Google Translate) are shared on social media. Most such cases arise when multi-word sequences are translated. For example, translation systems fail to capture the meaning of the particle verb alt minema (alt ‘from under’ + minema ‘to go’, meaning ‘to be deceived’) in the sentence Ta läks lepinguga alt, because translating the components of the expression literally does not give the correct meaning. To improve the quality of machine translation systems and other useful applications, e.g. spam detection or question answering systems, such (idiomatic) multi-word expressions and their meanings must be detected reliably. The detection of multi-word expressions and their meanings is important in all languages, and much research has therefore been done in the field, especially for English. However, the suggested methods have not previously been applied to the detection of Estonian multi-word expressions. The dissertation fills that gap and applies well-known machine learning methods to detect one type of Estonian multi-word expression: the particle verb.
Based on large amounts of textual data, the thesis demonstrates that the traditional binary division of Estonian particle verbs into non-compositional (ainukordne: the meaning is not predictable from the meanings of the components) and compositional (korrapärane: the meaning is predictable from the meanings of the components) is not comprehensive enough. The research confirms the view, widely adopted in computational linguistics, that multi-word expressions form a continuum between compositional and non-compositional units. Moreover, it shows that, in addition to context, some linguistic features, e.g. the animacy and case of the subject and object, help computers predict whether the meaning of a particle verb in a sentence is compositional or non-compositional. In addition, the research introduces novel resources for the Estonian language: the trained embeddings and the compositionality datasets are available for future research. https://www.ester.ee/record=b5252157~S