193 research outputs found

    Eesti keele ühendverbide automaattuvastus lingvistiliste ja statistiliste meetoditega

    Get PDF
    Tänapäeval on inimkeeli (kaasa arvatud eesti keelt) töötlevad tehnoloogiaseadmed igapäevaelu osa, kuid arvutite „keeleoskus“ pole kaugeltki täiuslik. Keele automaattöötluse kõige rohkem kasutust leidev rakendus on ilmselt masintõlge. Ikka ja jälle jagatakse sotsiaalmeedias, kuidas tuntud süsteemid (näiteks Google Translate) midagi valesti tõlgivad. Enamasti tekitavad absurdse olukorra mitmest sõnast koosnevad fraasid või laused. Näiteks ei suuda tõlkesüsteemid tabada lauses „Ta läks lepinguga alt“ ühendi alt minema tähendust petta saama, sest õige tähenduse edastamiseks ei saa selle ühendi komponente sõna-sõnalt tõlkida ja seetõttu satubki arvuti hätta. Selleks et nii masintõlkesüsteemide kui ka teiste kasulike rakenduste nagu libauudiste tuvastuse või küsimus-vastus süsteemide kvaliteet paraneks, on oluline, et arvuti oskaks tuvastada mitmesõnalisi üksuseid ja nende eri tähendusi, mida inimesed konteksti põhjal üpriski lihtalt teha suudavad. Püsiühendite (tähenduse) automaattuvastus on oluline kõikides keeltes ja on seetõttu pälvinud arvutilingvistikas rohkelt tähelepanu. Seega on eriti inglise keele põhjal välja pakutud terve hulk meetodeid, mida pole siiamaani eesti keele püsiühendite tuvastamiseks rakendatud. Doktoritöös kasutataksegi masinõppe meetodeid, mis on teiste keelte püsiühendite tuvastamisel edukad olnud, üht liiki eesti keele püsiühendi – ühendverbi – automaatseks tuvastamiseks. Töös demonstreeritakse suurte tekstiandmete põhjal, et seni eesti keele traditsioonilises käsitluses esitatud eesti keele ühendverbide jaotus ainukordseteks (ühendi komponentide koosesinemisel tekib uus tähendus) ja korrapärasteks (ühendi tähendus on tema komponentide summa) ei ole piisavalt põhjalik. Nimelt kinnitab töö arvutilingvistilistes uurimustes laialt levinud arusaama, et püsiühendid (k.a ühendverbid) jaotuvad skaalale, mille ühes otsas on ühendid, mille tähendus on selgelt komponentide tähenduste summa. ja teises need ühendid, mis saavad uue tähenduse. Uurimus näitab, et lisaks kontekstile aitavad arvutil tuvastada ühendverbi õiget tähendust mitmed teised tunnuseid, näiteks subjekti ja objekti elusus ja käänded. Doktoritöö raames valminud andmestikud ja vektoresitused on vajalikud uued ressursid, mis on avalikud edaspidisteks uurimusteks.Nowadays, applications that process human languages (including Estonian) are part of everyday life. However, computers are not yet able to understand every nuance of language. Machine translation is probably the most well-known application of natural language processing. Occasionally, the worst failures of machine translation systems (e.g. Google Translate) are shared on social media. Most of such cases happen when sequences longer than words are translated. For example, translation systems are not able to catch the correct meaning of the particle verb alt (‘from under’) minema (‘to go’) (‘to get deceived’) in the sentence Ta läks lepinguga alt because the literal translation of the components of the expression is not correct. In order to improve the quality of machine translation systems and other useful applications, e.g. spam detection or question answering systems, such (idiomatic) multi-word expressions and their meanings must be well detected. The detection of multi-word expressions and their meaning is important in all languages and therefore much research has been done in the field, especially in English. However, the suggested methods have not been applied to the detection of Estonian multi-word expressions before. The dissertation fills that gap and applies well-known machine learning methods to detect one type of Estonian multi-word expressions – the particle verbs. Based on large textual data, the thesis demonstrates that the traditional binary division of Estonian particle verbs to non-compositional (ainukordne, meaning is not predictable from the meaning of its components) and compositional (korrapärane, meaning is predictable from the meaning of its components) is not comprehensive enough. The research confirms the widely adopted view in computational linguistics that the multi-word expressions form a continuum between the compositional and non-compositional units. Moreover, it is shown that in addition to context, there are some linguistic features, e.g. the animacy and cases of subject and object that help computers to predict whether the meaning of a particle verb in a sentence is compositional or non-compositional. In addition, the research introduces novel resources for Estonian language – trained embeddings and created compositionality datasets are available for the future research.https://www.ester.ee/record=b5252157~S

    Discovering multiword expressions

    Get PDF
    In this paper, we provide an overview of research on multiword expressions (MWEs), from a natural lan- guage processing perspective. We examine methods developed for modelling MWEs that capture some of their linguistic properties, discussing their use for MWE discovery and for idiomaticity detection. We con- centrate on their collocational and contextual preferences, along with their fixedness in terms of canonical forms and their lack of word-for-word translatatibility. We also discuss a sample of the MWE resources that have been used in intrinsic evaluation setups for these methods

    Multiword expressions at length and in depth

    Get PDF
    The annual workshop on multiword expressions takes place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point point of reference for future work

    Uvid u automatsko izlučivanje metaforičkih kolokacija

    Get PDF
    Collocations have been the subject of much scientific research over the years. The focus of this research is on a subset of collocations, namely metaphorical collocations. In metaphorical collocations, a semantic shift has taken place in one of the components, i.e., one of the components takes on a transferred meaning. The main goal of this paper is to review the existing literature and provide a systematic overview of the existing research on collocation extraction, as well as the overview of existing methods, measures, and resources. The existing research is classified according to the approach (statistical, hybrid, and distributional semantics) and presented in three separate sections. The insights gained from existing research serve as a first step in exploring the possibility of developing a method for automatic extraction of metaphorical collocations. The methods, tools, and resources that may prove useful for future work are highlighted.Kolokacije su već dugi niz godina tema mnogih znanstvenih istraživanja. U fokusu ovoga istraživanja podskupina je kolokacija koju čine metaforičke kolokacije. Kod metaforičkih je kolokacija kod jedne od sastavnica došlo do semantičkoga pomaka, tj. jedna od sastavnica poprima preneseno značenje. Glavni su ciljevi ovoga rada istražiti postojeću literaturu te dati sustavan pregled postojećih istraživanja na temu izlučivanja kolokacija i postojećih metoda, mjera i resursa. Postojeća istraživanja opisana su i klasificirana prema različitim pristupima (statistički, hibridni i zasnovani na distribucijskoj semantici). Također su opisane različite asocijativne mjere i postojeći načini procjene rezultata automatskoga izlučivanja kolokacija. Metode, alati i resursi koji su korišteni u prethodnim istraživanjima, a mogli bi biti korisni za naš budući rad posebno su istaknuti. Stečeni uvidi u postojeća istraživanja čine prvi korak u razmatranju mogućnosti razvijanja postupka za automatsko izlučivanje metaforičkih kolokacija

    The role of constituents in multiword expressions

    Get PDF
    Multiword expressions (MWEs), such as noun compounds (e.g. nickname in English, and Ohrwurm in German), complex verbs (e.g. give up in English, and aufgeben in German) and idioms (e.g. break the ice in English, and das Eis brechen in German), may be interpreted literally but often undergo meaning shifts with respect to their constituents. Theoretical, psycholinguistic as well as computational linguistic research remain puzzled by when and how MWEs receive literal vs. meaning-shifted interpretations, what the contributions of the MWE constituents are to the degree of semantic transparency (i.e., meaning compositionality) of the MWE, and how literal vs. meaning-shifted MWEs are processed and computed. This edited volume presents an interdisciplinary selection of seven papers on recent findings across linguistic, psycholinguistic, corpus-based and computational research fields and perspectives, discussing the interaction of constituent properties and MWE meanings, and how MWE constituents contribute to the processing and representation of MWEs. The collection is based on a workshop at the 2017 annual conference of the German Linguistic Society (DGfS) that took place at Saarland University in Saarbrücken, Germany

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    Get PDF
    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena

    An interdisciplinary, cross-lingual perspective

    Get PDF
    Multiword expressions (MWEs), such as noun compounds (e.g. nickname in English, and Ohrwurm in German), complex verbs (e.g. give up in English, and aufgeben in German) and idioms (e.g. break the ice in English, and das Eis brechen in German), may be interpreted literally but often undergo meaning shifts with respect to their constituents. Theoretical, psycholinguistic as well as computational linguistic research remain puzzled by when and how MWEs receive literal vs. meaning-shifted interpretations, what the contributions of the MWE constituents are to the degree of semantic transparency (i.e., meaning compositionality) of the MWE, and how literal vs. meaning-shifted MWEs are processed and computed. This edited volume presents an interdisciplinary selection of seven papers on recent findings across linguistic, psycholinguistic, corpus-based and computational research fields and perspectives, discussing the interaction of constituent properties and MWE meanings, and how MWE constituents contribute to the processing and representation of MWEs. The collection is based on a workshop at the 2017 annual conference of the German Linguistic Society (DGfS) that took place at Saarland University in Saarbrücken, German

    The role of constituents in multiword expressions

    Get PDF
    Multiword expressions (MWEs), such as noun compounds (e.g. nickname in English, and Ohrwurm in German), complex verbs (e.g. give up in English, and aufgeben in German) and idioms (e.g. break the ice in English, and das Eis brechen in German), may be interpreted literally but often undergo meaning shifts with respect to their constituents. Theoretical, psycholinguistic as well as computational linguistic research remain puzzled by when and how MWEs receive literal vs. meaning-shifted interpretations, what the contributions of the MWE constituents are to the degree of semantic transparency (i.e., meaning compositionality) of the MWE, and how literal vs. meaning-shifted MWEs are processed and computed. This edited volume presents an interdisciplinary selection of seven papers on recent findings across linguistic, psycholinguistic, corpus-based and computational research fields and perspectives, discussing the interaction of constituent properties and MWE meanings, and how MWE constituents contribute to the processing and representation of MWEs. The collection is based on a workshop at the 2017 annual conference of the German Linguistic Society (DGfS) that took place at Saarland University in Saarbrücken, Germany
    corecore