
    Non-Compositional Term Dependence for Information Retrieval

    Modelling term dependence in IR aims to identify co-occurring terms that are too heavily dependent on each other to be treated as a bag of words, and to adapt the indexing and ranking accordingly. Dependent terms are predominantly identified using lexical frequency statistics, assuming that (a) if terms co-occur often enough in some corpus, they are semantically dependent, and (b) the more often they co-occur, the more semantically dependent they are. This assumption is not always correct: the frequency of co-occurring terms can be unrelated to the strength of their semantic dependence. For example, "red tape" might be less frequent than "tape measure" in some corpus, but this does not mean that "red"+"tape" are less dependent than "tape"+"measure". This is especially the case for non-compositional phrases, i.e. phrases whose meaning cannot be composed from the individual meanings of their terms (such as "red tape" meaning bureaucracy). Motivated by this lack of distinction between the frequency and strength of term dependence in IR, we present a principled approach for handling term dependence in queries, using both lexical frequency and semantic evidence. We focus on non-compositional phrases, extending a recent unsupervised model for their detection [21] to IR. Our approach, integrated into ranking using Markov Random Fields [31], yields effectiveness gains over competitive TREC baselines, showing that there is still room for improvement in the very well-studied area of term dependence in IR.
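    The abstract does not spell out the ranking function, but the idea of feeding detected non-compositional phrases into a Markov Random Field ranker [31] can be illustrated with a minimal sketch in the spirit of the sequential dependence model. Everything below (the Dirichlet smoothing, the weights, the noncomp_boost parameter, and the collection structure) is an assumption made for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def sdm_style_score(query_terms, doc_terms, collection, mu=2500,
                    lambdas=(0.85, 0.15), noncomp_boost=2.0,
                    noncomp_phrases=frozenset()):
    """MRF / sequential-dependence-style document score in which query bigrams
    flagged as non-compositional receive a larger dependence weight.
    `collection` is an assumed dict: {'tf': Counter, 'bi_tf': Counter, 'total': int}.
    """
    doc_tf = Counter(doc_terms)
    doc_bi = Counter(zip(doc_terms, doc_terms[1:]))
    dl = len(doc_terms)

    def log_prob(tf, cf):
        # Dirichlet-smoothed log probability of a term/bigram in the document.
        return math.log((tf + mu * (cf + 1) / (collection['total'] + 1)) / (dl + mu))

    lam_t, lam_o = lambdas
    score = 0.0
    for t in query_terms:                            # unigram (bag-of-words) potentials
        score += lam_t * log_prob(doc_tf[t], collection['tf'][t])
    for bg in zip(query_terms, query_terms[1:]):     # ordered bigram potentials
        weight = lam_o * (noncomp_boost if bg in noncomp_phrases else 1.0)
        score += weight * log_prob(doc_bi[bg], collection['bi_tf'][bg])
    return score
```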

    A Study of Metrics of Distance and Correlation Between Ranked Lists for Compositionality Detection

    Compositionality in language refers to how much the meaning of a phrase can be decomposed into the meanings of its constituents and the way these constituents are combined. Based on the premise that substitution by synonyms is meaning-preserving, compositionality can be approximated as the semantic similarity between a phrase and a version of that phrase where words have been replaced by their synonyms. Different ways of representing such phrases exist (e.g., vectors [1] or language models [2]), and the choice of representation affects the measurement of semantic similarity. We propose a new compositionality detection method that represents phrases as ranked lists of term weights. Our method approximates the semantic similarity between two ranked list representations using a range of well-known distance and correlation metrics. In contrast to most state-of-the-art approaches to compositionality detection, our method is completely unsupervised. Experiments with a publicly available dataset of 1048 human-annotated phrases show that, compared to strong supervised baselines, our approach provides superior measurement of compositionality using any of the distance and correlation metrics considered.
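    As a rough sketch of the idea: assuming some association function (e.g. PMI against a fixed vocabulary, not specified in the abstract) that turns a phrase into a ranked list of term weights, the agreement between the original phrase and its synonym-substituted variant can be measured with standard correlation and distance metrics. The function names and weighting scheme below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def ranked_list(phrase, association, vocab):
    """Represent a phrase as a vector of association weights (e.g. PMI or
    co-occurrence frequency) over a fixed vocabulary. `association` is an
    assumed callable: association(phrase, term) -> float."""
    return np.array([association(phrase, term) for term in vocab])

def compositionality_scores(phrase, synonym_variant, association, vocab):
    """Higher agreement between the two ranked lists suggests the phrase is
    more compositional; low agreement suggests non-compositionality."""
    a = ranked_list(phrase, association, vocab)
    b = ranked_list(synonym_variant, association, vocab)
    rho, _ = spearmanr(a, b)                  # rank correlation
    tau, _ = kendalltau(a, b)                 # alternative rank correlation
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return {'spearman': rho, 'kendall': tau, 'cosine': cosine}
```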

    Towards Phrase-based Unsupervised Machine Translation: Phrase Representations

    Existing unsupervised machine translation models achieve promising results but still fall short of supervised approaches. This work takes a step towards a new research direction: phrase-based unsupervised machine translation. Since current word-based models rely on representations of words, phrase-based models require appropriate phrase representations. These representations should be learned without supervision, address phrase-specific issues such as multiword expressions, and their embedding space has to satisfy certain requirements for unsupervised translation to perform reasonably. We specify what makes phrase representations effective in terms of unsupervised machine translation, define an unsupervised compositional modelling framework for phrases, and show how to use this framework to satisfy the proposed requirements, thus obtaining effective representations for phrases. We make the code and trained models publicly available as an open source project.
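    A minimal sketch of the kind of compositional phrase representation the thesis discusses: compose a phrase vector from pre-trained word vectors, but treat phrases known to be non-compositional multiword expressions as atomic units with their own vector. The lookup dictionaries and the averaging scheme are assumptions of this sketch, not the released code.

```python
import numpy as np

def phrase_embedding(phrase_tokens, word_vectors, mwe_vectors):
    """Compose a phrase vector from its word vectors, falling back to a
    dedicated vector when the phrase is a known non-compositional MWE.
    word_vectors / mwe_vectors: assumed dict[str, np.ndarray] lookups."""
    key = '_'.join(phrase_tokens)
    if key in mwe_vectors:
        # Non-compositional phrase: represent it as a single atomic unit.
        return mwe_vectors[key]
    vectors = [word_vectors[t] for t in phrase_tokens if t in word_vectors]
    if not vectors:
        raise KeyError(f'no vectors available for phrase: {phrase_tokens}')
    # Simple compositional fallback: average of the word vectors.
    return np.mean(vectors, axis=0)
```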

    Handling non-compositionality in multilingual CNLs

    In this paper, we describe methods for handling multilingual non-compositional constructions in the framework of GF. We specifically look at methods to detect and extract non-compositional phrases from parallel texts and propose methods to handle such constructions in GF grammars. We expect that these methods will enrich CNLs by providing more flexibility in the design of controlled languages. We look at two specific use cases of non-compositional constructions: a general-purpose method to detect and extract multilingual multiword expressions, and a procedure to identify nominal compounds in German. We evaluate our procedure for multiword expressions by performing a qualitative analysis of the results. For the experiments on nominal compounds, we incorporate the detected compounds in a full SMT pipeline and evaluate the impact of our method on the machine translation process. Comment: CNL workshop in COLING 201
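    The GF-specific extraction procedure is not reproduced here, but the general flavour of frequency-based multiword-expression candidate extraction that such work builds on can be sketched as a PMI filter over bigrams; the thresholds and tokenisation below are assumptions, not the paper's method.

```python
import math
from collections import Counter

def extract_mwe_candidates(sentences, min_count=5, pmi_threshold=3.0):
    """Generic PMI-based multiword-expression candidate extraction over
    tokenised sentences (an illustrative sketch, not the GF procedure)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    candidates = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        pmi = math.log((count * total) / (unigrams[w1] * unigrams[w2]))
        if pmi >= pmi_threshold:
            candidates.append(((w1, w2), pmi))
    return sorted(candidates, key=lambda item: -item[1])
```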

    Automatic Detection of Estonian Particle Verbs with Linguistic and Statistical Methods (Eesti keele ühendverbide automaattuvastus lingvistiliste ja statistiliste meetoditega)

    Nowadays, applications that process human languages (including Estonian) are part of everyday life. However, computers are not yet able to understand every nuance of language. Machine translation is probably the most well-known application of natural language processing. Occasionally, the worst failures of machine translation systems (e.g. Google Translate) are shared on social media. Most such cases happen when sequences longer than single words are translated. For example, translation systems are not able to catch the correct meaning of the particle verb alt minema (literally 'to go from under', idiomatically 'to get deceived') in the sentence "Ta läks lepinguga alt", because a literal, word-by-word translation of the expression's components is not correct. In order to improve the quality of machine translation systems and other useful applications, e.g. spam detection or question answering systems, such (idiomatic) multi-word expressions and their meanings must be detected reliably. The detection of multi-word expressions and their meaning is important in all languages and has therefore received much attention in computational linguistics, especially for English. However, the suggested methods have not previously been applied to the detection of Estonian multi-word expressions.
    The dissertation fills that gap and applies well-known machine learning methods to detect one type of Estonian multi-word expression, the particle verb. Based on large textual data, the thesis demonstrates that the traditional binary division of Estonian particle verbs into non-compositional (ainukordne, the meaning is not predictable from the meanings of the components) and compositional (korrapärane, the meaning is predictable from the meanings of the components) is not comprehensive enough. The research confirms the widely adopted view in computational linguistics that multi-word expressions form a continuum between compositional and non-compositional units. Moreover, it is shown that in addition to context, several linguistic features, e.g. the animacy and case of the subject and object, help computers predict whether the meaning of a particle verb in a sentence is compositional or non-compositional. In addition, the research introduces novel resources for Estonian: the trained embeddings and the compositionality datasets created in the thesis are available for future research. https://www.ester.ee/record=b5252157~S
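    As a sketch of how the linguistic features mentioned above (animacy and case of the subject and object) can be combined with contextual evidence in a standard supervised classifier: the feature layout and the choice of logistic regression are assumptions for illustration, not the exact models of the dissertation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(context_vector, subj_animate, obj_animate,
              subj_case_id, obj_case_id, n_cases=14):
    """One particle-verb occurrence: a context embedding concatenated with
    animacy flags and one-hot case features (Estonian has 14 grammatical
    cases; the overall layout is an illustrative assumption)."""
    subj_case = np.eye(n_cases)[subj_case_id]
    obj_case = np.eye(n_cases)[obj_case_id]
    return np.concatenate([context_vector,
                           [float(subj_animate), float(obj_animate)],
                           subj_case, obj_case])

def train_compositionality_classifier(X, y):
    """X: stacked feature vectors; y: 1 = non-compositional use, 0 = compositional."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf
```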

    Contextual compositionality detection with external knowledge bases and word embeddings

    When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog), that phrase is said to be non-compositional. Automatic compositionality detection in multiword phrases is critical in any application of semantic processing, such as search engines [9]; failing to detect non-compositional phrases can hurt system effectiveness notably. Existing research treats phrases as either compositional or non-compositional in a deterministic manner. In this paper, we operationalize the viewpoint that compositionality is contextual rather than deterministic, i.e., that whether a phrase is compositional or non-compositional depends on its context. For example, the phrase "green card" is compositional when referring to a green colored card, whereas it is non-compositional when meaning permanent residence authorization. We address the challenge of detecting this type of contextual compositionality as follows: given a multi-word phrase, we enrich the word embedding representing its semantics with evidence about its global context (terms it often collocates with) as well as its local context (narratives where that phrase is used, which we call usage scenarios). We further extend this representation with information extracted from external knowledge bases. The resulting representation incorporates both localized context and more general usage of the phrase and allows us to detect its compositionality in a non-deterministic and contextual way. Empirical evaluation of our model on a dataset of phrase compositionality, manually collected by crowdsourcing contextual compositionality assessments, shows that our model outperforms state-of-the-art baselines notably on detecting phrase compositionality.
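    A minimal sketch of the representation described above: enrich the phrase embedding with a local-context vector (the usage scenario), a global-context vector (frequent collocates), and optionally a knowledge-base vector, then let a downstream classifier judge compositionality per occurrence. The concatenation scheme and function names are assumptions, not the paper's exact architecture.

```python
import numpy as np

def contextual_phrase_representation(phrase_vec, usage_sentence_vecs,
                                     collocate_vecs, kb_vec=None):
    """Combine phrase semantics with local context (embeddings of the sentence
    in which the phrase occurs), global context (embeddings of its frequent
    collocates) and optional external knowledge-base evidence."""
    local_ctx = np.mean(usage_sentence_vecs, axis=0)    # usage scenario
    global_ctx = np.mean(collocate_vecs, axis=0)        # typical collocates
    parts = [phrase_vec, local_ctx, global_ctx]
    if kb_vec is not None:
        parts.append(kb_vec)                            # external KB evidence
    return np.concatenate(parts)

# A per-occurrence classifier over this representation can then decide whether
# the phrase is used compositionally in that particular context.
```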