
    Probing with Noise: Unpicking the Warp and Weft of Taxonomic and Thematic Meaning Representations in Static and Contextual Embeddings

    The semantic relatedness of words has two key dimensions: it can be based on taxonomic information or on thematic, co-occurrence-based information. These are captured by different language resources (taxonomies and natural corpora), from which we can build different computational meaning representations that reflect these relationships. Vector representations are arguably the most popular meaning representations in NLP, encoding information in a shared multidimensional semantic space and allowing distances between points to reflect the relatedness of the items that populate the space. Improving our understanding of how different types of linguistic information are encoded in vector space can provide valuable insights for model interpretability and can further our understanding of different encoder architectures. Alongside vector dimensions, we argue that information can be encoded in more implicit ways, and we hypothesise that the vector magnitude (the norm) can also carry linguistic information. We develop a method to test this hypothesis and provide a systematic exploration of the role of the vector norm in encoding the different axes of semantic relatedness across a variety of vector representations, including taxonomic, thematic, static and contextual embeddings. The method is an extension of the standard probing framework and allows for relative intrinsic interpretations of probing results. It relies on introducing targeted noise that ablates information encoded in embeddings and is grounded in solid baselines and confidence intervals. We call the method probing with noise and test it at both the word and sentence level, on a host of established linguistic probing tasks as well as two new semantic probing tasks: hypernymy and idiomatic usage detection. Our experiments show that the method can provide geometric insights into embeddings and can demonstrate whether the norm encodes the linguistic information being probed for. This confirms the existence of separate information containers in English word2vec, GloVe and BERT embeddings. The experiments and complementary analyses show that different encoders encode different kinds of linguistic information in the norm: taxonomic vectors store hypernym-hyponym information in the norm, while non-taxonomic vectors do not. Meanwhile, non-taxonomic GloVe embeddings encode syntactic and sentence-length information in the vector norm, while contextual BERT encodes contextual incongruity there. Our method can thus reveal where in the embeddings certain information is contained. Furthermore, it can be supplemented by an array of post-hoc analyses that reveal how that information is encoded, offering valuable structural and geometric insights into the different types of embeddings.
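
    The core procedure can be illustrated with a short, hedged sketch. The code below is not the authors' implementation: it builds toy embeddings whose class label is deliberately stored in the vector norm, ablates norm information by rescaling vectors to unit length (and, conversely, ablates dimension information by substituting a random direction while keeping the original magnitude), and compares the accuracy of a simple probe under each condition. In the real method the comparison is grounded against baselines and confidence intervals; the toy labels here merely make the norm's contribution visible.

        import numpy as np
        from sklearn.neural_network import MLPClassifier
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)

        # Toy embeddings: the class label is deliberately stored in the norm
        # (class-1 vectors are twice as long as class-0 vectors).
        n, d = 2000, 20
        y = rng.integers(0, 2, n)
        directions = rng.standard_normal((n, d))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        X = directions * (1.0 + y)[:, None]

        def ablate_norm(X):
            # Destroy magnitude information: scale every vector to unit length.
            return X / np.linalg.norm(X, axis=1, keepdims=True)

        def ablate_dimensions(X):
            # Destroy directional information: random direction, original norm kept.
            noise = rng.standard_normal(X.shape)
            noise /= np.linalg.norm(noise, axis=1, keepdims=True)
            return noise * np.linalg.norm(X, axis=1, keepdims=True)

        def probe_accuracy(X, y):
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
            probe = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
            return probe.fit(X_tr, y_tr).score(X_te, y_te)

        for name, Xc in [("original", X),
                         ("norm ablated", ablate_norm(X)),
                         ("dimensions ablated", ablate_dimensions(X))]:
            print(f"{name:20s} probe accuracy: {probe_accuracy(Xc, y):.2f}")

        # A large accuracy drop under norm ablation, relative to the original
        # embeddings, indicates that the probed information lives in the norm.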

    Extending persian sentiment lexicon with idiomatic expressions for sentiment analysis

    Nowadays, it is important for buyers to know other customers' opinions in order to make informed decisions when buying a product or service. In addition, companies and organizations can exploit customer opinions to improve their products and services. However, the quintillions of bytes of opinions generated every day cannot be manually read and summarized. Sentiment analysis and opinion mining techniques offer a solution for automatically classifying and summarizing user opinions. However, current sentiment analysis research is mostly focused on English, with far fewer resources available for other languages such as Persian. In our previous work, we developed PerSent, a publicly available sentiment lexicon to facilitate lexicon-based sentiment analysis of texts in the Persian language. However, the PerSent-based sentiment analysis approach fails to classify real-world sentences containing idiomatic expressions. Therefore, in this paper, we describe an extension of the PerSent lexicon with more than 1000 idiomatic expressions, along with their polarity, and propose an algorithm to accurately classify Persian text. Comparative experimental results show the usefulness of the extended lexicon for sentiment analysis compared to PerSent lexicon-based sentiment analysis as well as Persian-to-English translation-based approaches. The extended version of the lexicon will be made publicly available.
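
    A lexicon-based classifier of this kind can be sketched in a few lines. The snippet below is a hedged illustration, not the proposed algorithm: the lexicon entries, idiom polarities and idiom-first matching strategy are all invented for demonstration, and a real system would operate on Persian text with proper tokenisation.

        # Minimal sketch of lexicon-based sentiment scoring extended with idioms.
        # Entries and polarities are invented for illustration.
        word_lexicon = {"good": 1.0, "bad": -1.0, "excellent": 2.0, "terrible": -2.0}
        idiom_lexicon = {"over the moon": 2.0, "down in the dumps": -2.0}

        def classify(text):
            text = text.lower()
            score = 0.0
            # Match idiomatic expressions first, so their non-compositional
            # polarity overrides the polarities of their component words.
            for idiom, polarity in idiom_lexicon.items():
                if idiom in text:
                    score += polarity
                    text = text.replace(idiom, " ")
            # Fall back to word-level lexicon lookup for the remaining tokens.
            score += sum(word_lexicon.get(tok, 0.0) for tok in text.split())
            return "positive" if score > 0 else "negative" if score < 0 else "neutral"

        print(classify("She was over the moon about the results"))  # positive
        print(classify("The service was terrible"))                 # negative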

    Representations of Idioms for Natural Language Processing: Idiom type and token identification, Language Modelling and Neural Machine Translation

    An idiom is a multiword expression (MWE) whose meaning is non-compositional, i.e., the meaning of the expression is different from the meaning of its individual components. Idioms are complex constructions of language used creatively across almost all text genres. Idioms pose problems to natural language processing (NLP) systems due to their non-compositional nature, and the correct processing of idioms can improve a wide range of NLP systems. Current approaches to idiom processing vary in terms of the amount of discourse history required to extract the features necessary to build representations for the expressions. These features are, in general, statistics extracted from the text and often fail to capture all the nuances involved in idiom usage. We argue in this thesis that more flexible representations must be used to process idioms in a range of idiom-related tasks. We demonstrate that high-dimensional representations allow idiom classifiers to better model the interactions between global and local features and thereby improve the performance of these systems with regard to processing idioms. In support of this thesis we demonstrate that distributed representations of sentences, such as those generated by a Recurrent Neural Network (RNN), greatly reduce the amount of discourse history required to process idioms, and that by using those representations a "general" classifier, which can take any expression as input and classify it as either an idiomatic or a literal usage, is feasible. We also propose and evaluate a novel technique to add an attention module to a language model in order to bring forward past information in an RNN-based Language Model (RNN-LM). The results of our evaluation experiments demonstrate that this attention module increases the performance of such models in terms of the perplexity achieved when processing idioms. Our analysis also shows that it improves the performance of RNN-LMs on literal language and, at the same time, helps to bridge long-distance dependencies and reduce the number of parameters required in RNN-LMs to achieve state-of-the-art performance. We investigate the adaptation of this novel RNN-LM to Neural Machine Translation (NMT) systems and show that, despite mixed results, it improves the translation of idioms into languages that require distant reordering, such as German. We also show that these models are suited to small corpora for in-domain translation for language pairs such as English/Brazilian-Portuguese.
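
    As a concrete illustration of such a "general" classifier over distributed sentence representations, the sketch below shows a minimal PyTorch model, under assumed hyperparameters and with a GRU standing in for the RNN variants studied in the thesis: the recurrent encoder compresses the sentence containing a candidate expression into a single vector, which a linear layer maps to literal/idiomatic logits.

        import torch
        import torch.nn as nn

        class UsageClassifier(nn.Module):
            """Encode a sentence with a GRU and classify the candidate
            expression's usage as idiomatic (1) or literal (0)."""
            def __init__(self, vocab_size, emb_dim=100, hidden_dim=256):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
                self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
                self.classify = nn.Linear(hidden_dim, 2)

            def forward(self, token_ids):
                # h_n is the final hidden state: a distributed representation
                # of the whole sentence, no further discourse history needed.
                _, h_n = self.encoder(self.embed(token_ids))
                return self.classify(h_n[-1])  # logits over {literal, idiomatic}

        # Toy usage with made-up token ids for one padded batch of two sentences.
        model = UsageClassifier(vocab_size=5000)
        batch = torch.randint(1, 5000, (2, 12))
        print(model(batch).shape)  # torch.Size([2, 2])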

    A Bigger Fish to Fry: Scaling up the Automatic Understanding of Idiomatic Expressions

    In this thesis, we are concerned with idiomatic expressions and how to handle them within NLP. Idiomatic expressions are a type of multiword phrase whose meaning is not a direct combination of the meanings of its parts, e.g. 'at a crossroads' and 'move the goalposts'. In Part I, we provide a general introduction to idiomatic expressions and an overview of observations regarding idioms based on corpus data. In addition, we discuss existing research on idioms from an NLP perspective, providing an overview of existing tasks, approaches, and datasets. In Part II, we focus on building a large idiom corpus, which involves developing a system for the automatic extraction of potentially idiomatic expressions (PIEs) and building a large corpus of idioms using crowdsourced annotation. Finally, in Part III, we improve an existing unsupervised classifier and compare it to other existing classifiers. Given the relatively poor performance of this unsupervised classifier, we also develop a supervised deep neural network-based system and find that a model involving two separate modules looking at different information sources yields the best performance, surpassing previous state-of-the-art approaches. In conclusion, this work shows the feasibility of building a large corpus of sense-annotated potentially idiomatic expressions, and the benefits such a corpus provides for further research: it makes it possible to quickly test hypotheses about the distribution and usage of idioms, it enables the training of data-hungry machine learning methods for PIE disambiguation systems, and it permits fine-grained, reliable evaluation of such systems.
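
    The extraction step can be illustrated with a simple dictionary-based matcher. This is a hedged sketch rather than the system described above: the idiom list is invented, and the regular-expression handling of inflection and intervening words is a crude stand-in for the lemma- and syntax-aware matching a real extractor would need.

        import re

        # Invented mini-dictionary of idiom lemmas for illustration.
        idioms = ["move the goalposts", "spill the beans", "at a crossroads"]

        def pie_pattern(idiom):
            # Allow simple suffix inflection on each component word and up to
            # two intervening words between components (a crude stand-in for
            # real lemma- and dependency-based matching).
            parts = [re.escape(w) + r"\w*" for w in idiom.split()]
            return re.compile(r"\b" + r"\W+(?:\w+\W+){0,2}?".join(parts), re.I)

        patterns = [(idiom, pie_pattern(idiom)) for idiom in idioms]

        def extract_pies(sentence):
            return [(idiom, m.group(0)) for idiom, p in patterns
                    if (m := p.search(sentence))]

        print(extract_pies("They keep moving the political goalposts."))
        # [('move the goalposts', 'moving the political goalposts')]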

    Function similarity using family context

    Finding changed and similar functions between a pair of binaries is an important problem in malware attribution and in the identification of new malware capabilities. This paper presents a new technique for this problem called Function Similarity using Family Context (FSFC). FSFC trains a Support Vector Machine (SVM) model using pairs of similar functions from two program variants. This method improves upon previous research, Cross Version Contextual Function Similarity (CVCFS), by representing a function using features extracted not just from the function itself but also from other functions with which it has a caller or callee relationship. We present the results of an initial experiment showing that the use of additional features from the context of a function significantly decreases the false positive rate, obviating the need for a separate pass for cleaning false positives. The more surprising and unexpected finding is that the SVM model produced by FSFC can abstract function similarity features from one pair of program variants to find similar functions in an unrelated pair of program variants. If validated by a larger study, this new property opens the possibility of creating generic similar-function classifiers that can be packaged and distributed in reverse engineering tools such as IDA Pro and Ghidra. This research was performed in the Internet Commerce Security Lab (ICSL), a joint venture with research partners Westpac, IBM, and Federation University Australia.
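
    The family-context idea can be sketched as follows. This is a simplified illustration with invented features, not the published FSFC pipeline: each function is a numeric feature vector, caller and callee vectors are averaged and appended to it, and an SVM is trained on the element-wise absolute difference of pair representations to predict similarity.

        import numpy as np
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)

        def with_family_context(func_vec, caller_vecs, callee_vecs, w=0.5):
            # Concatenate a function's own features with down-weighted
            # averages of its caller and callee feature vectors.
            mean = lambda vs: np.mean(vs, axis=0) if vs else np.zeros_like(func_vec)
            return np.concatenate([func_vec, w * mean(caller_vecs), w * mean(callee_vecs)])

        def pair_features(f1, f2):
            # Symmetric pair representation: element-wise absolute difference.
            return np.abs(f1 - f2)

        def random_func(dim=10):
            # Invented function features; neighbours resemble the function.
            v = rng.standard_normal(dim)
            neighbours = [v + rng.normal(0.0, 0.2, dim) for _ in range(2)]
            return with_family_context(v, neighbours, neighbours)

        originals = [random_func() for _ in range(100)]
        variants = [f + rng.normal(0.0, 0.1, f.shape) for f in originals]

        X = [pair_features(a, b) for a, b in zip(originals, variants)]            # similar
        X += [pair_features(a, b) for a, b in zip(originals, reversed(variants))]  # dissimilar
        y = [1] * 100 + [0] * 100

        clf = SVC(kernel="rbf").fit(np.array(X), y)
        print(f"training accuracy: {clf.score(np.array(X), y):.2f}")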

    Computational modeling of lexical ambiguity

    Lexical ambiguity is a frequent phenomenon that can occur not only for words but also at the phrase level. Natural language processing systems need to deal with these ambiguities efficiently in various tasks; however, we often encounter such system failures in real applications. This thesis studies several complex phenomena related to word/phrase ambiguity at the level of text and proposes computational models to tackle these phenomena. Throughout the thesis, we address a number of lexical ambiguity phenomena varying along the line of sense granularity. We start with the idiom detection task, in which candidate senses are constrained to 'literal' and 'idiomatic'. Then, we move on to the more general case of detecting figurative expressions. In this task, target phrases are not lexicalized but rather bear non-literal semantic meanings. Similar to the idiom task, this one has two candidate sense categories ('literal' and 'nonliteral'). Next, we consider a more complicated situation where words often have more than two candidate senses and the sense boundaries are fuzzier, namely word sense disambiguation (WSD). Finally, we discuss another lexical ambiguity problem in which the sense inventory is not explicitly specified, namely word sense induction (WSI). Computationally, we propose novel models that outperform state-of-the-art systems. We start with a supervised model in which we study a number of semantic relatedness features combined with linguistically informed features such as local/global context, part-of-speech tags, syntactic structure, named entities and sentence markers. While experimental results show that the supervised model can effectively detect idiomatic expressions, we further improve on this work by proposing an unsupervised bootstrapping model which does not rely on human-annotated data but performs at a level comparable to the supervised model. Moving on to accommodate other lexical ambiguity phenomena, we propose a Gaussian mixture model that can be used not only for detecting idiomatic expressions but also for automatically extracting unlexicalized figurative expressions from raw corpora. Aiming to model multiple sense disambiguation tasks within a uniform framework, we propose a probabilistic model (a topic model), which encodes human knowledge as sense priors via paraphrases of gold-standard sense inventories, to perform effectively on the idiom task as well as on two WSD tasks. Dealing with WSI, we find that state-of-the-art WSI research is hindered by the deficiencies of evaluation measures that favor either very fine-grained or very coarse-grained cluster output. We argue that the information-theoretic V-Measure is a promising approach to pursue in the future, but that it should be based on more precise entropy estimators, supported by evidence from entropy bias analysis, simulation experiments, and stochastic predictions. We evaluate all our proposed models against state-of-the-art systems on standard test data sets, and we show that our approaches advance the state of the art.
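
    Of the models above, the Gaussian mixture approach lends itself to a generic sketch: represent each occurrence of a target phrase by a feature (here, an assumed one-dimensional phrase-context relatedness score), fit a two-component mixture over unlabeled occurrences, and read the components as the 'literal' and 'nonliteral' clusters. The snippet is an illustration with synthetic data, not the thesis model.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)

        # Toy feature: relatedness between a phrase and its context. Literal
        # usages tend to sit in semantically related contexts (higher values),
        # figurative usages in less related ones (lower values).
        literal = rng.normal(0.6, 0.1, size=300)
        figurative = rng.normal(0.2, 0.1, size=300)
        scores = np.concatenate([literal, figurative]).reshape(-1, 1)

        gmm = GaussianMixture(n_components=2, random_state=0).fit(scores)

        # Map the component with the higher mean to the 'literal' sense.
        literal_comp = int(np.argmax(gmm.means_.ravel()))
        print("P(literal) for a new occurrence with relatedness 0.55:",
              gmm.predict_proba([[0.55]])[0, literal_comp].round(2))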

    Toward Robust and Efficient Interpretations of Idiomatic Expressions in Context

    Studies show that a large number of idioms can be interpreted figuratively or literally depending on their context. This usage ambiguity has negative impacts on many natural language processing (NLP) applications. In this thesis, we investigate methods of building robust and efficient usage recognizers by modeling interactions between contexts and idioms. We aim to address three problems. First, how do differences in idioms' linguistic properties affect the performance of automatic usage recognizers? We analyze the interactions between context representations and the linguistic properties of idioms and develop ensemble models that predict usages adaptively for different idioms. Second, can an automatic usage recognizer be developed without annotated training examples? We develop a method for estimating the semantic distance between the context and the components of an idiom and then use that as distant supervision to guide further unsupervised clustering of usages. Third, how can we build one generalized model that reliably predicts the correct usage for a wide range of idioms, despite variations in their linguistic properties? We recast this as a problem of modeling the semantic compatibility between the literal interpretation of an arbitrary idiom and its context. We show that a general model of semantic compatibility can be trained from a large unannotated corpus, and that the resulting model can be applied to an arbitrary idiom without idiom-specific parameter tuning. To demonstrate that our work can benefit downstream NLP applications, we perform a case study on machine translation, which shows that our model can help improve the translation quality of sentences containing idioms.
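
    The third contribution reduces usage recognition to scoring semantic compatibility between an idiom's literal reading and its context. The snippet below is a minimal sketch of that reduction, not the trained compatibility model: it averages word vectors for the literal phrase and for the context, and thresholds their cosine similarity; the vectors here are randomly initialised and the threshold is an assumption.

        import numpy as np

        def avg_vec(words, vectors, dim=100):
            vs = [vectors[w] for w in words if w in vectors]
            return np.mean(vs, axis=0) if vs else np.zeros(dim)

        def cosine(a, b):
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return float(a @ b / denom) if denom else 0.0

        def usage(idiom_words, context_words, vectors, threshold=0.4):
            # High compatibility between the literal reading of the idiom and
            # its context suggests literal usage; low compatibility, idiomatic.
            sim = cosine(avg_vec(idiom_words, vectors), avg_vec(context_words, vectors))
            return "literal" if sim >= threshold else "idiomatic"

        # Toy 'pre-trained' vectors: with random vectors the outputs are
        # arbitrary; real use requires trained embeddings.
        rng = np.random.default_rng(0)
        vectors = {w: rng.standard_normal(100) for w in
                   ["break", "ice", "frozen", "lake", "party", "conversation"]}
        print(usage(["break", "ice"], ["frozen", "lake"], vectors))
        print(usage(["break", "ice"], ["party", "conversation"], vectors))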

    Extração de combinações lexicais restritas pela deteção da não composicionalidade de expressões pluriverbais

    This article presents an evaluation of a method for extracting restricted lexical combinations from parallel corpora by detecting the non-compositionality of multiword expressions in translation. The method presupposes that where a sequence of words is found whose translation does not follow a simple word-for-word conversion of its component words, a collocation is probably present. Word bigrams are used.
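
    The core test, flagging a bigram as a candidate restricted lexical combination when its aligned translation lacks word-for-word translations of the bigram's components, can be sketched as follows. The bilingual dictionary entries, example sentences and whitespace tokenisation are toy assumptions, not the evaluated system.

        # Toy English-Portuguese word dictionary (illustrative only).
        dictionary = {
            "heavy": {"pesado", "pesada"},
            "rain": {"chuva"},
            "smoker": {"fumador", "fumante"},
        }

        def is_noncompositional(bigram, target_sentence):
            # A bigram is flagged when the aligned target sentence contains
            # word-for-word translations for only some (or none) of its parts.
            target = set(target_sentence.lower().split())
            translated = sum(bool(dictionary.get(w, set()) & target) for w in bigram)
            return translated < len(bigram)

        # 'heavy smoker' -> 'fumador inveterado': 'heavy' has no word-for-word
        # translation in the target, so the bigram is a collocation candidate.
        print(is_noncompositional(("heavy", "smoker"), "ele é um fumador inveterado"))  # True
        print(is_noncompositional(("heavy", "rain"), "a chuva pesada continuou"))       # False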