3 research outputs found

    Probing with Noise: Unpicking the Warp and Weft of Taxonomic and Thematic Meaning Representations in Static and Contextual Embeddings

    The semantic relatedness of words has two key dimensions: it can be based on taxonomic information or on thematic, co-occurrence-based information. These are captured by different language resources (taxonomies and natural corpora), from which we can build different computational meaning representations that reflect these relationships. Vector representations are arguably the most popular meaning representations in NLP, encoding information in a shared multidimensional semantic space and allowing distances between points to reflect relatedness between the items that populate the space. Improving our understanding of how different types of linguistic information are encoded in vector space can provide valuable insights for the field of model interpretability and can further our understanding of different encoder architectures. Alongside vector dimensions, we argue that information can be encoded in more implicit ways and hypothesise that the vector magnitude (the norm) can also carry linguistic information. We develop a method to test this hypothesis and provide a systematic exploration of the role of the vector norm in encoding the different axes of semantic relatedness across a variety of vector representations, including taxonomic, thematic, static and contextual embeddings. The method is an extension of the standard probing framework and allows for relative intrinsic interpretations of probing results. It relies on introducing targeted noise that ablates information encoded in embeddings and is grounded in solid baselines and confidence intervals. We call the method probing with noise and test it at both the word and sentence level, on a host of established linguistic probing tasks as well as two new semantic probing tasks: hypernymy and idiomatic usage detection. Our experiments show that the method can provide geometric insights into embeddings and can demonstrate whether the norm encodes the linguistic information being probed for. This confirms the existence of separate information containers in English word2vec, GloVe and BERT embeddings. The experiments and complementary analyses show that different encoders encode different kinds of linguistic information in the norm: taxonomic vectors store hypernym-hyponym information in the norm, while non-taxonomic vectors do not. Meanwhile, non-taxonomic GloVe embeddings encode syntactic and sentence-length information in the vector norm, while contextual BERT encodes contextual incongruity there. Our method can thus reveal where in an embedding certain information is contained. Furthermore, it can be supplemented by an array of post-hoc analyses that also reveal how the information is encoded, offering valuable structural and geometric insights into the different types of embeddings.
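    The core intuition, that a vector factors into a direction (its dimensions) and a magnitude (its norm), and that noise can ablate one while preserving the other, lends itself to a short illustration. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses numpy and scikit-learn, toy data in place of real embeddings, an illustrative logistic-regression probe, and arbitrary noise ranges. Comparing probe accuracy across the three conditions hints at which container carries the labelled information.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    def ablate_norm(X, rng):
        """Keep each vector's direction, discard its magnitude:
        rescale every row to a randomly drawn norm."""
        unit = X / np.linalg.norm(X, axis=1, keepdims=True)
        return unit * rng.uniform(0.5, 10.0, size=(len(X), 1))

    def ablate_dimensions(X, rng):
        """Keep each vector's magnitude, discard its direction:
        replace every row with random values rescaled to the original norm."""
        noise = rng.standard_normal(X.shape)
        noise /= np.linalg.norm(noise, axis=1, keepdims=True)
        return noise * np.linalg.norm(X, axis=1, keepdims=True)

    def probe_accuracy(X, y, seed=0):
        """Train a simple diagnostic classifier, report held-out accuracy."""
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
        return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

    # Toy stand-ins for real embeddings and probing-task labels: the label
    # here is carried entirely by the sign of dimension 0, not by the norm.
    X = rng.standard_normal((2000, 50))
    y = (X[:, 0] > 0).astype(int)

    for name, X_cond in [("original", X),
                         ("norm ablated", ablate_norm(X, rng)),
                         ("dimensions ablated", ablate_dimensions(X, rng))]:
        print(f"{name:>18}: probe accuracy = {probe_accuracy(X_cond, y):.3f}")
    ```

    In this toy setup, norm ablation leaves probe accuracy roughly intact while dimension ablation collapses it to chance, the pattern one would expect when the probed information lives in the dimensions rather than the norm; the paper's method additionally anchors such comparisons with baselines and confidence intervals.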

    The Ambiguity of Non-Literal Meaning: On the Semantics and Pragmatics of the Non-Literalness Indicators regelrecht and sozusagen in German

    This work investigates the formal semantic analysis of non-literal utterances. To this end, a criterion is first developed by which literal and non-literal readings can be distinguished in corpus studies. In a further step, the lexemes regelrecht and sozusagen are tested in a corpus study for their suitability as indicators of non-literalness. The results show that both regelrecht and sozusagen can predict non-literal utterances. For this reason, the two lexemes are analysed formal-semantically in order to pin down their contribution to composition. The central thesis is that the formal semantic analysis of non-literal utterances should receive more attention than it has so far.