99 research outputs found
Firearms and Tigers are Dangerous, Kitchen Knives and Zebras are Not: Testing whether Word Embeddings Can Tell
This paper presents an approach for investigating the nature of semantic
information captured by word embeddings. We propose a method that extends an
existing human-elicited semantic property dataset with gold negative examples
using crowd judgments. Our experimental approach tests the ability of
supervised classifiers to identify semantic features in word embedding vectors
and compares this to a feature-identification method based on full vector
cosine similarity. The idea behind this method is that properties identified by
classifiers, but not through full vector comparison, are captured by embeddings.
Properties that cannot be identified by either method are not. Our results
provide an initial indication that semantic properties relevant for the way
entities interact (e.g. dangerous) are captured, while perceptual information
(e.g. colors) is not represented. We conclude that, though preliminary, these
results show that our method is suitable for identifying which properties are
captured by embeddings.
Comment: Accepted to the EMNLP workshop "Analyzing and interpreting neural
networks for NLP".
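The contrast described in the abstract above, between a supervised classifier and a comparison of full vectors by cosine similarity, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the three-dimensional toy vectors, the names `cosine_probe` and `perceptron_probe`, the centroid-based form of the cosine comparison, and the choice of a perceptron as the classifier. The toy data is constructed so that both probes succeed; the sketch shows the mechanics only and does not reproduce any result from the paper.

```python
import math

# Hypothetical toy "embeddings": dimension 0 loosely encodes the property
# "dangerous"; dimensions 1 and 2 carry unrelated content.
POS = [[1, 5, 0], [1, 0, 5]]    # e.g. tiger, firearm (property holds)
NEG = [[-1, 5, 0], [-1, 0, 5]]  # e.g. zebra, kitchen knife (property absent)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_probe(pos, neg):
    """Full-vector comparison: is a word closer to the positive centroid?"""
    pos_c, neg_c = centroid(pos), centroid(neg)
    return lambda v: cosine(v, pos_c) > cosine(v, neg_c)


def perceptron_probe(pos, neg, epochs=20):
    """Supervised probe: a plain perceptron trained on labelled vectors."""
    data = [(v, 1) for v in pos] + [(v, -1) for v in neg]
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for v, y in data:
            score = sum(wi * xi for wi, xi in zip(w, v)) + b
            if y * score <= 0:  # misclassified: move the boundary
                w = [wi + y * xi for wi, xi in zip(w, v)]
                b += y
                mistakes += 1
        if mistakes == 0:  # converged on the training data
            break
    return lambda v: sum(wi * xi for wi, xi in zip(w, v)) + b > 0


by_cosine = cosine_probe(POS, NEG)
by_classifier = perceptron_probe(POS, NEG)

# Two unseen toy vectors, one with and one without the property.
for name, vec in [("lion", [1, 4, 1]), ("spoon", [-1, 1, 4])]:
    print(name, by_cosine(vec), by_classifier(vec))
```

In the abstract's framing, a property that the classifier recovers but the full-vector comparison does not would count as captured by the embeddings; a property that neither recovers would not.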
Spring Cleaning and Grammar Compression: Two Techniques for Detection of Redundancy in HPSG Grammars
Dealing with Abbreviations in the Slovenian Biographical Lexicon
Abbreviations present a significant challenge for NLP systems because they
cause tokenization and out-of-vocabulary errors. They can also make the text
less readable, especially in reference printed books, where they are
extensively used. Abbreviations are especially problematic in low-resource
settings, where systems are less robust to begin with. In this paper, we
propose a new method for addressing the problems caused by a high density of
domain-specific abbreviations in a text. We apply this method to the case of a
Slovenian biographical lexicon and evaluate it on a newly developed
gold-standard dataset of 51 Slovenian biographies. Our abbreviation
identification method performs significantly better than commonly used ad-hoc
solutions, especially at identifying unseen abbreviations. We also propose and
present the results of a method for expanding the identified abbreviations in
context.
Comment: To be presented at The 2022 Conference on Empirical Methods in
Natural Language Processing (EMNLP 2022)
Dynamic Top-k Estimation Consolidates Disagreement between Feature Attribution Methods
Feature attribution scores are used for explaining the prediction of a text classifier to users by highlighting k tokens. In this work, we propose a way to determine the optimal number of tokens k that should be displayed from sequential properties of the attribution scores. Our approach is dynamic across sentences, method-agnostic, and deals with sentence length bias. We compare agreement between multiple methods and humans on an NLI task, using fixed k and dynamic k. We find that perturbation-based methods and Vanilla Gradient exhibit the highest agreement on most method-method and method-human agreement metrics with a static k. Their advantage over other methods disappears with dynamic k, which mainly improves Integrated Gradient and GradientXInput. To our knowledge, this is the first evidence that sequential properties of attribution scores are informative for consolidating attribution signals for human interpretation.
Finding Stories in 1,784,532 Events: Scaling Up Computational Models of Narrative
Information professionals face the challenge of making sense of an ever-increasing amount of information. Storylines can provide a useful way to present relevant information because they reveal explanatory relations between events. In this position paper, we present and discuss the four main challenges that make it difficult to get to these stories and our first ideas on how to start resolving them.
Large-scale Cross-lingual Language Resources for Referencing and Framing
In this article, we lay out the basic ideas and principles of the project Framing Situations in the Dutch Language. We provide our first results of data acquisition, together with the first data release. We introduce the notion of cross-lingual referential corpora. These corpora consist of texts that make reference to exactly the same incidents. The referential grounding allows us to analyze the framing of these incidents in different languages and across different texts. During the project, we will use the automatically generated data to study linguistic framing as a phenomenon and to build framing resources such as lexicons and corpora. We expect to capture larger variation in framing compared to traditional approaches for building such resources. Our first data release, which contains structured data about a large number of incidents and reference texts, can be found at http://dutchframenet.nl/data-releases/
A larger-scale evaluation resource of terms and their shift direction for diachronic lexical semantics
Determining how words have changed their meaning is an important topic in Natural Language Processing. However, evaluations of methods to characterise such change have been limited to small, handcrafted resources. We introduce an English evaluation set which is larger, more varied, and more realistic than seen to date, with terms derived from a historical thesaurus. Moreover, the dataset is unique in that it represents change as a shift from the term of interest to a WordNet synset. Using the synset lemmas, we can use this set to evaluate (standard) methods that detect change between word pairs, as well as (adapted) methods that detect the change between a term and a sense overall. We show that performance on the new dataset is much lower than earlier reported findings, setting a new standard.
- …