Mimicking Word Embeddings using Subword RNNs
Word embeddings improve generalization over lexical features by placing each
word in a lower-dimensional space, using distributional information obtained
from unlabeled data. However, the effectiveness of word embeddings for
downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which
embeddings do not exist. In this paper, we present MIMICK, an approach to
generating OOV word embeddings compositionally, by learning a function from
spellings to distributional embeddings. Unlike prior work, MIMICK does not
require re-training on the original word embedding corpus; instead, learning is
performed at the type level. Intrinsic and extrinsic evaluations demonstrate
the power of this simple approach. On 23 languages, MIMICK improves performance
over a word-based baseline for tagging part-of-speech and morphosyntactic
attributes. It is competitive with (and complementary to) a supervised
character-based model in low-resource settings.
Comment: EMNLP 201
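The type-level idea above — learning a function from a word's spelling to its distributional embedding, then applying it to OOV words — can be sketched in a toy form. The paper uses a character RNN; the sketch below substitutes a character-bigram bag model trained by gradient descent, with fabricated embeddings, purely for illustration:

```python
# Minimal, hypothetical sketch of the MIMICK idea: train a spelling-to-embedding
# function at the type level (iterating over vocabulary entries, not a corpus),
# then mimic embeddings for unseen words. All data here is made up.
import random

random.seed(0)
DIM = 4

# Pretend pre-trained embeddings (the "target" distributional vectors).
pretrained = {
    "cat":  [1.0, 0.0, 0.0, 0.0],
    "cats": [0.9, 0.1, 0.0, 0.0],
    "dog":  [0.0, 1.0, 0.0, 0.0],
    "dogs": [0.0, 0.9, 0.1, 0.0],
}

def bigrams(word):
    w = f"^{word}$"  # boundary markers
    return [w[i:i + 2] for i in range(len(w) - 1)]

# One learned vector per character bigram; the prediction is their mean.
table = {}
def predict(word):
    vecs = [table.setdefault(b, [0.0] * DIM) for b in bigrams(word)]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(DIM)]

# Type-level training: one squared-error update per vocabulary entry.
lr = 0.5
for _ in range(500):
    for word, target in pretrained.items():
        pred = predict(word)
        grams = bigrams(word)
        for b in grams:
            for d in range(DIM):
                table[b][d] -= lr * 2 * (pred[d] - target[d]) / len(grams)

# Mimic an embedding for an OOV word whose spelling resembles "cats".
oov = predict("catss")
```

After training, the OOV word's predicted vector leans toward the "cat"/"cats" region of the space because it shares most of its character bigrams with those types.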
Analyzing Cognitive Plausibility of Subword Tokenization
Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze the correlation of the tokenizer output with the response time and accuracy of human performance on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and worse coverage of derivational morphemes, in contrast with prior work.
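The evaluation idea — correlating a tokenizer's output statistics with human lexical-decision measurements — can be sketched as follows. The tokenizer, vocabulary, and response times below are all fabricated stand-ins for illustration only:

```python
# Toy sketch: correlate per-word subword counts with (made-up) human
# lexical-decision response times, using a greedy longest-match tokenizer.
def tokenize(word, vocab):
    """Greedy longest-match subword tokenization over a toy vocabulary;
    falls back to single characters for unknown material."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

vocab = {"un", "believ", "able", "do", "ing", "cat"}
# (word, mean response time in ms) -- fabricated illustration only.
trials = [("cat", 420.0), ("doing", 450.0),
          ("unbelievable", 560.0), ("zyxw", 700.0)]
counts = [len(tokenize(w, vocab)) for w, _ in trials]
rts = [rt for _, rt in trials]
r = pearson(counts, rts)
```

In this toy data, words that fragment into more subwords also take humans longer to judge, so the correlation is strongly positive; the paper's contribution is running this kind of analysis with real tokenizers and real psycholinguistic data.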
BiVert: Bidirectional Vocabulary Evaluation using Relations for Machine Translation
Neural machine translation (NMT) has progressed rapidly in the past few
years, promising improvements and quality translations for different languages.
Evaluation of this task is crucial to determine the quality of the translation.
Overall, traditional methods place insufficient emphasis on the actual sense
of the translation. We propose a bidirectional semantic-based evaluation
method designed to assess the sense distance of the translation from the source
text. This approach employs the comprehensive multilingual encyclopedic
dictionary BabelNet. By calculating the semantic distance between the source
and the back-translation of the output, our method introduces a quantifiable
approach that enables sentence comparison on the same linguistic level.
Factual analysis shows a strong correlation between the average
evaluation scores generated by our method and the human assessments across
various machine translation systems for the English-German language pair. Finally,
our method proposes a new multilingual approach to rank MT systems without the
need for parallel corpora.
Comment: LREC-COLING 202
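The core scoring idea — measuring sense distance between a source sentence and the back-translation of the system output — can be sketched in miniature. A toy synonym map stands in for BabelNet, and the position-wise word alignment is a simplifying assumption of this sketch, not the paper's method:

```python
# Toy sketch of back-translation-based semantic evaluation. A small synonym
# dictionary substitutes for BabelNet; alignment is naive position-wise.
synonyms = {
    "quick": {"fast", "rapid", "quick"},
    "fast": {"fast", "rapid", "quick"},
    "car": {"car", "automobile"},
    "automobile": {"car", "automobile"},
}

def senses(word):
    # Unknown words get a singleton sense set containing themselves.
    return synonyms.get(word, {word})

def sense_distance(source, back_translation):
    """1 minus the fraction of position-aligned word pairs whose sense
    sets overlap; 0.0 means semantically identical under the toy lexicon."""
    pairs = list(zip(source.split(), back_translation.split()))
    if not pairs:
        return 1.0
    overlap = sum(bool(senses(a) & senses(b)) for a, b in pairs) / len(pairs)
    return 1.0 - overlap

good = sense_distance("the quick car", "the fast automobile")
bad = sense_distance("the quick car", "a slow bicycle")
```

A faithful back-translation scores a small distance even when no word matches the source verbatim, which is the property that lets the method rank MT systems without parallel corpora.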
Predicting Semantic Relations using Global Graph Properties
Semantic graphs, such as WordNet, are resources which curate natural language
on two distinguishable layers. On the local level, individual relations between
synsets (semantic building blocks) such as hypernymy and meronymy enhance our
understanding of the words used to express their meanings. Globally, analysis
of graph-theoretic properties of the entire net sheds light on the structure of
human language as a whole. In this paper, we combine global and local
properties of semantic graphs through the framework of Max-Margin Markov Graph
Models (M3GM), a novel extension of the Exponential Random Graph Model (ERGM)
that
scales to large multi-relational graphs. We demonstrate how such global
modeling improves performance on the local task of predicting semantic
relations between synsets, yielding new state-of-the-art results on the WN18RR
dataset, a challenging version of WordNet link prediction in which "easy"
reciprocal cases are removed. In addition, the M3GM model identifies
multi-relational motifs that are characteristic of well-formed lexical semantic
ontologies.
Comment: EMNLP 201
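The combination of local and global signals can be sketched in an extremely reduced form: score a candidate edge with a local term plus a weight on the change in a global ERGM-style motif count. This is an illustrative caricature with a single hand-picked feature, not the authors' model:

```python
# Illustrative sketch of ERGM-style link scoring: a local association score
# plus a weighted change in a global motif count (here, transitive triads
# a -> b -> c with a -> c, a rough stand-in for hypernym-chain structure).
from itertools import permutations

def transitive_triads(edges):
    e = set(edges)
    nodes = {x for pair in e for x in pair}
    return sum(1 for a, b, c in permutations(nodes, 3)
               if (a, b) in e and (b, c) in e and (a, c) in e)

def score(edges, candidate, local_score, w_global=0.1):
    """Local score plus the weighted change in the global motif count
    that adding the candidate edge would cause."""
    delta = transitive_triads(list(edges) + [candidate]) - transitive_triads(edges)
    return local_score + w_global * delta

graph = [("dog", "canine"), ("canine", "animal")]
# Adding dog -> animal closes a transitive triad, so the global term rewards it;
# the reversed edge closes nothing and gets only the local score.
s_closing = score(graph, ("dog", "animal"), local_score=0.5)
s_reversed = score(graph, ("animal", "dog"), local_score=0.5)
```

The real model learns weights over many such motif features with a max-margin objective and uses per-relation feature counts; the point of the sketch is only that a globally well-formed edge can outscore a locally identical but structurally implausible one.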
Emptying the Ocean with a Spoon: Should We Edit Models?
We call into question the recently popularized method of direct model editing
as a means of correcting factual errors in LLM generations. We contrast model
editing with three similar but distinct approaches that pursue better defined
objectives: (1) retrieval-based architectures, which decouple factual memory
from inference and linguistic capabilities embodied in LLMs; (2) concept
erasure methods, which aim at preventing systemic bias in generated text; and
(3) attribution methods, which aim at grounding generations into identified
textual sources. We argue that direct model editing cannot be trusted as a
systematic remedy for the disadvantages inherent to LLMs, and while it has
proven potential in improving model explainability, it opens risks by
reinforcing the notion that models can be trusted for factuality. We call for
cautious promotion and application of model editing as part of the LLM
deployment process, and for responsibly limiting the use cases of LLMs to those
not relying on editing as a critical component.
Comment: Findings of ACL: EMNLP 202
Sí o no, què penses? Catalonian Independence and Linguistic Identity on Social Media
Political identity is often manifested in language variation, but the
relationship between the two is still relatively unexplored from a quantitative
perspective. This study examines the use of Catalan, a language local to the
semi-autonomous region of Catalonia in Spain, on Twitter in discourse related
to the 2017 independence referendum. We corroborate prior findings that
pro-independence tweets are more likely to include the local language than
anti-independence tweets. We also find that Catalan is used more often in
referendum-related discourse than in other contexts, contrary to prior findings
on language variation. This suggests a strong role for the Catalan language in
the expression of Catalonian political identity.
Comment: NAACL 201
Will it Unblend?
Natural language processing systems often struggle with out-of-vocabulary (OOV) terms, which do not appear in training data. Blends, such as *innoventor*, are one particularly challenging class of OOV terms, as they are formed by fusing together two or more bases that relate to the intended meaning in unpredictable ways and degrees. In this work, we run experiments on a novel dataset of English OOV blends to quantify the difficulty of interpreting the meanings of blends by large-scale contextual language models such as BERT. We first show that BERT's processing of these blends does not fully access the component meanings, leaving their contextual representations semantically impoverished. We find this is mostly due to the loss of characters resulting from blend formation. Then, we assess how easily different models can recognize the structure and recover the origin of blends, and find that context-aware embedding systems outperform character-level and context-free embeddings, although their results are still far from satisfactory.
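The structure-recovery task above — finding the bases a blend was fused from — can be sketched as a brute-force search over a lexicon: try every split of the blend and look for a word pair whose prefix and suffix reconstruct it. The lexicon and minimum-affix threshold below are arbitrary illustrative choices, not the paper's models:

```python
# Toy blend-origin recovery: find (w1, w2) pairs in a small lexicon such that
# a prefix of w1 plus a suffix of w2 spells the blend exactly.
lexicon = ["innovator", "inventor", "innovation", "editor", "internet"]

def candidate_bases(blend, lexicon, min_affix=3):
    """Enumerate base pairs for a blend; min_affix rules out trivial splits
    where one base contributes only a character or two."""
    out = []
    for i in range(min_affix, len(blend) - min_affix + 1):
        pre, suf = blend[:i], blend[i:]
        for w1 in lexicon:
            if not w1.startswith(pre):
                continue
            for w2 in lexicon:
                if w2 != w1 and w2.endswith(suf):
                    out.append((w1, w2))
    return out

bases = candidate_bases("innoventor", lexicon)
```

Even this naive search recovers *innovator* + *inventor* for *innoventor*; the hard part the paper studies is that real blends drop characters unpredictably and a real lexicon offers many spurious splits, which is why learned character- and context-aware models are needed.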
