Unsupervised Learning of Style-sensitive Word Vectors
This paper presents the first study aimed at capturing stylistic similarity
between words in an unsupervised manner. We propose extending the continuous
bag of words (CBOW) model (Mikolov et al., 2013) to learn style-sensitive word
vectors using a wider context window under the assumption that the style of all
the words in an utterance is consistent. In addition, we introduce a novel task of predicting lexical stylistic similarity and create a benchmark dataset for it. Our experiment with this dataset supports our assumption and
demonstrates that the proposed extensions contribute to the acquisition of
style-sensitive word embeddings.
Comment: 7 pages, accepted at the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018).
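A minimal sketch of the wider-window intuition, assuming gensim's CBOW implementation; the toy utterances, window sizes, and dimensionality are illustrative and this does not reproduce the paper's full model.

```python
# Minimal sketch (not the authors' code): CBOW with an unusually wide context
# window, following the assumption that style is shared across an utterance.
# Toy corpus, window sizes, and dimensionality are illustrative only.
from gensim.models import Word2Vec

utterances = [
    ["hey", "dude", "that", "flick", "was", "awesome"],
    ["good", "evening", "sir", "the", "film", "was", "excellent"],
]

# Narrow window: neighbours are mostly syntactic/semantic.
semantic_model = Word2Vec(utterances, vector_size=50, window=2, sg=0, min_count=1)

# Wide window: context covers (nearly) the whole utterance, so consistently
# styled words co-occur and their vectors drift toward stylistic similarity.
style_model = Word2Vec(utterances, vector_size=50, window=20, sg=0, min_count=1)

print(style_model.wv.most_similar("awesome", topn=3))
```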
Norm of Word Embedding Encodes Information Gain
Distributed representations of words encode lexical semantic information, but
what type of information is encoded and how? Focusing on the skip-gram with
negative-sampling method, we found that the squared norm of a static word embedding encodes the information gain conveyed by the word, where the information gain is defined as the Kullback-Leibler divergence from the word's co-occurrence distribution to the corpus unigram distribution. Our findings are
explained by the theoretical framework of the exponential family of probability
distributions and confirmed through precise experiments that remove spurious
correlations arising from word frequency. This theory also extends to
contextualized word embeddings in language models or any neural networks with
the softmax output layer. We also demonstrate that both the KL divergence and
the squared norm of embedding provide a useful metric of the informativeness of
a word in tasks such as keyword extraction, proper-noun discrimination, and
hypernym discrimination.
Comment: 23 pages, EMNLP 2023.
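As a worked illustration of the quantity involved, a small numpy sketch (an assumed count-based estimator, not the paper's code) computing KL(p(.|w) || p(.)) from raw co-occurrence counts:

```python
import numpy as np

def information_gain(cooccurrence_counts, unigram_counts):
    """KL( p(. | w) || p(.) ): how much observing word w shifts the
    distribution over context words away from the corpus unigram distribution."""
    p_context_given_w = cooccurrence_counts / cooccurrence_counts.sum()
    p_unigram = unigram_counts / unigram_counts.sum()
    mask = p_context_given_w > 0  # convention: 0 * log(0) = 0
    return np.sum(p_context_given_w[mask] *
                  np.log(p_context_given_w[mask] / p_unigram[mask]))

# Toy numbers: an informative word has a peaked context distribution, so its
# information gain is large; a word whose contexts match the unigram
# distribution has a gain near zero.
unigram = np.array([500., 300., 150., 50.])
print(information_gain(np.array([5., 5., 200., 40.]), unigram))    # large gain
print(information_gain(np.array([250., 150., 75., 25.]), unigram))  # ~0 gain
```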
Improving word mover's distance by leveraging self-attention matrix
Measuring the semantic similarity between two sentences remains an important task. The word mover's distance (WMD) computes this similarity via the optimal
alignment between the sets of word embeddings. However, WMD does not utilize
word order, making it challenging to distinguish sentences with significant
overlaps of similar words, even if they are semantically very different. Here,
we attempt to improve WMD by incorporating the sentence structure represented
by BERT's self-attention matrix (SAM). The proposed method is based on the
Fused Gromov-Wasserstein distance, which simultaneously considers the
similarity of the word embedding and the SAM for calculating the optimal
transport between two sentences. Experiments demonstrate that the proposed method improves WMD and its variants on paraphrase identification while maintaining near-equivalent performance on semantic textual similarity. Our code is available at
\url{https://github.com/ymgw55/WSMD}.
Comment: 24 pages, accepted to EMNLP 2023 Findings.
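A rough sketch of the fused-distance idea, assuming the POT library (`ot`); treating a symmetrized self-attention matrix directly as the structure cost is a simplification of the paper's method, whose actual implementation is at the repository above.

```python
# Rough sketch (not the authors' implementation; see the repo above): fuse the
# word-embedding transport cost with a structure cost taken from self-attention
# matrices, using POT's fused Gromov-Wasserstein solver.
import numpy as np
import ot  # pip install pot

def sentence_distance(emb1, emb2, sam1, sam2, alpha=0.5):
    """emb*: (n_words, dim) word embeddings of each sentence.
    sam*: (n_words, n_words) symmetric structure costs built from attention.
    alpha trades off structure (SAM) against word-embedding similarity."""
    n1, n2 = len(emb1), len(emb2)
    p, q = np.full(n1, 1.0 / n1), np.full(n2, 1.0 / n2)  # uniform word weights
    M = ot.dist(emb1, emb2)  # pairwise embedding cost across the two sentences
    return ot.gromov.fused_gromov_wasserstein2(
        M, sam1, sam2, p, q, loss_fun="square_loss", alpha=alpha)

rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=(5, 8)), rng.normal(size=(6, 8))
a1, a2 = rng.random((5, 5)), rng.random((6, 6))
a1, a2 = (a1 + a1.T) / 2, (a2 + a2.T) / 2  # symmetrize toy "attention" costs
print(sentence_distance(e1, e2, a1, a2))
```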
Beyond Vectors: Subspace Representations for Set Operations of Embeddings
In natural language processing (NLP), the role of embeddings in representing
linguistic semantics is crucial. Despite the prevalence of vector
representations in embedding sets, they exhibit limitations in expressiveness
and lack comprehensive set operations. To address this, we attempt to formulate
and apply sets and their operations within pre-trained embedding spaces.
Inspired by quantum logic, we propose to go beyond the conventional vector set
representation with our novel subspace-based approach. This methodology
constructs subspaces from sets of pre-trained embeddings, effectively preserving semantic nuances that were previously overlooked and consistently improving performance in downstream tasks.
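A minimal numpy sketch of one way to read the subspace idea (an illustration, not the paper's exact quantum-logic operations): represent a word set by an orthonormal basis of the span of its embeddings and score membership of a new vector by projection onto that subspace.

```python
# Minimal illustration (assumed reading of the subspace idea, not the paper's
# exact operations): represent a set of embeddings by an orthonormal basis of
# the subspace they span, and score membership by projection onto it.
import numpy as np

def subspace_basis(vectors):
    """Orthonormal basis of the subspace spanned by a set of embeddings (SVD)."""
    _, _, vt = np.linalg.svd(np.asarray(vectors), full_matrices=False)
    return vt  # rows form the basis, shape (n_vectors, dim)

def membership(vector, basis):
    """Soft set membership: squared norm of the unit vector's projection onto
    the subspace (1.0 = lies in the span, 0.0 = orthogonal to it)."""
    v = vector / np.linalg.norm(vector)
    return float(np.sum((basis @ v) ** 2))

rng = np.random.default_rng(0)
animal_vecs = rng.normal(size=(10, 50))        # stand-ins for e.g. {dog, cat, ...}
basis = subspace_basis(animal_vecs)
print(membership(animal_vecs[0], basis))       # ~1.0: inside the set's subspace
print(membership(rng.normal(size=50), basis))  # ~0.2 expected: unrelated vector
```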
Transformer Language Models Handle Word Frequency in Prediction Head
The prediction head is a crucial component of Transformer language models.
Despite its direct impact on prediction, this component has often been
overlooked in analyzing Transformers. In this study, we investigate the inner
workings of the prediction head, specifically focusing on bias parameters. Our
experiments with BERT and GPT-2 models reveal that the biases in their word
prediction heads play a significant role in the models' ability to reflect word
frequency in a corpus, aligning with the logit adjustment method commonly used
in long-tailed learning. We also quantify the effect of controlling the biases
in practical auto-regressive text generation scenarios; under a particular
setting, more diverse text can be generated without compromising text quality.
Comment: 11 pages, 12 figures, accepted to ACL 2023 Findings (short paper).
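A toy numpy sketch of the logit-adjustment reading (an illustration; the frequency values and the bias-scaling knob are assumptions, not the paper's exact procedure): the head's bias roughly tracks log word frequency, and scaling it down before the softmax flattens that frequency prior.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "of", "neural", "embedding", "zeugma"]
log_freq = np.log(np.array([5e6, 3e6, 2e4, 8e3, 50.0]))  # toy corpus counts

# Toy prediction head: logits = hidden @ W + b, where the bias b roughly tracks
# log corpus frequency (the effect the paper measures in BERT / GPT-2 heads).
hidden = rng.normal(size=16)
W = rng.normal(size=(16, len(vocab)))
b = log_freq - log_freq.mean()

def next_token_probs(bias_scale=1.0):
    """Softmax over the head's logits with the frequency-tracking bias scaled.
    bias_scale=0 removes the frequency prior (logit adjustment); 1 keeps it."""
    logits = hidden @ W + bias_scale * b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

print(dict(zip(vocab, next_token_probs(1.0).round(3))))  # frequency-skewed
print(dict(zip(vocab, next_token_probs(0.0).round(3))))  # flatter, more diverse
```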
Contrastive Learning-based Sentence Encoders Implicitly Weight Informative Words
The performance of sentence encoders can be significantly improved through
the simple practice of fine-tuning using contrastive loss. A natural question
arises: what characteristics do models acquire during contrastive learning?
This paper theoretically and experimentally shows that contrastive-learning-based
sentence encoders implicitly weight words based on information-theoretic
quantities; that is, more informative words receive greater weight, while
others receive less. The theory states that, in the lower bound of the optimal value of the contrastive learning objective, the norm of a word embedding reflects the information gain associated with the distribution of its surrounding words. We also conduct comprehensive experiments using various models, multiple
datasets, two methods to measure the implicit weighting of models (Integrated
Gradients and SHAP), and two information-theoretic quantities (information gain
and self-information). The results provide empirical evidence that contrastive
fine-tuning emphasizes informative words.
Comment: 16 pages, 6 figures, accepted to EMNLP 2023 Findings (short paper).
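A small sketch of one of the information-theoretic quantities involved, self-information -log p(w) estimated from corpus frequencies; the counts below are toy numbers, and in the paper such values are compared against per-word attribution scores of a contrastively fine-tuned encoder.

```python
import numpy as np

# Toy corpus frequencies; in the paper, word probabilities are estimated from a
# large corpus and compared against per-word attribution scores (Integrated
# Gradients / SHAP) of a contrastively fine-tuned sentence encoder.
counts = {"the": 5_000_000, "movie": 40_000, "cinematography": 1_200}
total = sum(counts.values())

def self_information(word):
    """-log p(word): rarer (more informative) words score higher."""
    return -np.log(counts[word] / total)

for w in counts:
    print(f"{w:>15s}  {self_information(w):.2f} nats")
# Under the paper's claim, the encoder's attribution mass should correlate with
# these values, i.e. "cinematography" is weighted more heavily than "the".
```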