Norm of Word Embedding Encodes Information Gain
Distributed representations of words encode lexical semantic information, but
what type of information is encoded and how? Focusing on the skip-gram with
negative-sampling method, we found that the squared norm of static word
embedding encodes the information gain conveyed by the word; the information
gain is defined by the Kullback-Leibler divergence of the co-occurrence
distribution of the word to the unigram distribution. Our findings are
explained by the theoretical framework of the exponential family of probability
distributions and confirmed through precise experiments that remove spurious
correlations arising from word frequency. This theory also extends to
contextualized word embeddings in language models and to any neural network
with a softmax output layer. We also demonstrate that both the KL divergence and
the squared norm of embedding provide a useful metric of the informativeness of
a word in tasks such as keyword extraction, proper-noun discrimination, and
hypernym discrimination.
Comment: 23 pages, EMNLP 202
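
The information gain described in this abstract, the KL divergence from a word's co-occurrence distribution to the unigram distribution, can be illustrated with a minimal sketch. The function name `information_gain`, the toy count matrix, and the choice to estimate the unigram distribution from the column totals of that matrix are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def information_gain(cooc_row, unigram):
    """KL(p(.|w) || p(.)) for one word, given its raw co-occurrence counts.

    cooc_row: counts of each context word seen with the target word.
    unigram:  probability of each context word overall (assumed > 0
              wherever the count is > 0).
    """
    total = sum(cooc_row)
    kl = 0.0
    for count, q in zip(cooc_row, unigram):
        if count > 0:  # terms with p = 0 contribute nothing
            p = count / total
            kl += p * math.log(p / q)
    return kl


# Toy counts (hypothetical): rows are target words, columns are context words.
counts = [[2, 2], [4, 0], [0, 4]]
grand_total = sum(map(sum, counts))
# Assumption: the unigram distribution is the normalized column totals.
unigram = [sum(col) / grand_total for col in zip(*counts)]
gains = [information_gain(row, unigram) for row in counts]
```

In this toy case the first word co-occurs with contexts in the same proportions as the unigram distribution, so its gain is zero, while the other two words each concentrate on a single context and have a gain of log 2.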
Discovering Universal Geometry in Embeddings with ICA
This study utilizes Independent Component Analysis (ICA) to unveil a
consistent semantic structure within embeddings of words or images. Our
approach extracts independent semantic components from the embeddings of a
pre-trained model by leveraging anisotropic information that remains after the
whitening process in Principal Component Analysis (PCA). We demonstrate that
each embedding can be expressed as a composition of a few intrinsic
interpretable axes and that these semantic axes remain consistent across
different languages, algorithms, and modalities. The discovery of a universal
semantic structure in the geometric patterns of embeddings enhances our
understanding of the representations in embeddings.
Comment: 29 pages, EMNLP 202
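
The two-stage idea in this abstract, PCA whitening followed by ICA on what remains, can be sketched with scikit-learn. This is only an illustration under stated assumptions: the helper name `ica_axes` is hypothetical, the input is synthetic non-Gaussian data standing in for real embeddings, and FastICA is one possible ICA algorithm rather than necessarily the one the authors used.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA


def ica_axes(embeddings, n_components):
    """Whiten with PCA, then rotate the whitened data with FastICA."""
    # Step 1: PCA whitening gives centered, uncorrelated, unit-variance axes.
    whitened = PCA(n_components=n_components, whiten=True).fit_transform(embeddings)
    # Step 2: ICA searches for a rotation of the whitened data whose axes
    # are as statistically independent (non-Gaussian) as possible.
    ica = FastICA(whiten=False, random_state=0, max_iter=1000)
    return ica.fit_transform(whitened)


# Synthetic stand-in for embeddings (not real data): Laplace-distributed
# sources mixed by a random linear map.
rng = np.random.default_rng(0)
sources = rng.laplace(size=(500, 4))
mixing = rng.normal(size=(4, 4))
embeddings = sources @ mixing

components = ica_axes(embeddings, n_components=4)
```

Because the data are already whitened, the ICA step only rotates them, so the resulting components stay uncorrelated with unit variance; what changes is which directions the axes point along.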