Correlation-based Intrinsic Evaluation of Word Vector Representations
We introduce QVEC-CCA--an intrinsic evaluation metric for word vector
representations based on correlations of learned vectors with features
extracted from linguistic resources. We show that QVEC-CCA scores are an
effective proxy for a range of extrinsic semantic and syntactic tasks. We also
show that the proposed evaluation obtains higher and more consistent
correlations with downstream tasks, compared to existing approaches to
intrinsic evaluation of word vectors that are based on word similarity. Comment: RepEval 2016, 5 pages
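The abstract describes correlating learned vectors with features from linguistic resources via canonical correlation analysis (CCA). Below is a minimal numpy sketch of that underlying computation: the mean canonical correlation between an embedding matrix and a linguistic-feature matrix, both with one row per word. It illustrates CCA itself under the assumption that both views are full-rank; it is not the paper's exact QVEC-CCA procedure.

```python
import numpy as np

def cca_correlation(X, Y):
    """Mean canonical correlation between two views of the same words.

    X: learned word embeddings, shape (n_words, dim_x)
    Y: linguistic features (e.g. supersense counts), shape (n_words, dim_y)

    Uses the SVD formulation: after centering and whitening each view,
    the singular values of Wx^T Wy are the canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    def whiten(M):
        # Orthonormal basis of the column space (left singular vectors).
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        return U

    Wx, Wy = whiten(X), whiten(Y)
    corrs = np.linalg.svd(Wx.T @ Wy, compute_uv=False)
    return float(corrs.mean())
```

A higher mean correlation indicates that the embedding dimensions are, as a whole, better aligned with the linguistic feature space; two views related by an invertible linear map score 1.0.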
Improving Negative Sampling for Word Representation using Self-embedded Features
Although the word-popularity based negative sampler has shown superb
performance in the skip-gram model, the theoretical motivation behind
oversampling popular (non-observed) words as negative samples is still not well
understood. In this paper, we start from an investigation of the gradient
vanishing issue in the skip-gram model without a proper negative sampler. By
performing an insightful analysis from the stochastic gradient descent (SGD)
learning perspective, we demonstrate that, both theoretically and intuitively,
negative samples with larger inner product scores are more informative than
those with lower scores for the SGD learner in terms of both convergence rate
and accuracy. Understanding this, we propose an alternative sampling algorithm
that dynamically selects informative negative samples during each SGD update.
More importantly, the proposed sampler accounts for multi-dimensional
self-embedded features during the sampling process, which essentially makes it
more effective than the original popularity-based (one-dimensional) sampler.
Empirical experiments further verify our observations, and show that our
fine-grained samplers gain significant improvement over the existing ones
without increasing computational complexity. Comment: Accepted in WSDM 201
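The core idea above is that negative samples with larger inner-product scores against the target are more informative for the SGD learner. A hedged sketch of one way to realize this, using a two-stage scheme (draw a uniform candidate pool, then keep the highest-scoring candidates); the function name, the pool-then-rank scheme, and all sizes are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_informative_negatives(target_vec, output_embs, positives,
                                 k=5, cand=50):
    """Select k informative negative samples for one SGD update.

    target_vec:  input embedding of the current target word, shape (dim,)
    output_embs: output embedding matrix, shape (vocab, dim)
    positives:   indices of observed context words to exclude
    """
    vocab = output_embs.shape[0]
    # Stage 1: cheap uniform candidate pool.
    pool = rng.choice(vocab, size=min(cand, vocab), replace=False)
    pool = pool[~np.isin(pool, positives)]        # drop observed words
    # Stage 2: rank candidates by inner product with the target;
    # larger scores mean larger gradient magnitude, hence more informative.
    scores = output_embs[pool] @ target_vec
    return pool[np.argsort(scores)[-k:]]
```

Ranking only a small candidate pool (rather than the full vocabulary) keeps the per-update cost comparable to plain uniform or popularity-based sampling.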
SemR-11: A Multi-Lingual Gold-Standard for Semantic Similarity and Relatedness for Eleven Languages
This work describes SemR-11, a multi-lingual dataset for evaluating semantic similarity and relatedness for 11 languages (German,
French, Russian, Italian, Dutch, Chinese, Portuguese, Swedish, Spanish, Arabic and Persian). Semantic similarity and relatedness gold
standards have been initially used to support the evaluation of semantic distance measures in the context of linguistic and knowledge
resources and distributional semantic models. SemR-11 builds upon the English gold-standards of Miller & Charles (MC), Rubenstein &
Goodenough (RG), WordSimilarity 353 (WS-353), and Simlex-999, providing a canonical translation for them. The final dataset consists
of 15,917 word pairs and can be used to support the construction and evaluation of semantic similarity/relatedness and distributional
semantic models. As a case study, the SemR-11 test collection was used to investigate how distributional semantic
models built from corpora in different languages and of different sizes perform on semantic similarity and relatedness
tasks.
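Evaluation on a gold standard like SemR-11 conventionally works by scoring each word pair with the model's cosine similarity and correlating those scores with the human judgments (Spearman's rank correlation). A self-contained numpy sketch of that standard protocol follows; the toy vectors and the assumption of no tied scores are illustrative, and this is the generic procedure rather than anything specific to the SemR-11 release.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation on ranks.

    Minimal version that assumes no tied scores.
    """
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def evaluate(emb, pairs, gold):
    """Correlate model cosine similarities with gold human scores.

    emb:   dict mapping word -> vector
    pairs: list of (word1, word2) tuples from the gold standard
    gold:  human similarity/relatedness score per pair
    """
    sims = []
    for w1, w2 in pairs:
        v1, v2 = emb[w1], emb[w2]
        sims.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return spearman(np.array(sims), np.array(gold))
```

Because only the rank order matters, Spearman correlation is insensitive to the (arbitrary) scale of both the cosine similarities and the human rating scheme, which is why it is the usual choice for datasets like WS-353 and SimLex-999.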