Correlation-based Intrinsic Evaluation of Word Vector Representations
We introduce QVEC-CCA--an intrinsic evaluation metric for word vector
representations based on correlations of learned vectors with features
extracted from linguistic resources. We show that QVEC-CCA scores are an
effective proxy for a range of extrinsic semantic and syntactic tasks. We also
show that the proposed evaluation obtains higher and more consistent
correlations with downstream tasks, compared to existing approaches to
intrinsic evaluation of word vectors that are based on word similarity. Comment: RepEval 2016, 5 pages
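The abstract describes correlating learned vectors with features from linguistic resources via canonical correlation analysis (CCA). Below is a minimal numpy sketch of that underlying computation: the mean canonical correlation between an embedding matrix and a linguistic-feature matrix, both with one row per word. It illustrates CCA itself under the assumption that both views are full-rank; it is not the paper's exact QVEC-CCA procedure.

```python
import numpy as np

def cca_correlation(X, Y):
    """Mean canonical correlation between two views of the same words.

    X: learned word embeddings, shape (n_words, dim_x)
    Y: linguistic features (e.g. supersense counts), shape (n_words, dim_y)

    Uses the SVD formulation: after centering and whitening each view,
    the singular values of Wx^T Wy are the canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    def whiten(M):
        # Orthonormal basis of the column space (left singular vectors).
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        return U

    Wx, Wy = whiten(X), whiten(Y)
    corrs = np.linalg.svd(Wx.T @ Wy, compute_uv=False)
    return float(corrs.mean())
```

A higher mean correlation indicates that the embedding dimensions are, as a whole, better aligned with the linguistic feature space; two views related by an invertible linear map score 1.0.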
Improving Negative Sampling for Word Representation using Self-embedded Features
Although the word-popularity based negative sampler has shown superb
performance in the skip-gram model, the theoretical motivation behind
oversampling popular (non-observed) words as negative samples is still not well
understood. In this paper, we start from an investigation of the gradient
vanishing issue in the skip-gram model without a proper negative sampler. By
performing an insightful analysis from the stochastic gradient descent (SGD)
learning perspective, we demonstrate that, both theoretically and intuitively,
negative samples with larger inner product scores are more informative than
those with lower scores for the SGD learner in terms of both convergence rate
and accuracy. Understanding this, we propose an alternative sampling algorithm
that dynamically selects informative negative samples during each SGD update.
More importantly, the proposed sampler accounts for multi-dimensional
self-embedded features during the sampling process, which essentially makes it
more effective than the original popularity-based (one-dimensional) sampler.
Empirical experiments further verify our observations, and show that our
fine-grained samplers gain significant improvement over the existing ones
without increasing computational complexity. Comment: Accepted in WSDM 201
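The core idea above is that negative samples with larger inner-product scores against the target are more informative for the SGD learner. A hedged sketch of one way to realize this, using a two-stage scheme (draw a uniform candidate pool, then keep the highest-scoring candidates); the function name, the pool-then-rank scheme, and all sizes are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_informative_negatives(target_vec, output_embs, positives,
                                 k=5, cand=50):
    """Select k informative negative samples for one SGD update.

    target_vec:  input embedding of the current target word, shape (dim,)
    output_embs: output embedding matrix, shape (vocab, dim)
    positives:   indices of observed context words to exclude
    """
    vocab = output_embs.shape[0]
    # Stage 1: cheap uniform candidate pool.
    pool = rng.choice(vocab, size=min(cand, vocab), replace=False)
    pool = pool[~np.isin(pool, positives)]        # drop observed words
    # Stage 2: rank candidates by inner product with the target;
    # larger scores mean larger gradient magnitude, hence more informative.
    scores = output_embs[pool] @ target_vec
    return pool[np.argsort(scores)[-k:]]
```

Ranking only a small candidate pool (rather than the full vocabulary) keeps the per-update cost comparable to plain uniform or popularity-based sampling.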
SemR-11: A Multi-Lingual Gold-Standard for Semantic Similarity and Relatedness for Eleven Languages
This work describes SemR-11, a multi-lingual dataset for evaluating semantic similarity and relatedness for 11 languages (German,
French, Russian, Italian, Dutch, Chinese, Portuguese, Swedish, Spanish, Arabic and Persian). Semantic similarity and relatedness gold
standards have been initially used to support the evaluation of semantic distance measures in the context of linguistic and knowledge
resources and distributional semantic models. SemR-11 builds upon the English gold-standards of Miller & Charles (MC), Rubenstein &
Goodenough (RG), WordSimilarity 353 (WS-353), and Simlex-999, providing a canonical translation for them. The final dataset consists
of 15,917 word pairs and can be used to support the construction and evaluation of semantic similarity/relatedness and distributional
semantic models. As a case study, the SemR-11 test collection was used to investigate how distributional semantic
models built from corpora in different languages and of different sizes perform on semantic similarity and relatedness
tasks.
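Evaluation on a gold standard like SemR-11 conventionally works by scoring each word pair with the model's cosine similarity and correlating those scores with the human judgments (Spearman's rank correlation). A self-contained numpy sketch of that standard protocol follows; the toy vectors and the assumption of no tied scores are illustrative, and this is the generic procedure rather than anything specific to the SemR-11 release.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation on ranks.

    Minimal version that assumes no tied scores.
    """
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def evaluate(emb, pairs, gold):
    """Correlate model cosine similarities with gold human scores.

    emb:   dict mapping word -> vector
    pairs: list of (word1, word2) tuples from the gold standard
    gold:  human similarity/relatedness score per pair
    """
    sims = []
    for w1, w2 in pairs:
        v1, v2 = emb[w1], emb[w2]
        sims.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return spearman(np.array(sims), np.array(gold))
```

Because only the rank order matters, Spearman correlation is insensitive to the (arbitrary) scale of both the cosine similarities and the human rating scheme, which is why it is the usual choice for datasets like WS-353 and SimLex-999.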