Efficient distributed representations beyond negative sampling
This article describes an efficient method to learn distributed
representations, also known as embeddings. This is accomplished by minimizing an
objective function similar to the one introduced in the Word2Vec algorithm and
later adopted in several works. The optimization computational bottleneck is
the calculation of the softmax normalization constants for which a number of
operations scaling quadratically with the sample size is required. This
complexity is unsuited for large datasets and negative sampling is a popular
workaround, allowing one to obtain distributed representations in linear time
with respect to the sample size. Negative sampling consists, however, in a
change of the loss function and hence solves a different optimization problem
from the one originally proposed. Our contribution is to show that the softmax
normalization constants can be estimated in linear time, allowing us to design
an efficient optimization strategy to learn distributed representations. We
test our approximation on two popular applications related to word and node
embeddings. The results show accuracy competitive with negative sampling at a
remarkably lower computational time.
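To make the bottleneck described in this abstract concrete, the following is a minimal NumPy sketch, not the authors' method. It computes the exact softmax normalization constants, which need all n × n pairwise scores, and contrasts them with a Word2Vec-style negative-sampling loss that touches only k sampled rows. The function names, the matrices `U` and `V`, and the choice of negatives are assumptions made for illustration; the linear-time estimator proposed in the paper is not reproduced here.

```python
import numpy as np

def softmax_log_normalizers(U, V):
    """Return log Z_i = log sum_j exp(u_i . v_j) for every row i.

    All n x n pairwise scores are formed, so the cost grows
    quadratically with the number of items n.
    """
    scores = U @ V.T                        # (n, n) pairwise scores
    m = scores.max(axis=1, keepdims=True)   # subtract the row max for stability
    return (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))).ravel()

def full_softmax_loss(U, V, i, j):
    """Exact -log p(j | i) under the full softmax."""
    return softmax_log_normalizers(U, V)[i] - U[i] @ V[j]

def negative_sampling_loss(U, V, i, j, negatives):
    """Negative-sampling surrogate: one positive pair plus k sampled negatives.

    Only k + 1 dot products are needed, but this is a different
    objective from the full softmax loss above.
    """
    def log_sigmoid(x):
        return -np.logaddexp(0.0, -x)       # stable log(1 / (1 + exp(-x)))

    positive_term = log_sigmoid(U[i] @ V[j])
    negative_term = log_sigmoid(-(V[negatives] @ U[i])).sum()
    return -(positive_term + negative_term)
```

Evaluating `full_softmax_loss` for one pair already requires a full row of scores, while `negative_sampling_loss` with a fixed number of negatives costs the same regardless of n. That is the trade-off the abstract refers to.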
Asynchronous Training of Word Embeddings for Large Text Corpora
Word embeddings are a powerful approach for analyzing language and have been
widely popular in numerous tasks in information retrieval and text mining.
Training embeddings over huge corpora is computationally expensive because the
input is typically sequentially processed and parameters are synchronously
updated. Distributed architectures for asynchronous training that have been
proposed either focus on scaling vocabulary sizes and dimensionality or suffer
from expensive synchronization latencies.
In this paper, we propose a scalable approach to train word embeddings by
instead partitioning the input space in order to scale to massive text corpora
while not sacrificing the performance of the embeddings. Our training procedure
does not involve any parameter synchronization except a final sub-model merge
phase that typically executes in a few minutes. Our distributed training scales
seamlessly to large corpus sizes, and we obtain comparable performance, and
sometimes an improvement of up to 45%, on a variety of NLP benchmarks using models trained
by our distributed procedure which requires of the time taken by the
baseline approach. Finally, we also show that our approach is robust to missing
words in sub-models and is able to effectively reconstruct word representations.
Comment: This paper contains 9 pages and has been accepted in the WSDM201
Comparative Analysis of Word Embeddings for Capturing Word Similarities
Distributed language representation has become the most widely used technique
for language representation in various natural language processing tasks. Most
of the natural language processing models that are based on deep learning
techniques use already pre-trained distributed word representations, commonly
called word embeddings. Determining the highest-quality word embeddings is of
crucial importance for such models. However, selecting the appropriate word
embeddings is a perplexing task since the projected embedding space is not
intuitive to humans. In this paper, we explore different approaches for
creating distributed word representations. We perform an intrinsic evaluation
of several state-of-the-art word embedding methods. Their performance on
capturing word similarities is analysed with existing benchmark datasets for
word-pair similarities. We conduct a correlation analysis between ground-truth
word similarities and the similarities obtained by different word embedding
methods.
Comment: Part of the 6th International Conference on Natural Language
Processing (NATP 2020)
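An intrinsic evaluation of the kind this abstract describes can be sketched as follows. The abstract does not say which correlation coefficient or similarity measure is used, so this sketch assumes cosine similarity and a Spearman rank correlation, both common choices for word-similarity benchmarks. The function names and the toy vectors and scores in the usage note are hypothetical, and ties are not averaged in this simplified ranking.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranks(x):
    """Ranks 0..n-1 of the values in x (ties broken by position, not averaged)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

def evaluate(embeddings, pairs):
    """Correlate human similarity scores with embedding cosine similarities.

    embeddings: dict mapping word -> 1-D NumPy vector
    pairs: iterable of (word_a, word_b, human_score)
    Pairs containing an out-of-vocabulary word are skipped.
    """
    gold, predicted = [], []
    for word_a, word_b, score in pairs:
        if word_a in embeddings and word_b in embeddings:
            gold.append(score)
            predicted.append(cosine(embeddings[word_a], embeddings[word_b]))
    return spearman(gold, predicted)
```

For example, calling `evaluate` with a dictionary of toy two-dimensional vectors and a handful of scored word pairs returns a value in [-1, 1]. Higher values mean the embedding similarities are ordered more like the human judgements.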
Word Embeddings: A Survey
This work lists and describes the main recent strategies for building
fixed-length, dense and distributed representations for words, based on the
distributional hypothesis. These representations are now commonly called word
embeddings and, in addition to encoding surprisingly good syntactic and
semantic information, have been proven useful as extra features in many
downstream NLP tasks.
Comment: 10 pages, 2 tables, 1 image
Riemannian Optimization for Skip-Gram Negative Sampling
The Skip-Gram Negative Sampling (SGNS) word embedding model, well known for its
implementation in the "word2vec" software, is usually optimized by stochastic
gradient descent. However, the optimization of the SGNS objective can be viewed as
a problem of searching for a good matrix under a low-rank constraint. The most
standard way to solve this type of problem is to apply the Riemannian optimization
framework to optimize the SGNS objective over the manifold of the required low-rank
matrices. In this paper, we propose an algorithm that optimizes the SGNS objective
using Riemannian optimization and demonstrate its superiority over popular
competitors, such as the original method to train SGNS and SVD over the SPPMI
matrix.
Comment: 9 pages, 4 figures, ACL 201
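One of the baselines named in this abstract, SVD over the SPPMI matrix, can be sketched in a few lines of NumPy. This is only the baseline, not the Riemannian algorithm the paper proposes. The sketch assumes the usual shifted positive PMI definition, max(PMI - log k, 0) with k the number of negative samples, and a symmetric split of the singular values; the function names and the toy co-occurrence counts in the usage note are hypothetical.

```python
import numpy as np

def sppmi(counts, k=5):
    """Shifted positive PMI matrix from a word-context co-occurrence matrix.

    SPPMI_ij = max(PMI_ij - log k, 0), where
    PMI_ij = log(count_ij * total / (row_sum_i * col_sum_j)).
    Cells with a zero count (or an empty row or column) map to 0.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_sums = counts.sum(axis=1, keepdims=True)
    col_sums = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row_sums * col_sums))
    pmi = np.where(np.isfinite(pmi), pmi, -np.inf)   # log(0) and 0/0 cells
    return np.maximum(pmi - np.log(k), 0.0)

def svd_embeddings(matrix, dim):
    """Rank-`dim` word embeddings from a truncated SVD of `matrix`.

    The singular values are split symmetrically (U * sqrt(S)), which is
    one common convention; other weightings are possible.
    """
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])
```

As a usage example, `svd_embeddings(sppmi(counts, k=5), dim=2)` applied to a small square count matrix returns one two-dimensional vector per row word. The low-rank factorization is computed in one shot here, whereas the abstract's method optimizes the SGNS objective directly over the manifold of low-rank matrices.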
PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks
Unsupervised text embedding methods, such as Skip-gram and Paragraph Vector,
have been attracting increasing attention due to their simplicity, scalability,
and effectiveness. However, compared to sophisticated deep learning
architectures such as convolutional neural networks, these methods usually
yield inferior results when applied to particular machine learning tasks. One
possible reason is that these text embedding methods learn the representation
of text in a fully unsupervised way, without leveraging the labeled information
available for the task. Although the low dimensional representations learned
are applicable to many different tasks, they are not particularly tuned for any
task. In this paper, we fill this gap by proposing a semi-supervised
representation learning method for text data, which we call
predictive text embedding (PTE). Predictive text embedding utilizes
both labeled and unlabeled data to learn the embedding of text. The labeled
information and different levels of word co-occurrence information are first
represented as a large-scale heterogeneous text network, which is then embedded
into a low dimensional space through a principled and efficient algorithm. This
low dimensional embedding not only preserves the semantic closeness of words
and documents, but also has a strong predictive power for the particular task.
Compared to recent supervised approaches based on convolutional neural
networks, predictive text embedding is comparable or more effective, much more
efficient, and has fewer parameters to tune.
Comment: KDD 201