8,851 research outputs found

    Efficient distributed representations beyond negative sampling

    Full text link
    This article describes an efficient method to learn distributed representations, also known as embeddings. This is accomplished minimizing an objective function similar to the one introduced in the Word2Vec algorithm and later adopted in several works. The optimization computational bottleneck is the calculation of the softmax normalization constants for which a number of operations scaling quadratically with the sample size is required. This complexity is unsuited for large datasets and negative sampling is a popular workaround, allowing one to obtain distributed representations in linear time with respect to the sample size. Negative sampling consists, however, in a change of the loss function and hence solves a different optimization problem from the one originally proposed. Our contribution is to show that the sotfmax normalization constants can be estimated in linear time, allowing us to design an efficient optimization strategy to learn distributed representations. We test our approximation on two popular applications related to word and node embeddings. The results evidence competing performance in terms of accuracy with respect to negative sampling with a remarkably lower computational time

    Asynchronous Training of Word Embeddings for Large Text Corpora

    Full text link
    Word embeddings are a powerful approach for analyzing language and have been widely popular in numerous tasks in information retrieval and text mining. Training embeddings over huge corpora is computationally expensive because the input is typically sequentially processed and parameters are synchronously updated. Distributed architectures for asynchronous training that have been proposed either focus on scaling vocabulary sizes and dimensionality or suffer from expensive synchronization latencies. In this paper, we propose a scalable approach to train word embeddings by partitioning the input space instead in order to scale to massive text corpora while not sacrificing the performance of the embeddings. Our training procedure does not involve any parameter synchronization except a final sub-model merge phase that typically executes in a few minutes. Our distributed training scales seamlessly to large corpus sizes and we get comparable and sometimes even up to 45% performance improvement in a variety of NLP benchmarks using models trained by our distributed procedure which requires 1/101/10 of the time taken by the baseline approach. Finally we also show that we are robust to missing words in sub-models and are able to effectively reconstruct word representations.Comment: This paper contains 9 pages and has been accepted in the WSDM201

    Comparative Analysis of Word Embeddings for Capturing Word Similarities

    Full text link
    Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets for word pairs similarities. The research in this paper conducts a correlation analysis between ground truth word similarities and similarities obtained by different word embedding methods.Comment: Part of the 6th International Conference on Natural Language Processing (NATP 2020

    Word Embeddings: A Survey

    Full text link
    This work lists and describes the main recent strategies for building fixed-length, dense and distributed representations for words, based on the distributional hypothesis. These representations are now commonly called word embeddings and, in addition to encoding surprisingly good syntactic and semantic information, have been proven useful as extra features in many downstream NLP tasks.Comment: 10 pages, 2 tables, 1 imag

    Riemannian Optimization for Skip-Gram Negative Sampling

    Full text link
    Skip-Gram Negative Sampling (SGNS) word embedding model, well known by its implementation in "word2vec" software, is usually optimized by stochastic gradient descent. However, the optimization of SGNS objective can be viewed as a problem of searching for a good matrix with the low-rank constraint. The most standard way to solve this type of problems is to apply Riemannian optimization framework to optimize the SGNS objective over the manifold of required low-rank matrices. In this paper, we propose an algorithm that optimizes SGNS objective using Riemannian optimization and demonstrates its superiority over popular competitors, such as the original method to train SGNS and SVD over SPPMI matrix.Comment: 9 pages, 4 figures, ACL 201

    PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks

    Full text link
    Unsupervised text embedding methods, such as Skip-gram and Paragraph Vector, have been attracting increasing attention due to their simplicity, scalability, and effectiveness. However, comparing to sophisticated deep learning architectures such as convolutional neural networks, these methods usually yield inferior results when applied to particular machine learning tasks. One possible reason is that these text embedding methods learn the representation of text in a fully unsupervised way, without leveraging the labeled information available for the task. Although the low dimensional representations learned are applicable to many different tasks, they are not particularly tuned for any task. In this paper, we fill this gap by proposing a semi-supervised representation learning method for text data, which we call the \textit{predictive text embedding} (PTE). Predictive text embedding utilizes both labeled and unlabeled data to learn the embedding of text. The labeled information and different levels of word co-occurrence information are first represented as a large-scale heterogeneous text network, which is then embedded into a low dimensional space through a principled and efficient algorithm. This low dimensional embedding not only preserves the semantic closeness of words and documents, but also has a strong predictive power for the particular task. Compared to recent supervised approaches based on convolutional neural networks, predictive text embedding is comparable or more effective, much more efficient, and has fewer parameters to tune.Comment: KDD 201
    • …
    corecore