Parallelizing Word2Vec in Shared and Distributed Memory
Word2Vec is a widely used algorithm for extracting low-dimensional vector
representations of words. It has recently generated considerable excitement in
the machine learning and natural language processing (NLP) communities due to
its exceptional performance in many NLP applications such as named entity
recognition, sentiment analysis, machine translation, and question answering.
State-of-the-art algorithms, including those by Mikolov et al., have been
parallelized for multi-core CPU architectures, but they are based on
vector-vector operations that are memory-bandwidth intensive and do not use
computational resources efficiently. In this paper, we improve the reuse of
various data structures in the algorithm through minibatching, which allows us
to express the problem using matrix-multiply operations. We also explore
different techniques to distribute the word2vec computation across the nodes
of a compute cluster, and demonstrate good strong scalability up to 32 nodes.
In combination, these techniques allow us to scale the computation
near-linearly across cores and nodes and to process hundreds of millions of
words per second, which to the best of our knowledge makes this the fastest
word2vec implementation.
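
To make the minibatching idea concrete, here is a minimal NumPy sketch of one
minibatched skip-gram update with negative sampling. All names (W_in, W_out,
the batch size B, the negative-sample count K) are illustrative assumptions,
not the paper's actual implementation; the point is that gathering a minibatch
of embeddings turns many per-sample dot products into batched matrix
multiplies.

```python
import numpy as np

# Sketch of one minibatched skip-gram update with negative sampling.
# All names and sizes are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
V, d = 10_000, 100          # vocabulary size, embedding dimension
B, K = 32, 5                # minibatch size, negatives per center word
lr = 0.025                  # learning rate

W_in = rng.normal(0, 0.1, (V, d))   # input (center-word) embeddings
W_out = rng.normal(0, 0.1, (V, d))  # output (context-word) embeddings

centers = rng.integers(0, V, B)             # center-word ids for the batch
contexts = rng.integers(0, V, (B, 1))       # one positive context per center
negatives = rng.integers(0, V, (B, K))      # K negative samples per center
targets = np.hstack([contexts, negatives])  # (B, 1+K) output-word ids
labels = np.zeros((B, 1 + K))
labels[:, 0] = 1.0                          # first column is the positive

Xin = W_in[centers]    # (B, d) gather of center embeddings
Xout = W_out[targets]  # (B, 1+K, d) gather of output embeddings

# Score the whole minibatch as one batched matrix multiply
# instead of B * (1+K) separate vector-vector dot products.
scores = np.einsum('bd,bkd->bk', Xin, Xout)   # (B, 1+K)
g = 1.0 / (1.0 + np.exp(-scores)) - labels    # logistic-loss gradient

grad_in = np.einsum('bk,bkd->bd', g, Xout)    # (B, d)
grad_out = np.einsum('bk,bd->bkd', g, Xin)    # (B, 1+K, d)

# Scatter the updates back; np.add.at handles repeated word ids.
np.add.at(W_in, centers, -lr * grad_in)
np.add.at(W_out, targets.ravel(), -lr * grad_out.reshape(-1, d))
```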
Matrix Factorization on GPUs with Memory Optimization and Approximate Computing
Matrix factorization (MF) discovers latent features from observations and has
shown great promise in collaborative filtering, data compression, feature
extraction, word embedding, and other fields. While many problem-specific
optimization techniques have been proposed, alternating least squares (ALS)
remains popular due to its general applicability (e.g., it easily handles
positive-unlabeled inputs), fast convergence, and amenability to
parallelization. Current MF implementations are optimized either for a single
machine or for a large compute cluster, yet both remain insufficient: a single
machine provides limited compute power for large-scale data, while multiple
machines suffer from a network-communication bottleneck.
To address this challenge, accelerating ALS on graphics processing units
(GPUs) is a promising direction. We propose a novel approach that enhances MF
efficiency via both memory optimization and approximate computing. The former
exploits the GPU memory hierarchy to increase data reuse, while the latter
reduces unnecessary computation without hurting the convergence of the
learning algorithm. Extensive experiments on large-scale datasets show that
our solution not only outperforms competing CPU solutions by a large margin
but also achieves a 2x-4x performance gain over state-of-the-art GPU
solutions. Our implementations are open-sourced and publicly available.
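
For readers unfamiliar with the ALS iteration the paper accelerates, here is a
minimal dense NumPy sketch of factorizing R into X @ Y.T with L2
regularization. The shapes, the regularization constant, and the fully
observed rating matrix are simplifying assumptions of this sketch (real
recommender data is sparse, and a GPU implementation would batch these solves
and exploit the memory hierarchy); it is not the authors' implementation.

```python
import numpy as np

# Dense ALS sketch for R ~ X @ Y.T with L2 regularization.
# Assumes every entry of R is observed, so the normal-equations
# matrix G is shared across all users (resp. items); with missing
# entries, each row would need its own G.
rng = np.random.default_rng(0)
m, n, f = 200, 300, 16          # users, items, latent factors
lam = 0.1                       # regularization strength (illustrative)

R = rng.random((m, n))          # observations (dense here for brevity)
X = rng.normal(0, 0.1, (m, f))  # user factors
Y = rng.normal(0, 0.1, (n, f))  # item factors

for it in range(10):
    # Fix Y, solve the regularized least-squares problem for all users:
    # (Y^T Y + lam*I) x_u = Y^T r_u, batched as one solve.
    G = Y.T @ Y + lam * np.eye(f)
    X = np.linalg.solve(G, Y.T @ R.T).T
    # Fix X, solve symmetrically for all items.
    G = X.T @ X + lam * np.eye(f)
    Y = np.linalg.solve(G, X.T @ R).T
    err = np.linalg.norm(R - X @ Y.T)   # Frobenius reconstruction error
    print(f'iter {it}: reconstruction error {err:.4f}')
```

Alternating the two closed-form solves is what makes ALS easy to parallelize:
within each half-iteration, every row of X (or Y) can be solved independently,
which maps naturally onto GPU thread blocks.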