86 research outputs found
Enabling parallelism and optimizations in data mining algorithms for power-law data
Today's data mining tasks aim to extract meaningful information from large amounts of data in a reasonable time, mainly by means of a) algorithmic advances, such as fast approximate algorithms and efficient learning algorithms, and b) architectural advances, such as machines with massive compute capacity involving distributed multi-core processors and high-throughput accelerators. For current and future generation processors, parallel algorithms are critical for fully utilizing computing resources. Furthermore, exploiting data properties for performance gains becomes crucial for data mining applications. In this work, we focus our attention on power-law behavior, a common property found in a large class of data, such as text data, internet traffic, and click-stream data. Specifically, we address the following questions in the context of power-law data: How well do the critical data mining algorithms of current interest fit today's parallel architectures? Which algorithmic and mapping opportunities can be leveraged to further improve performance? And what are the relative challenges and gains of such approaches? We first investigate the suitability of the "frequency estimation" problem for GPU-scale parallelism. Sketching algorithms are a popular choice for this task due to their desirable trade-off between estimation accuracy and space-time efficiency. However, most past work on sketch-based frequency estimation has focused on CPU implementations. In our work, we propose a novel approach for sketches, which exploits the natural skewness of power-law data to efficiently utilize the massive parallelism of modern GPUs. Next, we explore the problem of "identifying top-K frequent elements" for distributed data streams in modern distributed settings with both multi-core and multi-node CPU parallelism.
Sketch-based approaches, such as Count-Min Sketch (CMS) with a top-K heap, have an excellent update time but lack the important property of reducibility, which is needed for exploiting data parallelism. At the other end, the popular Frequent Algorithm (FA) leads to reducible summaries, but its update costs are high. Our approach, Topkapi, gives the best of both worlds: it is reducible like FA and has an efficient update time similar to CMS. For power-law data, Topkapi possesses strong theoretical guarantees and leads to significant performance gains relative to past work. Finally, we study Word2Vec, a popular word embedding method widely used in machine learning and natural language processing applications, such as machine translation, sentiment analysis, and query answering. This time, we target Single Instruction Multiple Data (SIMD) parallelism. With the increasing vector lengths in commodity CPUs, such as AVX-512 with a vector length of 512 bits, efficient utilization of the vector processing unit becomes a major performance game-changer. By employing a static multi-version code generation strategy coupled with an algorithmic approximation based on the power-law frequency distribution of words, we achieve significant reductions in training time relative to the state of the art.
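To make the reducibility contrast concrete, the following is a minimal Count-Min Sketch in Python. It is only an illustration of the standard data structure, not the thesis implementation; the width/depth values are arbitrary, and Python's built-in `hash` stands in for proper pairwise-independent hash functions. The key point is that the counter arrays are additive, so two sketches built over disjoint stream partitions reduce to one sketch over their union by element-wise addition; the auxiliary top-K heap that usually accompanies CMS has no such merge rule.

```python
class CountMinSketch:
    """Toy Count-Min Sketch; built-in hash() used for simplicity."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        return hash((row, item)) % self.width

    def update(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._hash(item, r)] += count

    def query(self, item):
        # Overestimates the true frequency; the error shrinks with width.
        return min(self.table[r][self._hash(item, r)]
                   for r in range(self.depth))

    def merge(self, other):
        # Counters are additive: sketches over disjoint stream partitions
        # reduce to a single sketch over the union of the streams.
        for r in range(self.depth):
            for c in range(self.width):
                self.table[r][c] += other.table[r][c]
```

Merging is what makes sketch counters data-parallel friendly; it is the per-element top-K bookkeeping, not the counting, that breaks reducibility in the CMS-plus-heap scheme.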
LASAGNE: Locality And Structure Aware Graph Node Embedding
In this work we propose Lasagne, a methodology to learn locality and
structure aware graph node embeddings in an unsupervised way. In particular, we
show that the performance of existing random-walk based approaches depends
strongly on the structural properties of the graph, e.g., the size of the
graph, whether the graph has a flat or upward-sloping Network Community Profile
(NCP), whether the graph is expander-like, whether the classes of interest are
more k-core-like or more peripheral, etc. For larger graphs with flat NCPs that
are strongly expander-like, existing methods lead to random walks that expand
rapidly, touching many dissimilar nodes, thereby leading to lower-quality
vector representations that are less useful for downstream tasks. Rather than
relying on global random walks or neighbors within fixed hop distances, Lasagne
exploits strongly local Approximate Personalized PageRank stationary
distributions to more precisely engineer local information into node
embeddings. This leads, in particular, to more meaningful and more useful
vector representations of nodes in poorly-structured graphs. We show that
Lasagne leads to significant improvement in downstream multi-label
classification for larger graphs with flat NCPs, that it is comparable for
smaller graphs with upward-sloping NCPs, and that it is comparable to existing
methods for link prediction tasks.
Large-Scale User Modeling with Recurrent Neural Networks for Music Discovery on Multiple Time Scales
The amount of content on online music streaming platforms is immense, and
most users only access a tiny fraction of this content. Recommender systems are
the application of choice to open up the collection to these users.
Collaborative filtering has the disadvantage that it relies on explicit
ratings, which are often unavailable, and generally disregards the temporal
nature of music consumption. On the other hand, item co-occurrence algorithms,
such as the recently introduced word2vec-based recommenders, are typically left
without an effective user representation. In this paper, we present a new
approach to model users through recurrent neural networks by sequentially
processing consumed items, represented by any type of embeddings and other
context features. This way we obtain semantically rich user representations,
which capture a user's musical taste over time. Our experimental analysis on
large-scale user data shows that our model can be used to predict future songs
a user will likely listen to, both in the short and the long term.
Comment: Author pre-print version, 20 pages, 6 figures, 4 tables
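The core idea, a recurrent cell consuming item embeddings in listening order and emitting the final hidden state as the user representation, can be sketched in a few lines of NumPy. Everything here is illustrative: the paper's actual cell type, dimensions, and context features are not shown, and a plain Elman step stands in for whatever recurrent unit the model uses.

```python
import numpy as np

rng = np.random.default_rng(0)
d_item, d_user = 8, 16                       # illustrative sizes
W_in = rng.normal(scale=0.1, size=(d_user, d_item))
W_rec = rng.normal(scale=0.1, size=(d_user, d_user))

def user_embedding(item_vectors):
    """Fold a sequence of item embeddings into one user vector."""
    h = np.zeros(d_user)
    for x in item_vectors:                   # items in consumption order
        h = np.tanh(W_in @ x + W_rec @ h)    # plain recurrent step
    return h                                 # summarizes listening history

listening_history = [rng.normal(size=d_item) for _ in range(5)]
u = user_embedding(listening_history)
```

Because the state is updated item by item, the representation naturally weights recent consumption while still carrying longer-range history, which is what enables prediction on multiple time scales.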
Asynchronous Training of Word Embeddings for Large Text Corpora
Word embeddings are a powerful approach for analyzing language and have been
widely popular in numerous tasks in information retrieval and text mining.
Training embeddings over huge corpora is computationally expensive because the
input is typically sequentially processed and parameters are synchronously
updated. Distributed architectures for asynchronous training that have been
proposed either focus on scaling vocabulary sizes and dimensionality or suffer
from expensive synchronization latencies.
In this paper, we propose a scalable approach to train word embeddings by
instead partitioning the input space, in order to scale to massive text corpora
while not sacrificing the performance of the embeddings. Our training procedure
does not involve any parameter synchronization except a final sub-model merge
phase that typically executes in a few minutes. Our distributed training scales
seamlessly to large corpus sizes, and we get comparable, and sometimes even up
to 45% better, performance on a variety of NLP benchmarks using models trained
by our distributed procedure, which requires a fraction of the time taken by the
baseline approach. Finally, we also show that we are robust to missing words in
sub-models and are able to effectively reconstruct word representations.
Comment: This paper contains 9 pages and has been accepted in the WSDM201
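The final merge phase described above might look like the following hedged sketch, where each sub-model is a dict mapping word to vector, trained independently on one corpus partition. Averaging shared words is one simple merge rule used here for illustration; the paper's actual merge procedure may differ.

```python
import numpy as np

def merge_submodels(submodels):
    """Merge word-vector dicts by averaging vectors of shared words.

    Words missing from some sub-models fall back to the sub-models
    that did observe them, which is how partial vocabularies are
    tolerated in this sketch.
    """
    merged, counts = {}, {}
    for model in submodels:
        for word, vec in model.items():
            if word in merged:
                merged[word] = merged[word] + vec
                counts[word] += 1
            else:
                merged[word] = vec.copy()
                counts[word] = 1
    return {w: v / counts[w] for w, v in merged.items()}
```

A single pass like this over the sub-models runs in time linear in the total vocabulary, consistent with a merge phase that "typically executes in a few minutes."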
GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding
Learning continuous representations of nodes is attracting growing interest
in both academia and industry recently, due to their simplicity and
effectiveness in a variety of applications. Most existing node embedding
algorithms and systems are capable of processing networks with hundreds of
thousands or a few million nodes. However, how to scale them to networks
that have tens of millions or even hundreds of millions of nodes remains a
challenging problem. In this paper, we propose GraphVite, a high-performance
CPU-GPU hybrid system for training node embeddings, by co-optimizing the
algorithm and the system. On the CPU end, augmented edge samples are generated
in parallel by online random walks on the network, and serve as the
training data. On the GPU end, a novel parallel negative sampling is proposed
to leverage multiple GPUs to train node embeddings simultaneously, without much
data transfer and synchronization. Moreover, an efficient collaboration
strategy is proposed to further reduce the synchronization cost between CPUs
and GPUs. Experiments on multiple real-world networks show that GraphVite is
highly efficient. It takes only about one minute to embed a network with 1
million nodes and 5 million edges on a single machine with 4 GPUs, and around
20 hours for a network with 66 million nodes and 1.8 billion edges. Compared to
the current fastest system, GraphVite is about 50 times faster without any
sacrifice in performance.
Comment: accepted at WWW 201
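The per-sample kernel that such a system parallelizes is the skip-gram-with-negative-sampling update over node and context embeddings. The sketch below shows only that single-threaded kernel in NumPy; GraphVite's multi-GPU embedding partitioning, CPU-side walk sampling, and collaboration strategy are not shown, and all names and sizes are illustrative. Drawing negatives uniformly (and possibly colliding with the positive) is a simplification.

```python
import numpy as np

rng = np.random.default_rng(1)
n_nodes, dim, lr = 100, 32, 0.025
emb = rng.normal(scale=0.1, size=(n_nodes, dim))  # node embeddings
ctx = np.zeros((n_nodes, dim))                    # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(u, v, num_neg=5):
    """Pull node u toward positive context v, push away random negatives."""
    grad_u = np.zeros(dim)
    pairs = [(v, 1.0)] + [(int(rng.integers(n_nodes)), 0.0)
                          for _ in range(num_neg)]
    for w, label in pairs:
        score = sigmoid(emb[u] @ ctx[w])
        g = lr * (label - score)      # gradient of the logistic loss
        grad_u += g * ctx[w]
        ctx[w] += g * emb[u]
    emb[u] += grad_u

sgns_update(3, 7)                     # one (node, context) training pair
```

Each update touches only one node row, one positive context row, and a handful of negative rows, which is why many such updates can proceed on different GPUs with little data transfer as long as the sampled rows rarely overlap.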