194 research outputs found
GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding
Learning continuous representations of nodes is attracting growing interest
in both academia and industry recently, due to their simplicity and
effectiveness in a variety of applications. Most of existing node embedding
algorithms and systems are capable of processing networks with hundreds of
thousands or a few millions of nodes. However, how to scale them to networks
that have tens of millions or even hundreds of millions of nodes remains a
challenging problem. In this paper, we propose GraphVite, a high-performance
CPU-GPU hybrid system for training node embeddings, by co-optimizing the
algorithm and the system. On the CPU end, augmented edge samples are parallelly
generated by random walks in an online fashion on the network, and serve as the
training data. On the GPU end, a novel parallel negative sampling is proposed
to leverage multiple GPUs to train node embeddings simultaneously, without much
data transfer and synchronization. Moreover, an efficient collaboration
strategy is proposed to further reduce the synchronization cost between CPUs
and GPUs. Experiments on multiple real-world networks show that GraphVite is
super efficient. It takes only about one minute for a network with 1 million
nodes and 5 million edges on a single machine with 4 GPUs, and takes around 20
hours for a network with 66 million nodes and 1.8 billion edges. Compared to
the current fastest system, GraphVite is about 50 times faster without any
sacrifice on performance.Comment: accepted at WWW 201
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep LearningInference in Function as a Service Environments
Deep neural networks (DNNs) have become core computation components within
low latency Function as a Service (FaaS) prediction pipelines: including image
recognition, object detection, natural language processing, speech synthesis,
and personalized recommendation pipelines. Cloud computing, as the de-facto
backbone of modern computing infrastructure for both enterprise and consumer
applications, has to be able to handle user-defined pipelines of diverse DNN
inference workloads while maintaining isolation and latency guarantees, and
minimizing resource waste. The current solution for guaranteeing isolation
within FaaS is suboptimal -- suffering from "cold start" latency. A major cause
of such inefficiency is the need to move large amount of model data within and
across servers. We propose TrIMS as a novel solution to address these issues.
Our proposed solution consists of a persistent model store across the GPU, CPU,
local storage, and cloud storage hierarchy, an efficient resource management
layer that provides isolation, and a succinct set of application APIs and
container technologies for easy and transparent integration with FaaS, Deep
Learning (DL) frameworks, and user code. We demonstrate our solution by
interfacing TrIMS with the Apache MXNet framework and demonstrate up to 24x
speedup in latency for image classification models and up to 210x speedup for
large models. We achieve up to 8x system throughput improvement.Comment: In Proceedings CLOUD 201
Bringing UMAP Closer to the Speed of Light with GPU Acceleration
The Uniform Manifold Approximation and Projection (UMAP) algorithm has become
widely popular for its ease of use, quality of results, and support for
exploratory, unsupervised, supervised, and semi-supervised learning. While many
algorithms can be ported to a GPU in a simple and direct fashion, such efforts
have resulted in inefficient and inaccurate versions of UMAP. We show a number
of techniques that can be used to make a faster and more faithful GPU version
of UMAP, and obtain speedups of up to 100x in practice. Many of these design
choices/lessons are general purpose and may inform the conversion of other
graph and manifold learning algorithms to use GPUs. Our implementation has been
made publicly available as part of the open source RAPIDS cuML library
(https://github.com/rapidsai/cuml)
Distributed Graph Embedding with Information-Oriented Random Walks
Graph embedding maps graph nodes to low-dimensional vectors, and is widely
adopted in machine learning tasks. The increasing availability of billion-edge
graphs underscores the importance of learning efficient and effective
embeddings on large graphs, such as link prediction on Twitter with over one
billion edges. Most existing graph embedding methods fall short of reaching
high data scalability. In this paper, we present a general-purpose,
distributed, information-centric random walk-based graph embedding framework,
DistGER, which can scale to embed billion-edge graphs. DistGER incrementally
computes information-centric random walks. It further leverages a
multi-proximity-aware, streaming, parallel graph partitioning strategy,
simultaneously achieving high local partition quality and excellent workload
balancing across machines. DistGER also improves the distributed Skip-Gram
learning model to generate node embeddings by optimizing the access locality,
CPU throughput, and synchronization efficiency. Experiments on real-world
graphs demonstrate that compared to state-of-the-art distributed graph
embedding frameworks, including KnightKing, DistDGL, and Pytorch-BigGraph,
DistGER exhibits 2.33x-129x acceleration, 45% reduction in cross-machines
communication, and > 10% effectiveness improvement in downstream tasks
Dataflow Programming and Acceleration of Computationally-Intensive Algorithms
The volume of unstructured textual information continues to grow due to recent technological advancements. This resulted in an exponential growth of information generated in various formats, including blogs, posts, social networking, and enterprise documents. Numerous Enterprise Architecture (EA) documents are also created daily, such as reports, contracts, agreements, frameworks, architecture requirements, designs, and operational guides. The processing and computation of this massive amount of unstructured information necessitate substantial computing capabilities and the implementation of new techniques. It is critical to manage this unstructured information through a centralized knowledge management platform. Knowledge management is the process of managing information within an organization. This involves creating, collecting, organizing, and storing information in a way that makes it easily accessible and usable. The research involved the development textual knowledge management system, and two use cases were considered for extracting textual knowledge from documents. The first case study focused on the safety-critical documents of a railway enterprise. Safety is of paramount importance in the railway industry. There are several EA documents including manuals, operational procedures, and technical guidelines that contain critical information. Digitalization of these documents is essential for analysing vast amounts of textual knowledge that exist in these documents to improve the safety and security of railway operations. A case study was conducted between the University of Huddersfield and the Railway Safety Standard Board (RSSB) to analyse EA safety documents using Natural language processing (NLP). A graphical user interface was developed that includes various document processing features such as semantic search, document mapping, text summarization, and visualization of key trends. For the second case study, open-source data was utilized, and textual knowledge was extracted. Several features were also developed, including kernel distribution, analysis offkey trends, and sentiment analysis of words (such as unique, positive, and negative) within the documents. Additionally, a heterogeneous framework was designed using CPU/GPU and FPGAs to analyse the computational performance of document mapping
- …