DLVM: A modern compiler infrastructure for deep learning systems
Deep learning software demands reliability and performance. However, many of
the existing deep learning frameworks are software libraries that act as an
unsafe DSL embedded in Python around a computation graph interpreter. We
present DLVM, a design and implementation of a compiler infrastructure with a
linear algebra intermediate representation, algorithmic differentiation by
adjoint code generation, domain-specific optimizations, and a code generator
targeting GPUs via LLVM. Designed as a modern compiler infrastructure inspired
by LLVM, DLVM is more modular and more generic than existing deep learning
compiler frameworks, and supports tensor DSLs with high expressivity. With our
prototypical staged DSL embedded in Swift, we argue that the DLVM system
enables a form of modular, safe, and performant frameworks for deep learning.
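To make "algorithmic differentiation by adjoint code generation" concrete, here is a minimal reverse-mode automatic differentiation sketch in Python. The tape, Node class, and the two operators are invented for this illustration and are not DLVM's linear algebra IR; DLVM performs the analogous adjoint accumulation as a compile-time transformation rather than at runtime.

```python
# Reverse-mode AD by adjoint accumulation over a taped expression graph.
_tape = []

class Node:
    def __init__(self, value, parents=()):
        self.value = value      # forward value
        self.parents = parents  # (parent, local_gradient) pairs
        self.adjoint = 0.0      # accumulated dL/d(this node)
        _tape.append(self)

def add(a, b): return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])
def mul(a, b): return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def backward(output):
    # Creation order is a topological order, so a reverse sweep sees each
    # node's adjoint fully accumulated before propagating it to its parents.
    output.adjoint = 1.0
    for node in reversed(_tape):
        for parent, local_grad in node.parents:
            parent.adjoint += node.adjoint * local_grad

# f(x, y) = x*y + x  =>  df/dx = y + 1, df/dy = x
x, y = Node(3.0), Node(4.0)
f = add(mul(x, y), x)
backward(f)
print(x.adjoint, y.adjoint)  # 5.0 3.0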
Scalable Label Propagation for Multi-relational Learning on the Tensor Product of Graphs
Multi-relational learning on knowledge graphs infers high-order relations
among the entities across the graphs. This learning task can be solved by label
propagation on the tensor product of the knowledge graphs to learn the
high-order relations as a tensor. In this paper, we generalize a widely used
label propagation model to the normalized tensor product graph, and propose an
optimization formulation and a scalable Low-rank Tensor-based Label Propagation
algorithm (LowrankTLP) to infer multi-relations for two learning tasks,
hyperlink prediction and multiple graph alignment. The optimization formulation
minimizes the upper bound of the noisy tensor estimation error for multiple
graph alignment, by learning with a subset of the eigen-pairs in the spectrum
of the normalized tensor product graph. We also provide a data-dependent
transductive Rademacher bound for binary hyperlink prediction. We accelerate
LowrankTLP with parallel tensor computation, which enables label propagation
on a tensor product of 100 graphs, each of size 1000, in less than half an
hour in simulation. LowrankTLP was also applied to predicting the author-paper-venue
hyperlinks in publication records, alignment of segmented regions across up to
26 CT-scan images and alignment of protein-protein interaction networks across
multiple species. The experiments demonstrate that LowrankTLP closely
approximates the original label propagation while offering better scalability
and accuracy.
Comment: 9 pages, 6 figures
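A rough numpy sketch of the core computation: label propagation's closed form, f = (1 - alpha)(I - alpha*S)^(-1) y, evaluated on the tensor product of two graphs using only k selected eigen-pairs. Eigen-pairs of a Kronecker product are products of the factors' eigen-pairs, so the product graph is never formed explicitly. The full eigendecomposition of each factor and the magnitude-based selection rule below are simplifying assumptions, not the paper's exact algorithm.

```python
import numpy as np

def normalized_adjacency(A):
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def lowrank_tlp(A1, A2, Y, alpha=0.5, k=50):
    """Y holds initial labels on node pairs, shape (n1, n2)."""
    w1, U1 = np.linalg.eigh(normalized_adjacency(A1))
    w2, U2 = np.linalg.eigh(normalized_adjacency(A2))
    # Pick the k product eigenvalues w1[i]*w2[j] of largest magnitude.
    prod = np.outer(w1, w2)
    flat = np.argsort(-np.abs(prod), axis=None)[:k]
    i, j = np.unravel_index(flat, prod.shape)
    lam = prod[i, j]
    # Coefficients c_k = u1_{i_k}^T Y u2_{j_k}, computed factor-wise.
    c = np.einsum('nk,nk->k', U1[:, i], Y @ U2[:, j])
    scaled = (1 - alpha) * c / (1 - alpha * lam)
    # Reconstruct F = sum_k scaled_k * u1_{i_k} (x) u2_{j_k} as (n1, n2).
    return np.einsum('k,nk,mk->nm', scaled, U1[:, i], U2[:, j])
```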
RankMap: A Platform-Aware Framework for Distributed Learning from Dense Datasets
This paper introduces RankMap, a platform-aware end-to-end framework for
efficient execution of a broad class of iterative learning algorithms for
massive and dense datasets. Our framework exploits the structure of the data
to factorize it into an ensemble of lower-rank subspaces. The factorization
creates sparse, low-dimensional representations of the data, a property that is leveraged to
devise effective mapping and scheduling of iterative learning algorithms on the
distributed computing machines. We provide two APIs, one matrix-based and one
graph-based, which facilitate automated adoption of the framework for
performing several contemporary learning applications. To demonstrate the
utility of RankMap, we solve sparse recovery and power iteration problems on
various real-world datasets with up to 1.8 billion non-zeros. Our evaluations
are performed on Amazon EC2 and IBM iDataPlex servers using up to 244 cores.
The results demonstrate up to two orders of magnitude improvements in memory
usage, execution speed, and bandwidth compared with the best reported prior
work, while achieving the same level of learning accuracy.
Comment: 13 pages, 10 figures
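The mapping and scheduling details are platform-specific, but the core trick, running an iterative solver against low-rank factors instead of the dense matrix itself, fits in a few lines. The factorization and the names below are illustrative stand-ins for RankMap's pipeline, and V is kept dense for brevity although sparsity of the coefficients is what makes the distributed products cheap.

```python
import numpy as np

def power_iteration_factored(D, V, num_iters=100):
    """Top right-singular vector of A = D @ V without ever forming A."""
    x = np.random.default_rng(0).standard_normal(V.shape[1])
    for _ in range(num_iters):
        y = D @ (V @ x)        # y = A x, two cheap products
        x = V.T @ (D.T @ y)    # x = A^T y
        x /= np.linalg.norm(x)
    return x

rng = np.random.default_rng(1)
D = rng.standard_normal((10000, 20))   # tall dense dictionary
V = rng.standard_normal((20, 500))     # coefficients (sparse in practice)
v = power_iteration_factored(D, V)
```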
PyTorch-BigGraph: A Large-scale Graph Embedding System
Graph embedding methods produce unsupervised node features from graphs that
can then be used for a variety of machine learning tasks. Modern graphs,
particularly in industrial applications, contain billions of nodes and
trillions of edges, which exceeds the capability of existing embedding systems.
We present PyTorch-BigGraph (PBG), an embedding system that incorporates
several modifications to traditional multi-relation embedding systems that
allow it to scale to graphs with billions of nodes and trillions of edges. PBG
uses graph partitioning to train arbitrarily large embeddings either on a
single machine or in a distributed environment. We demonstrate comparable
performance with existing embedding systems on common benchmarks, while
allowing for scaling to arbitrarily large graphs and parallelization on
multiple machines. We train and evaluate embeddings on several large social
network graphs as well as the full Freebase dataset, which contains over 100
million nodes and 2 billion edges.
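A condensed sketch of the partitioning scheme described above: nodes are hashed into P partitions, edges fall into (source-partition, destination-partition) buckets, and training touches one bucket at a time so that only two partitions' embeddings need to be resident in memory. The hashing rule and the load/save/train callables are illustrative placeholders, not PBG's API.

```python
import itertools

P = 4  # number of node partitions

def partition_of(node_id):
    return hash(node_id) % P

def train_epoch(edges, load_embeddings, save_embeddings, train_bucket):
    # Group edges by their (source-partition, destination-partition) bucket.
    buckets = {b: [] for b in itertools.product(range(P), repeat=2)}
    for src, dst in edges:
        buckets[(partition_of(src), partition_of(dst))].append((src, dst))
    # Train one bucket at a time; only two partitions are resident in RAM.
    for (ps, pd), bucket_edges in buckets.items():
        if not bucket_edges:
            continue
        src_emb = load_embeddings(ps)
        dst_emb = src_emb if ps == pd else load_embeddings(pd)
        train_bucket(bucket_edges, src_emb, dst_emb)
        save_embeddings(ps, src_emb)
        if ps != pd:
            save_embeddings(pd, dst_emb)
```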
AMPNet: Asynchronous Model-Parallel Training for Dynamic Neural Networks
New types of machine learning hardware, in development and entering the
market, hold the promise of revolutionizing deep learning as profoundly as
GPUs did. However, existing software frameworks and training algorithms for deep
learning have yet to evolve to fully leverage the capability of the new wave of
silicon. We already see the limitations of existing algorithms for models that
exploit structured input via complex and instance-dependent control flow, which
prohibits minibatching. We present an asynchronous model-parallel (AMP)
training algorithm that is specifically motivated by training on networks of
interconnected devices. Through an implementation on multi-core CPUs, we show
that AMP training converges to the same accuracy as conventional synchronous
training algorithms in a similar number of epochs, but utilizes the available
hardware more efficiently even for small minibatch sizes, resulting in
significantly shorter overall training times. Our framework opens the door for
scaling up a new class of deep learning models that cannot be efficiently
trained today.
Comment: 17 pages, 13 figures
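To illustrate the flavor of AMP training, here is a toy two-stage pipeline with scalar "layers": each stage runs in its own thread, activations flow forward and gradients flow backward through queues, and each stage applies updates as gradients arrive rather than waiting for a global barrier, so gradients may be computed against slightly stale parameters. Everything here (the scalar model, queue depth, learning rate) is invented for the sketch and is not the AMPNet runtime.

```python
import queue
import random
import threading

def stage1(w1, xs, fwd_q, bwd_q, lr):
    pending = queue.Queue()  # inputs whose gradient has not returned yet
    for x in xs:
        pending.put(x)
        fwd_q.put((x, w1[0] * x))        # send activation, no barrier
        while not bwd_q.empty():         # apply any gradients already back,
            dL_da = bwd_q.get()          # computed against stale parameters
            w1[0] -= lr * dL_da * pending.get()
    fwd_q.put(None)                      # signal end of stream
    while not pending.empty():           # drain remaining gradients
        w1[0] -= lr * bwd_q.get() * pending.get()

def stage2(w2, target_fn, fwd_q, bwd_q, lr):
    while (item := fwd_q.get()) is not None:
        x, a = item
        dL_dout = 2 * (w2[0] * a - target_fn(x))  # squared-error gradient
        bwd_q.put(dL_dout * w2[0])       # gradient w.r.t. the activation
        w2[0] -= lr * dL_dout * a        # local update, applied immediately

# Fit y = 6x with the two-stage model y_hat = w2 * (w1 * x).
w1, w2 = [1.0], [1.0]
fwd, bwd = queue.Queue(maxsize=4), queue.Queue()  # bounded pipeline depth
rng = random.Random(0)
xs = [rng.uniform(0.5, 1.5) for _ in range(2000)]
t1 = threading.Thread(target=stage1, args=(w1, xs, fwd, bwd, 0.01))
t2 = threading.Thread(target=stage2, args=(w2, lambda x: 6 * x, fwd, bwd, 0.01))
t1.start(); t2.start(); t1.join(); t2.join()
print(round(w1[0] * w2[0], 1))  # converges to ~6.0 despite stale gradients
```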
Parallel Programming Models for Heterogeneous Many-Cores : A Survey
Heterogeneous many-cores are now an integral part of modern computing systems,
ranging from embedded systems to supercomputers. While heterogeneous many-core
design offers the potential for energy-efficient, high-performance computing,
such potential can only be unlocked if the application programs are suitably
parallel and can be made to match the underlying heterogeneous platform. In
this article, we provide a comprehensive survey of parallel programming models
for heterogeneous many-core architectures and review compiler techniques for
improving programmability and portability. We examine various software
optimization techniques for minimizing the communication overhead between
heterogeneous computing devices. We provide a road map for a wide variety of
different research areas. We conclude with a discussion on open issues in the
area and potential research directions. This article provides both an
accessible introduction to the fast-moving area of heterogeneous programming
and a detailed bibliography of its main achievements.
Comment: Accepted to be published at CCF Transactions on High Performance Computing
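As a taste of the communication-minimization techniques such a survey covers, here is a toy double-buffering loop that overlaps the transfer of chunk i+1 with the computation on chunk i. The transfer and compute callables are placeholders for real host-to-device copies and kernel launches.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunks(chunks, transfer, compute):
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer, chunks[0])
        for nxt in chunks[1:]:
            current = pending.result()              # wait for chunk i's copy
            pending = copier.submit(transfer, nxt)  # start copying chunk i+1
            results.append(compute(current))        # overlaps with that copy
        results.append(compute(pending.result()))
    return results
```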
STEP : A Distributed Multi-threading Framework Towards Efficient Data Analytics
Various general-purpose distributed systems have been proposed to cope with
high-diversity applications in the pipeline of Big Data analytics. Most of them
provide simple yet effective primitives to simplify distributed programming.
While these rigid primitives offer great ease of use, they can compromise
performance and limit flexibility in data representation and program
specification, which are critical properties in
real systems. In this paper, we discuss the limitations of coarse-grained
primitives and aim to provide an alternative for users to have flexible control
over distributed programs and operate globally shared data more efficiently. We
develop STEP, a novel distributed framework based on in-memory key-value store.
The key idea of STEP is to adapt the single-machine multi-threading model to
a distributed environment. STEP enables users to take fine-grained control over
distributed threads and apply task-specific optimizations in a flexible manner.
The underlying key-value store serves as distributed shared memory to keep
globally shared data. To ensure ease of use, STEP offers a rich set of
interfaces for distributed shared-data manipulation, cluster
management, distributed thread management and synchronization. We conduct
extensive experimental studies to evaluate the performance of STEP using real
data sets. The results show that STEP outperforms the state-of-the-art
general-purpose distributed systems as well as a specialized ML platform in
many real applications.
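A minimal in-process mock of the central idea: worker threads coordinate through a shared key-value store via an atomic read-modify-write primitive, rather than through rigid dataflow operators. The SharedKV class and the word-count example are illustrative stand-ins, not STEP's actual interfaces.

```python
import threading

class SharedKV:
    """Stand-in for STEP's distributed in-memory key-value store."""
    def __init__(self):
        self._data, self._lock = {}, threading.Lock()
    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)
    def update(self, key, fn, default=0):
        # Atomic read-modify-write: the primitive that lets fine-grained
        # threads operate on globally shared data directly.
        with self._lock:
            self._data[key] = fn(self._data.get(key, default))

store = SharedKV()

def worker(words):
    for word in words:
        store.update(('count', word), lambda c: c + 1)

corpus = ["a b a", "b c"]
threads = [threading.Thread(target=worker, args=(line.split(),)) for line in corpus]
for t in threads: t.start()
for t in threads: t.join()
print(store.get(('count', 'a')))  # 2
```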
Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms
Stochastic algorithms are efficient approaches to solving machine learning
and optimization problems. In this paper, we propose a general framework called
Splash for parallelizing stochastic algorithms on multi-node distributed
systems. Splash consists of a programming interface and an execution engine.
Using the programming interface, the user develops sequential stochastic
algorithms without worrying about any details of distributed computing. The
algorithm is then automatically parallelized by a communication-efficient
execution engine. We provide theoretical justification for the optimal rate of
convergence when parallelizing stochastic gradient descent. Splash is built on
top of Apache Spark. The real-data experiments on logistic regression,
collaborative filtering and topic modeling verify that Splash yields
order-of-magnitude speedup over single-thread stochastic algorithms and over
state-of-the-art implementations on Spark.
Comment: redo experiments to learn bigger models; compare Splash with state-of-the-art implementations on Spark
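The contract Splash exposes can be sketched compactly: the user writes a sequential stochastic update (here, one SGD step for least squares), and an engine runs it on data shards in parallel, then combines the local models. Plain averaging below is the simplest stand-in; Splash's actual engine uses a more careful, communication-efficient reweighting scheme on top of Spark.

```python
from concurrent.futures import ProcessPoolExecutor
import random

def sgd_shard(args):
    shard, lr = args
    w = 0.0
    for x, y in shard:                  # the user's sequential update rule
        w -= lr * 2 * (w * x - y) * x   # one SGD step for least squares
    return w

def parallel_run(data, num_workers=4, lr=0.05):
    shards = [(data[i::num_workers], lr) for i in range(num_workers)]
    with ProcessPoolExecutor(num_workers) as pool:
        local_models = list(pool.map(sgd_shard, shards))
    return sum(local_models) / len(local_models)  # combine local models

if __name__ == "__main__":
    rng = random.Random(0)
    data = [(x, 3 * x + rng.gauss(0, 0.1))
            for x in (rng.uniform(-1, 1) for _ in range(4000))]
    print(round(parallel_run(data), 2))  # ~3.0
```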
Neural Feature Learning From Relational Database
Feature engineering is one of the most important but most tedious tasks in
data science. This work studies the automation of feature learning from
relational databases. We first prove theoretically that finding the optimal features from
relational data for predictive tasks is NP-hard. We propose an efficient
rule-based approach based on heuristics and a deep neural network to
automatically learn appropriate features from relational data. We benchmark
our approaches in ensembles on past Kaggle competitions. Our new approach wins
late-submission medals and beats the state-of-the-art solutions by significant
margins. To the best of our knowledge, this is the first time an automated
data science system could win medals in Kaggle competitions involving complex
relational databases.
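To make the rule-based side concrete, the snippet below follows a foreign key from a target table and emits aggregate features over the joined rows, the kind of transformation such a system searches over automatically. The tables and column names are invented for the example.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "label": [0, 1]})
orders = pd.DataFrame({"customer_id": [1, 1, 2],
                       "amount": [10.0, 30.0, 5.0]})

# Rule: for a numeric column reachable through a foreign key, emit
# {mean, sum, count} aggregates as candidate features for the target table.
aggs = orders.groupby("customer_id")["amount"].agg(["mean", "sum", "count"])
aggs.columns = [f"orders_amount_{name}" for name in aggs.columns]
features = customers.merge(aggs.reset_index(), on="customer_id", how="left")
print(features)
```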
Simple, Distributed, and Accelerated Probabilistic Programming
We describe a simple, low-level approach for embedding probabilistic
programming in a deep learning ecosystem. In particular, we distill
probabilistic programming down to a single abstraction: the random variable.
Our lightweight implementation in TensorFlow enables numerous applications: a
model-parallel variational auto-encoder (VAE) with 2nd-generation tensor
processing units (TPUv2s); a data-parallel autoregressive model (Image
Transformer) with TPUv2s; and multi-GPU No-U-Turn Sampler (NUTS). For both a
state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256
CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2
chips. With NUTS, we see a 100x speedup on GPUs over Stan and 37x over PyMC3.
Comment: Appears in Neural Information Processing Systems, 2018. Code available at http://bit.ly/2JpFip
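A stripped-down, framework-free illustration of that single abstraction: a random variable wraps a sampler, draws a value at construction, and then behaves like that value inside ordinary numeric code, so models compose as plain functions. The real implementation rides on TensorFlow tensors and distributions; this toy uses Python floats and only two operators.

```python
import random

class RandomVariable:
    """Wraps a sampler; behaves like its sampled value in computation."""
    def __init__(self, sampler):
        self.value = sampler()          # draw once, at construction
    def __add__(self, other): return self.value + float(other)
    def __mul__(self, other): return self.value * float(other)
    def __float__(self): return self.value

def Normal(loc, scale):
    # loc/scale may themselves be RandomVariables, so coerce via float().
    return RandomVariable(lambda: random.gauss(float(loc), float(scale)))

# A two-level model written as ordinary function composition.
z = Normal(0.0, 1.0)        # latent variable
x = Normal(z * 2.0, 0.5)    # observation model centered at 2z
print(float(x))
```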