DLVM: A modern compiler infrastructure for deep learning systems
Deep learning software demands reliability and performance. However, many of
the existing deep learning frameworks are software libraries that act as an
unsafe DSL embedded in Python around a computation graph interpreter. We
present DLVM, a design and implementation of a compiler infrastructure with a
linear algebra intermediate representation, algorithmic differentiation by
adjoint code generation, domain-specific optimizations, and a code generator
targeting GPUs via LLVM. Designed as a modern compiler infrastructure inspired
by LLVM, DLVM is more modular and more generic than existing deep learning
compiler frameworks, and supports tensor DSLs with high expressivity. With our
prototypical staged DSL embedded in Swift, we argue that the DLVM system
enables a form of modular, safe, and performant frameworks for deep learning.
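To make "algorithmic differentiation by adjoint code generation" concrete, here is a minimal reverse-mode automatic differentiation sketch in Python. The tape, Node class, and the two operators are invented for this illustration and are not DLVM's linear algebra IR; DLVM performs the analogous adjoint accumulation as a compile-time transformation rather than at runtime.

```python
# Reverse-mode AD by adjoint accumulation over a taped expression graph.
_tape = []

class Node:
    def __init__(self, value, parents=()):
        self.value = value      # forward value
        self.parents = parents  # (parent, local_gradient) pairs
        self.adjoint = 0.0      # accumulated dL/d(this node)
        _tape.append(self)

def add(a, b): return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])
def mul(a, b): return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def backward(output):
    # Creation order is a topological order, so a reverse sweep sees each
    # node's adjoint fully accumulated before propagating it to its parents.
    output.adjoint = 1.0
    for node in reversed(_tape):
        for parent, local_grad in node.parents:
            parent.adjoint += node.adjoint * local_grad

# f(x, y) = x*y + x  =>  df/dx = y + 1, df/dy = x
x, y = Node(3.0), Node(4.0)
f = add(mul(x, y), x)
backward(f)
print(x.adjoint, y.adjoint)  # 5.0 3.0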
Scalable Label Propagation for Multi-relational Learning on the Tensor Product of Graphs
Multi-relational learning on knowledge graphs infers high-order relations
among the entities across the graphs. This learning task can be solved by label
propagation on the tensor product of the knowledge graphs to learn the
high-order relations as a tensor. In this paper, we generalize a widely used
label propagation model to the normalized tensor product graph, and propose an
optimization formulation and a scalable Low-rank Tensor-based Label Propagation
algorithm (LowrankTLP) to infer multi-relations for two learning tasks,
hyperlink prediction and multiple graph alignment. The optimization formulation
minimizes the upper bound of the noisy tensor estimation error for multiple
graph alignment, by learning with a subset of the eigen-pairs in the spectrum
of the normalized tensor product graph. We also provide a data-dependent
transductive Rademacher bound for binary hyperlink prediction. We accelerate
LowrankTLP with parallel tensor computation, which enables label propagation
on a tensor product of 100 graphs, each of size 1000, in less than half an
hour in simulation. LowrankTLP was also applied to predicting the author-paper-venue
hyperlinks in publication records, alignment of segmented regions across up to
26 CT-scan images and alignment of protein-protein interaction networks across
multiple species. The experiments demonstrate that LowrankTLP closely
approximates the original label propagation while offering better scalability
and accuracy.
Comment: 9 pages, 6 figures
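A rough numpy sketch of the core computation: label propagation's closed form, f = (1 - alpha)(I - alpha*S)^(-1) y, evaluated on the tensor product of two graphs using only k selected eigen-pairs. Eigen-pairs of a Kronecker product are products of the factors' eigen-pairs, so the product graph is never formed explicitly. The full eigendecomposition of each factor and the magnitude-based selection rule below are simplifying assumptions, not the paper's exact algorithm.

```python
import numpy as np

def normalized_adjacency(A):
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def lowrank_tlp(A1, A2, Y, alpha=0.5, k=50):
    """Y holds initial labels on node pairs, shape (n1, n2)."""
    w1, U1 = np.linalg.eigh(normalized_adjacency(A1))
    w2, U2 = np.linalg.eigh(normalized_adjacency(A2))
    # Pick the k product eigenvalues w1[i]*w2[j] of largest magnitude.
    prod = np.outer(w1, w2)
    flat = np.argsort(-np.abs(prod), axis=None)[:k]
    i, j = np.unravel_index(flat, prod.shape)
    lam = prod[i, j]
    # Coefficients c_k = u1_{i_k}^T Y u2_{j_k}, computed factor-wise.
    c = np.einsum('nk,nk->k', U1[:, i], Y @ U2[:, j])
    scaled = (1 - alpha) * c / (1 - alpha * lam)
    # Reconstruct F = sum_k scaled_k * u1_{i_k} (x) u2_{j_k} as (n1, n2).
    return np.einsum('k,nk,mk->nm', scaled, U1[:, i], U2[:, j])
```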
RankMap: A Platform-Aware Framework for Distributed Learning from Dense Datasets
This paper introduces RankMap, a platform-aware end-to-end framework for
efficient execution of a broad class of iterative learning algorithms for
massive and dense datasets. Our framework exploits the structure of the data
to factorize it into an ensemble of lower-rank subspaces. The factorization
creates sparse, low-dimensional representations of the data, a property that is leveraged to
devise effective mapping and scheduling of iterative learning algorithms on the
distributed computing machines. We provide two APIs, one matrix-based and one
graph-based, which facilitate automated adoption of the framework for
performing several contemporary learning applications. To demonstrate the
utility of RankMap, we solve sparse recovery and power iteration problems on
various real-world datasets with up to 1.8 billion non-zeros. Our evaluations
are performed on Amazon EC2 and IBM iDataPlex servers using up to 244 cores.
The results demonstrate up to two orders of magnitude improvements in memory
usage, execution speed, and bandwidth compared with the best reported prior
work, while achieving the same level of learning accuracy.
Comment: 13 pages, 10 figures
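The mapping and scheduling details are platform-specific, but the core trick, running an iterative solver against low-rank factors instead of the dense matrix itself, fits in a few lines. The factorization and the names below are illustrative stand-ins for RankMap's pipeline, and V is kept dense for brevity although sparsity of the coefficients is what makes the distributed products cheap.

```python
import numpy as np

def power_iteration_factored(D, V, num_iters=100):
    """Top right-singular vector of A = D @ V without ever forming A."""
    x = np.random.default_rng(0).standard_normal(V.shape[1])
    for _ in range(num_iters):
        y = D @ (V @ x)        # y = A x, two cheap products
        x = V.T @ (D.T @ y)    # x = A^T y
        x /= np.linalg.norm(x)
    return x

rng = np.random.default_rng(1)
D = rng.standard_normal((10000, 20))   # tall dense dictionary
V = rng.standard_normal((20, 500))     # coefficients (sparse in practice)
v = power_iteration_factored(D, V)
```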
PyTorch-BigGraph: A Large-scale Graph Embedding System
Graph embedding methods produce unsupervised node features from graphs that
can then be used for a variety of machine learning tasks. Modern graphs,
particularly in industrial applications, contain billions of nodes and
trillions of edges, which exceeds the capability of existing embedding systems.
We present PyTorch-BigGraph (PBG), an embedding system that incorporates
several modifications to traditional multi-relation embedding systems that
allow it to scale to graphs with billions of nodes and trillions of edges. PBG
uses graph partitioning to train arbitrarily large embeddings either on a
single machine or in a distributed environment. We demonstrate comparable
performance with existing embedding systems on common benchmarks, while
allowing for scaling to arbitrarily large graphs and parallelization on
multiple machines. We train and evaluate embeddings on several large social
network graphs as well as the full Freebase dataset, which contains over 100
million nodes and 2 billion edges.
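A condensed sketch of the partitioning scheme described above: nodes are hashed into P partitions, edges fall into (source-partition, destination-partition) buckets, and training touches one bucket at a time so that only two partitions' embeddings need to be resident in memory. The hashing rule and the load/save/train callables are illustrative placeholders, not PBG's API.

```python
import itertools

P = 4  # number of node partitions

def partition_of(node_id):
    return hash(node_id) % P

def train_epoch(edges, load_embeddings, save_embeddings, train_bucket):
    # Group edges by their (source-partition, destination-partition) bucket.
    buckets = {b: [] for b in itertools.product(range(P), repeat=2)}
    for src, dst in edges:
        buckets[(partition_of(src), partition_of(dst))].append((src, dst))
    # Train one bucket at a time; only two partitions are resident in RAM.
    for (ps, pd), bucket_edges in buckets.items():
        if not bucket_edges:
            continue
        src_emb = load_embeddings(ps)
        dst_emb = src_emb if ps == pd else load_embeddings(pd)
        train_bucket(bucket_edges, src_emb, dst_emb)
        save_embeddings(ps, src_emb)
        if ps != pd:
            save_embeddings(pd, dst_emb)
```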
AMPNet: Asynchronous Model-Parallel Training for Dynamic Neural Networks
New types of machine learning hardware, in development and entering the
market, hold the promise of revolutionizing deep learning as profoundly as
GPUs did. However, existing software frameworks and training algorithms for deep
learning have yet to evolve to fully leverage the capability of the new wave of
silicon. We already see the limitations of existing algorithms for models that
exploit structured input via complex and instance-dependent control flow, which
prohibits minibatching. We present an asynchronous model-parallel (AMP)
training algorithm that is specifically motivated by training on networks of
interconnected devices. Through an implementation on multi-core CPUs, we show
that AMP training converges to the same accuracy as conventional synchronous
training algorithms in a similar number of epochs, but utilizes the available
hardware more efficiently even for small minibatch sizes, resulting in
significantly shorter overall training times. Our framework opens the door for
scaling up a new class of deep learning models that cannot be efficiently
trained today.
Comment: 17 pages, 13 figures
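To illustrate the flavor of AMP training, here is a toy two-stage pipeline with scalar "layers": each stage runs in its own thread, activations flow forward and gradients flow backward through queues, and each stage applies updates as gradients arrive rather than waiting for a global barrier, so gradients may be computed against slightly stale parameters. Everything here (the scalar model, queue depth, learning rate) is invented for the sketch and is not the AMPNet runtime.

```python
import queue
import random
import threading

def stage1(w1, xs, fwd_q, bwd_q, lr):
    pending = queue.Queue()  # inputs whose gradient has not returned yet
    for x in xs:
        pending.put(x)
        fwd_q.put((x, w1[0] * x))        # send activation, no barrier
        while not bwd_q.empty():         # apply any gradients already back,
            dL_da = bwd_q.get()          # computed against stale parameters
            w1[0] -= lr * dL_da * pending.get()
    fwd_q.put(None)                      # signal end of stream
    while not pending.empty():           # drain remaining gradients
        w1[0] -= lr * bwd_q.get() * pending.get()

def stage2(w2, target_fn, fwd_q, bwd_q, lr):
    while (item := fwd_q.get()) is not None:
        x, a = item
        dL_dout = 2 * (w2[0] * a - target_fn(x))  # squared-error gradient
        bwd_q.put(dL_dout * w2[0])       # gradient w.r.t. the activation
        w2[0] -= lr * dL_dout * a        # local update, applied immediately

# Fit y = 6x with the two-stage model y_hat = w2 * (w1 * x).
w1, w2 = [1.0], [1.0]
fwd, bwd = queue.Queue(maxsize=4), queue.Queue()  # bounded pipeline depth
rng = random.Random(0)
xs = [rng.uniform(0.5, 1.5) for _ in range(2000)]
t1 = threading.Thread(target=stage1, args=(w1, xs, fwd, bwd, 0.01))
t2 = threading.Thread(target=stage2, args=(w2, lambda x: 6 * x, fwd, bwd, 0.01))
t1.start(); t2.start(); t1.join(); t2.join()
print(round(w1[0] * w2[0], 1))  # converges to ~6.0 despite stale gradients
```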
Parallel Programming Models for Heterogeneous Many-Cores : A Survey
Heterogeneous many-cores are now an integral part of modern computing systems,
ranging from embedded systems to supercomputers. While heterogeneous many-core
design offers the potential for energy-efficient, high-performance computing,
such potential can only be unlocked if the application programs are suitably
parallel and can be made to match the underlying heterogeneous platform. In
this article, we provide a comprehensive survey of parallel programming models
for heterogeneous many-core architectures and review compiler techniques for
improving programmability and portability. We examine various software
optimization techniques for minimizing the communication overhead between
heterogeneous computing devices. We provide a road map for a wide variety of
different research areas. We conclude with a discussion on open issues in the
area and potential research directions. This article provides both an
accessible introduction to the fast-moving area of heterogeneous programming
and a detailed bibliography of its main achievements.
Comment: Accepted to be published at CCF Transactions on High Performance Computing
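As a taste of the communication-minimization techniques such a survey covers, here is a toy double-buffering loop that overlaps the transfer of chunk i+1 with the computation on chunk i. The transfer and compute callables are placeholders for real host-to-device copies and kernel launches.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunks(chunks, transfer, compute):
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer, chunks[0])
        for nxt in chunks[1:]:
            current = pending.result()              # wait for chunk i's copy
            pending = copier.submit(transfer, nxt)  # start copying chunk i+1
            results.append(compute(current))        # overlaps with that copy
        results.append(compute(pending.result()))
    return results
```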
STEP : A Distributed Multi-threading Framework Towards Efficient Data Analytics
Various general-purpose distributed systems have been proposed to cope with
high-diversity applications in the pipeline of Big Data analytics. Most of them
provide simple yet effective primitives to simplify distributed programming.
While these rigid primitives offer great ease of use, they can compromise
performance and limit flexibility in data representation and program
specification, which are critical properties in
real systems. In this paper, we discuss the limitations of coarse-grained
primitives and aim to provide an alternative for users to have flexible control
over distributed programs and operate globally shared data more efficiently. We
develop STEP, a novel distributed framework based on in-memory key-value store.
The key idea of STEP is to adapt the single-machine multi-threading model to
a distributed environment. STEP enables users to take fine-grained control over
distributed threads and apply task-specific optimizations in a flexible manner.
The underlying key-value store serves as distributed shared memory to keep
globally shared data. To ensure ease of use, STEP offers a rich set of
interfaces for distributed shared-data manipulation, cluster
management, distributed thread management and synchronization. We conduct
extensive experimental studies to evaluate the performance of STEP using real
data sets. The results show that STEP outperforms the state-of-the-art
general-purpose distributed systems as well as a specialized ML platform in
many real applications.
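A minimal in-process mock of the central idea: worker threads coordinate through a shared key-value store via an atomic read-modify-write primitive, rather than through rigid dataflow operators. The SharedKV class and the word-count example are illustrative stand-ins, not STEP's actual interfaces.

```python
import threading

class SharedKV:
    """Stand-in for STEP's distributed in-memory key-value store."""
    def __init__(self):
        self._data, self._lock = {}, threading.Lock()
    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)
    def update(self, key, fn, default=0):
        # Atomic read-modify-write: the primitive that lets fine-grained
        # threads operate on globally shared data directly.
        with self._lock:
            self._data[key] = fn(self._data.get(key, default))

store = SharedKV()

def worker(words):
    for word in words:
        store.update(('count', word), lambda c: c + 1)

corpus = ["a b a", "b c"]
threads = [threading.Thread(target=worker, args=(line.split(),)) for line in corpus]
for t in threads: t.start()
for t in threads: t.join()
print(store.get(('count', 'a')))  # 2
```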
Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms
Stochastic algorithms are efficient approaches to solving machine learning
and optimization problems. In this paper, we propose a general framework called
Splash for parallelizing stochastic algorithms on multi-node distributed
systems. Splash consists of a programming interface and an execution engine.
Using the programming interface, the user develops sequential stochastic
algorithms without worrying about any details of distributed computing. The
algorithm is then automatically parallelized by a communication-efficient
execution engine. We provide theoretical justification for the optimal rate of
convergence when parallelizing stochastic gradient descent. Splash is built on
top of Apache Spark. The real-data experiments on logistic regression,
collaborative filtering and topic modeling verify that Splash yields
order-of-magnitude speedup over single-thread stochastic algorithms and over
state-of-the-art implementations on Spark.
Comment: redo experiments to learn bigger models; compare Splash with state-of-the-art implementations on Spark
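The contract Splash exposes can be sketched compactly: the user writes a sequential stochastic update (here, one SGD step for least squares), and an engine runs it on data shards in parallel, then combines the local models. Plain averaging below is the simplest stand-in; Splash's actual engine uses a more careful, communication-efficient reweighting scheme on top of Spark.

```python
from concurrent.futures import ProcessPoolExecutor
import random

def sgd_shard(args):
    shard, lr = args
    w = 0.0
    for x, y in shard:                  # the user's sequential update rule
        w -= lr * 2 * (w * x - y) * x   # one SGD step for least squares
    return w

def parallel_run(data, num_workers=4, lr=0.05):
    shards = [(data[i::num_workers], lr) for i in range(num_workers)]
    with ProcessPoolExecutor(num_workers) as pool:
        local_models = list(pool.map(sgd_shard, shards))
    return sum(local_models) / len(local_models)  # combine local models

if __name__ == "__main__":
    rng = random.Random(0)
    data = [(x, 3 * x + rng.gauss(0, 0.1))
            for x in (rng.uniform(-1, 1) for _ in range(4000))]
    print(round(parallel_run(data), 2))  # ~3.0
```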
Neural Feature Learning From Relational Database
Feature engineering is one of the most important but most tedious tasks in
data science. This work studies the automation of feature learning from
relational databases. We first prove theoretically that finding the optimal features from
relational data for predictive tasks is NP-hard. We propose an efficient
rule-based approach based on heuristics and a deep neural network to
automatically learn appropriate features from relational data. We benchmark
our approaches in ensembles on past Kaggle competitions. Our new approach wins
late-submission medals and beats the state-of-the-art solutions by significant
margins. To the best of our knowledge, this is the first time an automated
data science system could win medals in Kaggle competitions involving complex
relational databases.
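To make the rule-based side concrete, the snippet below follows a foreign key from a target table and emits aggregate features over the joined rows, the kind of transformation such a system searches over automatically. The tables and column names are invented for the example.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "label": [0, 1]})
orders = pd.DataFrame({"customer_id": [1, 1, 2],
                       "amount": [10.0, 30.0, 5.0]})

# Rule: for a numeric column reachable through a foreign key, emit
# {mean, sum, count} aggregates as candidate features for the target table.
aggs = orders.groupby("customer_id")["amount"].agg(["mean", "sum", "count"])
aggs.columns = [f"orders_amount_{name}" for name in aggs.columns]
features = customers.merge(aggs.reset_index(), on="customer_id", how="left")
print(features)
```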
Simple, Distributed, and Accelerated Probabilistic Programming
We describe a simple, low-level approach for embedding probabilistic
programming in a deep learning ecosystem. In particular, we distill
probabilistic programming down to a single abstraction: the random variable.
Our lightweight implementation in TensorFlow enables numerous applications: a
model-parallel variational auto-encoder (VAE) with 2nd-generation tensor
processing units (TPUv2s); a data-parallel autoregressive model (Image
Transformer) with TPUv2s; and multi-GPU No-U-Turn Sampler (NUTS). For both a
state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256
CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2
chips. With NUTS, we see a 100x speedup on GPUs over Stan and 37x over PyMC3.
Comment: Appears in Neural Information Processing Systems, 2018. Code available at http://bit.ly/2JpFip
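A stripped-down, framework-free illustration of that single abstraction: a random variable wraps a sampler, draws a value at construction, and then behaves like that value inside ordinary numeric code, so models compose as plain functions. The real implementation rides on TensorFlow tensors and distributions; this toy uses Python floats and only two operators.

```python
import random

class RandomVariable:
    """Wraps a sampler; behaves like its sampled value in computation."""
    def __init__(self, sampler):
        self.value = sampler()          # draw once, at construction
    def __add__(self, other): return self.value + float(other)
    def __mul__(self, other): return self.value * float(other)
    def __float__(self): return self.value

def Normal(loc, scale):
    # loc/scale may themselves be RandomVariables, so coerce via float().
    return RandomVariable(lambda: random.gauss(float(loc), float(scale)))

# A two-level model written as ordinary function composition.
z = Normal(0.0, 1.0)        # latent variable
x = Normal(z * 2.0, 0.5)    # observation model centered at 2z
print(float(x))
```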