
2 research outputs found

Deep Graph Library Optimizations for Intel(R) x86 Architecture

Authors: Avancha Sasikanth, Md Vasimuddin, Misra Sanchit, Mohanty Ramanarayan
Publication date: 13/07/2020

The Deep Graph Library (DGL) was designed as a tool to enable structure learning from graphs by supporting a core abstraction for graphs, including the popular Graph Neural Networks (GNNs). DGL contains implementations of all core graph operations for both the CPU and the GPU. In this paper, we focus specifically on the CPU implementations and present performance analysis, optimizations, and results across a set of GNN applications using the latest version of DGL (0.4.3). Across 7 applications, we achieve speed-ups ranging from 1.5x to 13x over the baseline CPU implementations.

arXiv.org e-Print Archive
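To illustrate the kind of core graph operation the DGL abstract refers to, here is a minimal sketch, assuming the common GNN pattern where each node sums the feature vectors of its in-neighbors (equivalent to a sparse-dense product of the adjacency matrix with the feature matrix). The function name `csr_sum_aggregate` and the pure-Python list representation are hypothetical. This is not DGL's API and not the optimized CPU kernel described in the paper.

```python
from typing import List


def csr_sum_aggregate(indptr: List[int], indices: List[int],
                      feats: List[List[float]]) -> List[List[float]]:
    """Hypothetical kernel: for each destination node v, sum its in-neighbors' features.

    indptr/indices form a CSR structure indexed by destination node, so the
    in-neighbors of v are indices[indptr[v]:indptr[v + 1]].
    feats holds one feature row per source node.
    """
    num_dst = len(indptr) - 1
    dim = len(feats[0]) if feats else 0
    out = [[0.0] * dim for _ in range(num_dst)]
    # Each output row depends only on its own edge range, so a multithreaded
    # CPU version could split this outer loop across threads without locks.
    for v in range(num_dst):
        acc = out[v]
        for e in range(indptr[v], indptr[v + 1]):
            src_row = feats[indices[e]]
            for d in range(dim):
                acc[d] += src_row[d]
    return out


# Edges 1->0, 0->1 and 2->1; node 2 has no in-neighbors.
indptr = [0, 1, 3, 3]
indices = [1, 0, 2]
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(csr_sum_aggregate(indptr, indices, feats))
```

Storing edges grouped by destination lets every output row be written by exactly one loop iteration. That independence is what makes row-parallel CPU execution straightforward.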

Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning & HPC Workloads

Authors: Adelman Menachem, Aggarwal Deepti, Anderson Cristina, Avancha Sasikanth, Breuer Alexander, Bruestle Jeremy, Chaudhary Narendra, Georganas Evangelos, Heinecke Alexander, Kalamkar Dhiraj, Kundu Abhisek, Kutnick Denise, Laub Frank, Md Vasimuddin, Misra Sanchit, Mohanty Ramanarayan, Pabst Hans, Retford Brian, Ziv Barukh
Publication venue: Association for Computing Machinery (ACM)
Publication date: 30/11/2021

During the past decade, novel Deep Learning (DL) algorithms, workloads, and hardware have been developed to tackle a wide range of problems. Despite the advances in workload and hardware ecosystems, the programming methodology of DL systems is stagnant. DL workloads rely either on highly optimized yet platform-specific and inflexible kernels from DL libraries or, in the case of novel operators, on reference implementations built from DL framework primitives with underwhelming performance. This work introduces the Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of DL workloads with high productivity. TPPs define a compact yet versatile set of 2D-tensor operators (a virtual Tensor ISA), which can subsequently be used as building blocks to construct complex operators on high-dimensional tensors. The TPP specification is platform-agnostic, so code expressed via TPPs is portable, whereas the TPP implementation is highly optimized and platform-specific. We demonstrate the efficacy and viability of our approach using standalone kernels and end-to-end DL and HPC workloads expressed entirely via TPPs that outperform state-of-the-art implementations on multiple platforms.

arXiv.org e-Print Archive
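The TPP abstract describes composing a small set of 2D-tensor operators into operators on larger tensors. Below is a minimal sketch of that idea, assuming a batch-reduce GEMM (C += sum_i A_i · B_i over small 2D blocks) as the 2D building block. A full matrix multiplication is then expressed only by calls to that block. All function names (`gemm_2d`, `brgemm`, `blocked_matmul`) are hypothetical, and this is plain Python, not the TPP specification or any library's API.

```python
from typing import List

Matrix = List[List[float]]


def zeros(rows: int, cols: int) -> Matrix:
    return [[0.0] * cols for _ in range(rows)]


def gemm_2d(a: Matrix, b: Matrix, c: Matrix) -> None:
    """Hypothetical 2D primitive: c += a @ b on small row-major blocks."""
    m, k, n = len(a), len(b), len(b[0])
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                c[i][j] += a_ip * b[p][j]


def brgemm(a_blocks: List[Matrix], b_blocks: List[Matrix], c: Matrix) -> None:
    """Hypothetical batch-reduce GEMM: c += sum_i a_blocks[i] @ b_blocks[i]."""
    for a, b in zip(a_blocks, b_blocks):
        gemm_2d(a, b, c)


def block(x: Matrix, r0: int, c0: int, rows: int, cols: int) -> Matrix:
    """Copy out the rows x cols sub-block of x starting at (r0, c0)."""
    return [row[c0:c0 + cols] for row in x[r0:r0 + rows]]


def blocked_matmul(x: Matrix, w: Matrix, bm: int, bn: int, bk: int) -> Matrix:
    """Compute y = x @ w using only the 2D brgemm building block."""
    m, k, n = len(x), len(w), len(w[0])
    if m % bm or n % bn or k % bk:
        raise ValueError("this sketch assumes each dimension is divisible by its block size")
    y = zeros(m, n)
    for i0 in range(0, m, bm):
        for j0 in range(0, n, bn):
            c = zeros(bm, bn)
            # Gather the row of x-blocks and column of w-blocks that reduce into c.
            a_blocks = [block(x, i0, k0, bm, bk) for k0 in range(0, k, bk)]
            b_blocks = [block(w, k0, j0, bk, bn) for k0 in range(0, k, bk)]
            brgemm(a_blocks, b_blocks, c)
            for i in range(bm):
                y[i0 + i][j0:j0 + bn] = c[i]
    return y


x = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
print(blocked_matmul(x, w, bm=2, bn=2, bk=2))
```

The portability argument from the abstract maps onto this split. `blocked_matmul` only depends on the 2D building block's contract, so a platform-specific, vectorized replacement for `gemm_2d`/`brgemm` would speed it up without changing the higher-level code.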