Intel nGraph: An Intermediate Representation, Compiler, and Executor for
  Deep Learning by Cyphers, Scott et al.
Intel® nGraph™
An Intermediate Representation, Compiler, and Executor for Deep Learning
Sco Cyphers
Arjun K. Bansal
Anahita Bhiwandiwalla
Jayaram Bobba
Mahew Brookhart
Avijit Chakraborty
Will Constable
sco.cyphers@intel.com
arjun.bansal@intel.com
anahita.bhiwandiwalla@intel.com
jayaram.bobba@intel.com
mahew.i.brookhart@intel.com
avijit.chakraborty@intel.com
will.h.constable@intel.com
Intel
Christian Convey
Leona Cook
Omar Kanawi
Robert Kimball
Jason Knight
Nikolay Korovaiko
Varun Kumar
christian.convey@intel.com
leona.cook@intel.com
omar.kanawi@intel.com
robert.kimball@intel.com
jason.knight@intel.com
nikolay.korovaiko@intel.com
varun.v.kumar@intel.com
Intel
Yixing Lao
Christopher R. Lishka
Jaikrishnan Menon
Jennifer Myers
Sandeep Aswath Narayana
Adam Procter
Tristan J. Webb
yixing.lao@intel.com
christopher.r.lishka@intel.com
jaikrishnan.menon@intel.com
jennifer.myers@intel.com
sandeep.aswath.narayana@intel.com
adam.m.procter@intel.com
tristan.webb@intel.com
Intel
ABSTRACT
The Deep Learning (DL) community sees many novel topologies
published each year. Achieving high performance on each new
topology remains challenging, as each requires some level of manual
effort. This issue is compounded by the proliferation of frameworks
and hardware platforms. The current approach, which we call
“direct optimization”, requires deep changes within each framework
to improve the training performance for each hardware backend
(CPUs, GPUs, FPGAs, ASICs) and requires O(fp) effort, where f is
the number of frameworks and p is the number of platforms. While
optimized kernels for deep-learning primitives are provided via
libraries like Intel® Math Kernel Library for Deep Neural Networks
(MKL-DNN), there are several compiler-inspired ways in which
performance can be further optimized. Building on our experience
creating neon (a fast deep learning library on GPUs), we developed
Intel nGraph, a soon to be open-sourced C++ library to simplify the
realization of optimized deep learning performance across frameworks
and hardware platforms. Initially-supported frameworks
include TensorFlow, MXNet, and Intel® neon framework. Initial
backends are Intel Architecture CPUs (CPU), the Intel® Nervana
Neural Network Processor™ (NNP), and NVIDIA GPUs. Currently
supported compiler optimizations include efficient memory management
and data layout abstraction. In this paper, we describe
our overall architecture and its core components. In the future,
we envision extending nGraph API support to a wider range of
frameworks, hardware (including FPGAs and ASICs), and compiler
optimizations (training versus inference optimizations, multi-node
and multi-device scaling via efficient sub-graph partitioning, and
HW-specific compounding of operations).
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
SYSML 2018, February 16, 2018, Stanford University
© 2016 Copyright held by the owner/author(s).
1 INTRODUCTION
Deep learning frameworks are libraries that provide domain-specific
languages for defining deep learning computations and APIs for
managing data and executing computations. Backends’ implementations
are encapsulated so that the same computation can execute
on multiple backends, such as CPUs and GPUs. We adopt the organization
of compilers such as LLVM[7] by converting framework-specific
computation definitions into a framework-independent
intermediate representation (IR) that we compile into a form that
can execute on the backend. An nGraph framework bridge acts as
a framework backend. Each nGraph backend has a transformer
that compiles or interprets the IR and provides an allocation and
execution API that the framework bridges use to implement the
framework’s API. A key difference between compilers for languages
like C++ and compilers for deep learning frameworks is that with
deep learning, the data being operated on is large and variable-sized,
but highly amenable to parallelization.
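The bridge and transformer organization described above can be illustrated with a small Python sketch. Everything in it is hypothetical and is not the nGraph API: the two toy "frameworks", the list-of-steps IR, and the names `BRIDGES`, `TRANSFORMERS`, and `compile_model` are assumptions made for illustration. The point is only that each framework needs one bridge and each backend one transformer, and any bridge can then be paired with any transformer through the shared IR.

```python
# Hypothetical sketch: one shared IR decouples frameworks from backends.
# The toy IR is a list of (op, operand) steps applied to a single number.

def bridge_framework_a(model):
    # "Framework A" describes a model as text, e.g. "add 2; mul 3".
    return [(s.split()[0], float(s.split()[1])) for s in model.split(";")]

def bridge_framework_b(model):
    # "Framework B" describes a model as a list of dicts.
    return [(step["op"], float(step["value"])) for step in model]

def transformer_interpreter(ir):
    # Backend 1: interpret the IR step by step at call time.
    def run(x):
        for op, v in ir:
            x = x + v if op == "add" else x * v
        return x
    return run

def transformer_codegen(ir):
    # Backend 2: generate source for the whole IR, then compile it once.
    expr = "x"
    for op, v in ir:
        expr = f"({expr} {'+' if op == 'add' else '*'} {v!r})"
    return eval(f"lambda x: {expr}")

BRIDGES = {"a": bridge_framework_a, "b": bridge_framework_b}
TRANSFORMERS = {"interp": transformer_interpreter, "codegen": transformer_codegen}

def compile_model(framework, model, backend):
    # Any bridge composes with any transformer through the IR.
    return TRANSFORMERS[backend](BRIDGES[framework](model))
```

For example, `compile_model("a", "add 2; mul 3", "codegen")` and `compile_model("b", [{"op": "add", "value": 2}, {"op": "mul", "value": 3}], "interp")` both return a callable computing (x + 2) * 3.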
1.1 Related Work
Google’s accelerated linear algebra (XLA)[6] compiler acts as an
experimental backend for TensorFlow[4]. Unlike nGraph, the XLA
project has made no public comments about support of frameworks
other than TensorFlow.
arXiv:1801.08058v2 [cs.DC] 30 Jan 2018
The DMLC group announced the NNVM[8] project as a lightweight
graph optimization library for deep learning and later announced
TVM[5] as an ahead-of-time compiler that supports multiple
hardware platforms and interoperates with NNVM. NNVM
leaves the operator set unspecified, which makes different frontends
and backends incompatible. nGraph, XLA, and LLVM use a fixed,
but extensible, IR operation set.
DLVM [9] is a project of the University of Illinois Urbana-Champaign.
DLVM proposes an LLVM-inspired modular IR with full
control flow and a side-effect-free representation. It remains to be
seen if the more flexible IR is capable of supporting the performance
optimizations enabled by simpler data flow graph IRs like those of
nGraph, XLA, and NNVM.
ONNX [2] is a recent cross-industry effort, in which we participate,
to standardize an IR for inference. The nGraph IR has a
richer feature set, including support for training and a rich set of
optimization passes and backends for execution. We will aim for
ONNX interoperability.
The activity in the space of deep learning compilers and IRs
highlights the need for them, and we look forward to a healthy exchange
of ideas as the field moves forward on these complementary efforts.
We believe nGraph can interoperate with developing standards
through framework bridges and nGraph backends.
2 INTERMEDIATE REPRESENTATION
An nGraph IR is a directed acyclic graph of stateless operation
nodes. Each node has zero or more inputs and zero or more outputs.
Nodes may have additional constant attributes that affect their
behavior, such as which axes to sum over. The inputs and attributes
of a node determine the shape and element types of the outputs.
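The following Python sketch illustrates how output shapes and element types can be derived from a node's inputs and constant attributes. It is not the nGraph C++ API: the `Node` class and the `parameter`, `add`, and `reduce_sum` constructors are hypothetical names, and each node is limited to a single output for brevity.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Node:
    """A stateless operation node (hypothetical, single-output for brevity)."""
    op: str                          # e.g. "parameter", "add", "sum"
    inputs: Tuple["Node", ...] = ()  # producer nodes; the graph is acyclic
    attrs: Tuple = ()                # constant attributes, e.g. axes to reduce
    shape: Tuple[int, ...] = ()      # output shape
    dtype: str = "float32"           # output element type

def parameter(shape, dtype="float32"):
    # A graph input: shape and element type are given, not inferred.
    return Node("parameter", shape=tuple(shape), dtype=dtype)

def add(a, b):
    # Element-wise add: the output inherits shape and type from the inputs.
    if a.shape != b.shape or a.dtype != b.dtype:
        raise ValueError("add requires matching shapes and element types")
    return Node("add", (a, b), shape=a.shape, dtype=a.dtype)

def reduce_sum(a, axes):
    # The "axes" attribute decides which dimensions disappear from the output.
    axes = tuple(sorted(set(axes)))
    if any(ax < 0 or ax >= len(a.shape) for ax in axes):
        raise ValueError("reduction axis out of range")
    out_shape = tuple(d for i, d in enumerate(a.shape) if i not in axes)
    return Node("sum", (a,), attrs=(("axes", axes),), shape=out_shape, dtype=a.dtype)
```

Summing an `add` of two (32, 3, 224, 224) parameters over axes 2 and 3 yields a node whose inferred shape is (32, 3), without executing any computation.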
Nodes operate on multi-dimensional arrays, called tensors. Most
frameworks associate semantics with particular axis positions. For
example, images might be stored in tensors ordered by mini-batch
size, channels, height and width. The framework-specific ordering
is usually a reflection of op implementation and tensor element
layout. When dealing with video or high-dimensional time series
datasets, rank restrictions from framework ops require explicit
tensor reshaping and axis reordering. With the exception of tensors
directly accessible to framework users, nGraph does not have a
fixed relationship between axis order and tensor element layout.
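One common way to decouple logical axis order from element layout is to describe a tensor by a shape plus per-axis strides over a flat buffer. The sketch below shows that general technique in Python; it is an assumption for illustration, not a description of how nGraph stores tensors, and `TensorView`, `permute`, and `at` are hypothetical names.

```python
def row_major_strides(shape):
    # Strides (in elements) for a dense row-major layout of `shape`.
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

class TensorView:
    """Logical axes over a flat buffer; layout is carried by the strides."""

    def __init__(self, data, shape, strides=None):
        self.data = data
        self.shape = tuple(shape)
        self.strides = (tuple(strides) if strides is not None
                        else row_major_strides(self.shape))

    def permute(self, order):
        # Reorder logical axes without moving any element in memory.
        return TensorView(self.data,
                          [self.shape[i] for i in order],
                          [self.strides[i] for i in order])

    def at(self, *index):
        # Map a logical index to an offset in the flat buffer.
        return self.data[sum(i * s for i, s in zip(index, self.strides))]
```

With `t = TensorView(list(range(24)), (2, 3, 4))`, the view `t.permute((0, 2, 1))` has shape (2, 4, 3) and shares the same buffer, so axis reordering need not imply a copy.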
3 FRAMEWORK BRIDGES
Frameworks use a framework-specific symbolic representation of
their computations, called a computational graph. Backends use the
graph to interpret or compile computations. The graph is also used
in the implementation of some form of the autodiff algorithm for
the computation of derivatives, either by computing the derivative
directly on the graph, or by computing the graph for a derivative
computation from an existing graph. Framework bridges belonging
to the nGraph library use the graph to construct the nGraph IR.
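The second autodiff style mentioned above, building a graph for the derivative from an existing graph, can be sketched as reverse-mode differentiation over a tiny expression graph. This is a generic illustration under simplifying assumptions (scalar values, only `add` and `mul` ops); the `Node` class and helper names are hypothetical and are not taken from nGraph or any framework.

```python
class Node:
    def __init__(self, op, inputs=(), value=None, name=None):
        self.op, self.inputs, self.value, self.name = op, tuple(inputs), value, name

def const(v): return Node("const", value=float(v))
def param(name): return Node("param", name=name)
def add(a, b): return Node("add", (a, b))
def mul(a, b): return Node("mul", (a, b))

def topo_order(root):
    # Inputs appear before the nodes that consume them.
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for i in n.inputs:
            visit(i)
        order.append(n)
    visit(root)
    return order

def gradient(output, wrt):
    """Return a new graph node that computes d(output)/d(wrt)."""
    adjoint = {id(output): const(1.0)}
    # Visit consumers before producers so each adjoint is complete when used.
    for node in reversed(topo_order(output)):
        g = adjoint.get(id(node))
        if g is None or not node.inputs:
            continue
        a, b = node.inputs
        if node.op == "add":
            contribs = [(a, g), (b, g)]
        else:  # "mul": d(a*b)/da = b, d(a*b)/db = a
            contribs = [(a, mul(g, b)), (b, mul(g, a))]
        for inp, c in contribs:
            key = id(inp)
            adjoint[key] = c if key not in adjoint else add(adjoint[key], c)
    return adjoint.get(id(wrt), const(0.0))

def evaluate(node, env):
    if node.op == "const":
        return node.value
    if node.op == "param":
        return env[node.name]
    a, b = (evaluate(i, env) for i in node.inputs)
    return a + b if node.op == "add" else a * b
```

For f = x*x + x*y, `gradient(f, x)` returns a graph that evaluates to 2x + y; the derivative is itself an ordinary graph that a backend could compile like any other.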
Apache MXNet[1] is a core C++ library and a C API for interacting
with several frontend languages. Machine learning models may
be defined imperatively through Gluon or symbolically through the
standard MXNet frontend. Models’ operations are represented as
nodes in the NNVM graph. The MXNet-nGraph bridge translates
the NNVM inference graph into nGraph IR; it selects the largest
possible computation for the respective backend and uses autodiff
on the nGraph IR for the derivative. Compiled nGraph functions
can then interface with the standard MXNet execution engines.
TensorFlow’s[4] XLA framework enables compilation and execution
of TensorFlow graphs on novel hardware such as NNP.
During the execution of a computation via XLA, an HLO IR of
the TensorFlow computation is sent to a device such as a CPU,
GPU or novel hardware. Our bridge plugin registers itself as a new
XLA device, maps HLO IR to nGraph IR, and returns a compiled
function. During the execution of the TensorFlow computation,
the function is invoked by TensorFlow on the input data and the
resulting nGraph output is returned.
For neon, we are creating a Python binding for the nGraph API,
which we hope to also use with other Python-based frameworks.
4 TRANSFORMERS
The IR generated by the nGraph library is passed to a transformer
for the generation of code optimized specifically for the selected
backend. These newly-optimized backends provide facilities for
pattern matching, liveness analysis, memory management, and the
combining of tensor-element layout and shape management with
backend kernel selection.
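To make the link between liveness analysis and memory management concrete, the sketch below plans buffer reuse for a linear schedule of ops. It is a simplified illustration, not nGraph's memory manager: `plan_buffers`, the schedule format, and the best-fit reuse policy are all assumptions.

```python
def plan_buffers(schedule, sizes, outputs=()):
    """Assign buffers to tensors, reusing a buffer once its tensor is dead.

    schedule: list of (produced_tensor, [consumed_tensors]) in execution order.
    sizes:    dict mapping each produced tensor to its size in bytes.
    outputs:  tensors that must stay live after the last step.
    """
    # Liveness: record the last step at which each tensor is read.
    last_use = {}
    for step, (produced, consumed) in enumerate(schedule):
        last_use.setdefault(produced, step)
        for t in consumed:
            last_use[t] = step
    for t in outputs:
        last_use[t] = len(schedule)  # never released during the schedule

    free, assignment, buffer_sizes = [], {}, []
    for step, (produced, consumed) in enumerate(schedule):
        # Reuse the smallest free buffer that fits, else allocate a new one.
        candidates = [b for b in free if buffer_sizes[b] >= sizes[produced]]
        if candidates:
            buf = min(candidates, key=lambda b: buffer_sizes[b])
            free.remove(buf)
        else:
            buf = len(buffer_sizes)
            buffer_sizes.append(sizes[produced])
        assignment[produced] = buf
        # Release tensors whose last use is this step (after allocating the
        # output, so an op never writes into a buffer it is still reading).
        for t in set(consumed) | {produced}:
            if last_use[t] == step and t in assignment:
                free.append(assignment[t])
    return assignment, buffer_sizes
```

For a chain of four equally sized intermediate tensors, this plan needs only two buffers, because each tensor dies as soon as its single consumer has run.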
The CPU transformer makes use of MKL-DNN, which produces
optimized sequences of calls to highly optimized kernels. Optimizations
provided by MKL-DNN are at a finer granularity than
those provided by nGraph. The CPU transformer will also be used
by other transformers for sub-graphs that use operations not supported
by their backend.
Intel’s NNP processor is tailored for deep learning workloads.
Its transformer lets us make the fullest use of the hardware, falling
back on the CPU transformer for unsupported operations.
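The fallback behavior can be pictured as partitioning the ops into runs that one backend handles. The sketch below does this for a linear sequence of op names; real graphs are not linear, and the `partition` function, the op names, and the backend labels are hypothetical choices for illustration.

```python
from itertools import groupby

def partition(ops, supported, primary="NNP", fallback="CPU"):
    """Split an op sequence into maximal runs placed on a single backend."""
    # Place each op on the primary backend if it is supported there.
    placed = [(primary if op in supported else fallback, op) for op in ops]
    # Merge consecutive ops with the same placement into one sub-graph.
    return [(backend, [op for _, op in group])
            for backend, group in groupby(placed, key=lambda p: p[0])]
```

Given `["conv", "relu", "custom_sort", "conv", "softmax"]` with `custom_sort` unsupported, the result is three sub-graphs: two on the primary backend with a CPU sub-graph between them.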
The cuDNN[3] transformer dynamically generates code that
links to the NVIDIA CUDA Deep Neural Network (cuDNN) library
for common kernels, such as convolution or softmax. The nGraph
library then compiles portions of the graph into LLVM IR representation,
and uses the PTX-emitting backend of LLVM to generate
the GPU assembly language PTX.
nGraph will natively support collective communication primitives
(AllReduce, Gather, Broadcast), as well as point-to-point primitives,
as core graph ops. Transformers will generate the corresponding
communication library calls. Transformers will support vanilla
MPI, or provide optimized communication methods that impose
restrictions. An example of such a restriction would be: operate
only across a homogeneous cluster of a certain topology.
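For readers unfamiliar with the collectives named above, the following single-process Python sketch shows the result each worker observes after a sum AllReduce and after a Broadcast. It models semantics only; it performs no communication, and the function names are hypothetical rather than nGraph ops or MPI calls.

```python
def all_reduce_sum(worker_tensors):
    # Every worker ends up with the element-wise sum over all workers.
    total = [sum(vals) for vals in zip(*worker_tensors)]
    return [list(total) for _ in worker_tensors]

def broadcast(worker_tensors, root=0):
    # Every worker ends up with a copy of the root worker's tensor.
    return [list(worker_tensors[root]) for _ in worker_tensors]
```

In data-parallel training, a sum AllReduce over per-worker gradients followed by division by the worker count gives every worker the same averaged gradient.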
REFERENCES
[1] 2017. Apache MXNet. (2017). Retrieved January 4, 2018 from
https://mxnet.apache.org/
[2] 2017. ONNX. (2017). Retrieved January 4, 2018 from http://onnx.ai
[3] 2018. NVIDIA cuDNN. (2018). Retrieved January 4, 2018 from
https://developer.nvidia.com/cudnn
[4] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,
Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin,
Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard,
Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg,
Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike
Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul
Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals,
Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.
2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.
(2015). https://www.tensorflow.org/ Software available from tensorflow.org.
[5] Tianqi Chen, Thierry Moreau, Ziheng Jiang, and Haichen Shen. 2017. TVM: An End
to End IR Stack for Deploying Deep Learning Workloads on Hardware Platforms.
(Aug 2017). http://tvmlang.org/2017/08/17/tvm-release-announcement.html
[6] Google. 2017. XLA Overview. (2017). Retrieved January 5, 2018 from
https://www.tensorflow.org/performance/xla/
[7] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong
Program Analysis & Transformation. In Proceedings of the 2004 International
Symposium on Code Generation and Optimization (CGO’04). Palo Alto, California.
[8] Amazon Web Service AI team. 2017. (Oct 2017).
http://tvmlang.org/2017/10/06/nnvm-compiler-announcement.html
[9] Richard Wei, Vikram S. Adve, and Lane Schwartz. 2017. DLVM: A modern
compiler infrastructure for deep learning systems. CoRR abs/1711.03016 (2017).
arXiv:1711.03016 http://arxiv.org/abs/1711.03016
