The Deep Learning (DL) community sees many novel topologies published each year. Achieving high performance on each new topology remains challenging, as each requires some level of manual effort. This issue is compounded by the proliferation of frameworks and hardware platforms. The current approach, which we call "direct optimization", requires deep changes within each framework to improve the training performance for each hardware backend (CPUs, GPUs, FPGAs, ASICs) and requires
INTRODUCTION
Deep learning frameworks are libraries that provide domain-specific languages for defining deep learning computations and APIs for managing data and executing computations. Backends' implementations are encapsulated so that the same computation can execute on multiple backends, such as CPUs and GPUs. We adopt the organization of compilers such as LLVM [7] by converting framework-specific computation definitions into a framework-independent intermediate representation (IR) that we compile into a form that can execute on the backend. An nGraph framework bridge acts as a framework backend. Each nGraph backend has a transformer that compiles or interprets the IR and provides an allocation and execution API that the framework bridges use to implement the framework's API. A key difference between compilers for languages like C++ and compilers for deep learning frameworks is that with deep learning, the data being operated on is large and variable-sized, but highly amenable to parallelization.
Related Work
Google's accelerated linear algebra (XLA) [6] compiler acts as an experimental backend for TensorFlow [4]. Unlike nGraph, the XLA project has made no public comments about supporting frameworks other than TensorFlow. [5] as an ahead-of-time compiler that supports multiple hardware platforms and interoperates with NNVM. NNVM leaves the operator set unspecified, which makes different frontends and backends incompatible. nGraph, XLA, and LLVM use a fixed, but extensible, IR operation set.
DLVM [9] is a project of the University of Illinois at Urbana-Champaign. DLVM proposes an LLVM-inspired modular IR with full control flow and a side-effect-free representation. It remains to be seen if the more flexible IR is capable of supporting the performance optimizations enabled by simpler dataflow graph IRs like those of nGraph, XLA, and NNVM.
ONNX [2] is a recent cross-industry effort, in which we participate, to standardize an IR for inference. The nGraph IR has a richer feature set, including support for training and a rich set of optimization passes and backends for execution. We will aim for ONNX interoperability.
The activity in the space of deep learning compilers and IRs highlights the need for them, and we look forward to a healthy exchange of ideas as the field moves forward on these complementary efforts. We believe nGraph can interoperate with developing standards through framework bridges and nGraph backends.
INTERMEDIATE REPRESENTATION
An nGraph IR is a directed acyclic graph of stateless operation nodes. Each node has zero or more inputs and zero or more outputs. Nodes may have additional constant attributes that affect their behavior, such as which axes to sum over. The inputs and attributes of a node determine the shape and element types of the outputs.
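To make the node model concrete, the following is a minimal sketch in Python rather than the nGraph API. The `Node` class, the `infer` function, the op names, and the `"f32"` type label are all hypothetical, and each node is assumed to have exactly one output for brevity. It illustrates how output shape and element type can be derived purely from a node's inputs and constant attributes.

```python
class Node:
    """Hypothetical stateless IR node (illustrative only, not the nGraph API)."""

    def __init__(self, op, inputs=(), **attrs):
        self.op = op
        self.inputs = tuple(inputs)   # edges to producer nodes
        self.attrs = dict(attrs)      # constant attributes, e.g. axes
        # Output metadata is a pure function of inputs and attributes.
        self.shape, self.dtype = infer(self)


def infer(node):
    """Derive (shape, element type) of a node's single output."""
    if node.op == "parameter":
        return tuple(node.attrs["shape"]), node.attrs["dtype"]
    if node.op == "add":
        a, b = node.inputs
        if a.shape != b.shape or a.dtype != b.dtype:
            raise ValueError("add requires matching shapes and element types")
        return a.shape, a.dtype
    if node.op == "sum":
        (x,) = node.inputs
        axes = set(node.attrs["axes"])
        if any(ax < 0 or ax >= len(x.shape) for ax in axes):
            raise ValueError("sum axes out of range")
        # Summed axes disappear from the output shape.
        kept = tuple(d for i, d in enumerate(x.shape) if i not in axes)
        return kept, x.dtype
    raise ValueError(f"unknown op: {node.op}")


x = Node("parameter", shape=(32, 3, 224, 224), dtype="f32")
y = Node("parameter", shape=(32, 3, 224, 224), dtype="f32")
s = Node("sum", [Node("add", [x, y])], axes=(2, 3))
print(s.shape, s.dtype)  # (32, 3) f32
```

Because a node can only be constructed from nodes that already exist and is never mutated afterwards, a graph built this way is acyclic by construction.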
Nodes operate on multi-dimensional arrays, called tensors. Most frameworks associate semantics with particular axis positions. For example, images might be stored in tensors ordered by mini-batch size, channels, height and width. The framework-specific ordering is usually a reflection of op implementation and tensor element layout. When dealing with video or high-dimensional time series datasets, rank restrictions from framework ops require explicit tensor reshaping and axis reordering. With the exception of tensors directly accessible to framework users, nGraph does not have a fixed relationship between axis order and tensor element layout.
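The separation between logical axis order and element layout can be illustrated with strides. The sketch below is an assumption-laden illustration, not nGraph code: `strides_for` and `offset` are hypothetical helpers, and the layout is described as a permutation listing logical axes from slowest-varying to fastest-varying in memory.

```python
def strides_for(shape, layout):
    """Return the element stride of each logical axis.

    `layout` lists logical axes from slowest- to fastest-varying in memory.
    """
    if sorted(layout) != list(range(len(shape))):
        raise ValueError("layout must be a permutation of the axes")
    strides = [0] * len(shape)
    step = 1
    for axis in reversed(layout):   # the fastest-varying axis gets stride 1
        strides[axis] = step
        step *= shape[axis]
    return tuple(strides)


def offset(index, strides):
    """Linear element offset of a logical index under the given strides."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3, 4, 5)                              # logical axes: N, C, H, W
nchw = strides_for(shape, layout=(0, 1, 2, 3))    # channels outermost per image
nhwc = strides_for(shape, layout=(0, 2, 3, 1))    # channels innermost
print(nchw)  # (60, 20, 5, 1)
print(nhwc)  # (60, 1, 15, 3)
```

The logical index `(n, c, h, w)` is the same in both cases; only the strides change, so a backend is free to pick whichever layout suits its kernels.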
FRAMEWORK BRIDGES
Frameworks use a framework-specific symbolic representation of their computations, called a computational graph. Backends use the graph to interpret or compile computations. The graph is also used in the implementation of some form of the autodiff algorithm for the computation of derivatives, either by computing the derivative directly on the graph, or by computing the graph for a derivative computation from an existing graph. Framework bridges belonging to the nGraph library use the graph to construct the nGraph IR.
Apache MXNet [1] is a core C++ library and a C API for interacting with several frontend languages. Machine learning models may be defined imperatively through Gluon or symbolically through the standard MXNet frontend. Models' operations are represented as nodes in the NNVM graph. The MXNet-nGraph bridge translates the NNVM inference graph into nGraph IR; it selects the largest possible computation for the respective backend and uses autodiff on the nGraph IR for the derivative. Compiled nGraph functions can then interface with the standard MXNet execution engines.
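The second autodiff strategy described in this section, building a graph for the derivative from an existing graph, can be sketched as follows. This is a toy reverse-mode implementation over scalar `add` and `mul` nodes. All names (`Node`, `grad`, `evaluate`, and so on) are hypothetical and do not reflect the nGraph or MXNet implementations.

```python
class Node:
    """Hypothetical graph node: op is 'var', 'const', 'add', or 'mul'."""

    def __init__(self, op, inputs=(), value=None, name=None):
        self.op, self.inputs = op, tuple(inputs)
        self.value, self.name = value, name


def var(name): return Node("var", name=name)
def const(v): return Node("const", value=v)
def add(a, b): return Node("add", (a, b))
def mul(a, b): return Node("mul", (a, b))


def topo_order(root):
    """Nodes reachable from root, each listed after its inputs."""
    order, seen = [], set()

    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for i in n.inputs:
            visit(i)
        order.append(n)

    visit(root)
    return order


def grad(output, wrt):
    """Build a new graph that computes d(output)/d(wrt)."""
    adjoint = {id(output): const(1.0)}
    # Users are visited before their inputs, so each adjoint is complete
    # by the time its node is processed.
    for n in reversed(topo_order(output)):
        g = adjoint[id(n)]
        if n.op == "add":
            parts = [(n.inputs[0], g), (n.inputs[1], g)]
        elif n.op == "mul":
            a, b = n.inputs
            parts = [(a, mul(g, b)), (b, mul(g, a))]
        else:
            parts = []          # 'var' and 'const' have no inputs
        for inp, c in parts:
            prev = adjoint.get(id(inp))
            adjoint[id(inp)] = c if prev is None else add(prev, c)
    return adjoint.get(id(wrt), const(0.0))


def evaluate(node, env):
    """Interpret a graph for given variable values."""
    if node.op == "var":
        return env[node.name]
    if node.op == "const":
        return node.value
    a, b = (evaluate(i, env) for i in node.inputs)
    return a + b if node.op == "add" else a * b


x = var("x")
f = add(mul(x, x), mul(const(3.0), x))    # f(x) = x*x + 3x
dfdx = grad(f, x)                         # a graph for 2x + 3
print(evaluate(dfdx, {"x": 2.0}))         # 7.0
```

The result of `grad` is itself a graph of the same node types, so it can be optimized and compiled by the same passes as the forward computation.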
TensorFlow's [4] XLA framework enables compilation and execution of TensorFlow graphs on novel hardware such as NNP. During the execution of a computation via XLA, an HLO IR of the TensorFlow computation is sent to a device such as a CPU, GPU or novel hardware. Our bridge plugin registers itself as a new XLA device, maps HLO IR to nGraph IR, and returns a compiled function. During the execution of the TensorFlow computation, the function is invoked by TensorFlow on the input data and the resulting nGraph output is returned.
For neon, we are creating a Python binding for the nGraph API, which we hope to also use with other Python-based frameworks.
TRANSFORMERS
The IR generated by the nGraph library is passed to a transformer for the generation of code optimized specifically for the selected backend.
These newly-optimized backends provide facilities for pattern matching, liveness analysis, memory management, and the combining of tensor-element layout and shape management with backend kernel selection.
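As an illustration of how liveness analysis can feed memory management, the sketch below assigns buffers to tensors along a fixed execution order and reuses a buffer once its tensor has been read for the last time. It is a simplified, hypothetical example: `last_uses` and `assign_buffers` are not nGraph functions, and all tensors are assumed to be the same size.

```python
def last_uses(ops):
    """Map each tensor name to the index of the last op that reads it.

    ops: list of (output_name, [input_names]) in execution order.
    """
    last = {}
    for i, (_, inputs) in enumerate(ops):
        for t in inputs:
            last[t] = i
    return last


def assign_buffers(ops, graph_outputs):
    """Greedy buffer assignment; returns (tensor -> buffer id, buffer count)."""
    last = last_uses(ops)
    free, buffer_of, next_id = [], {}, 0
    for i, (out, inputs) in enumerate(ops):
        # Allocate the output before releasing inputs, so an op never
        # writes into a buffer it is still reading.
        if free:
            buffer_of[out] = free.pop()
        else:
            buffer_of[out] = next_id
            next_id += 1
        for t in dict.fromkeys(inputs):          # de-duplicate, keep order
            dead = last[t] == i and t not in graph_outputs
            if dead and t in buffer_of:          # graph inputs own no buffer
                free.append(buffer_of[t])
    return buffer_of, next_id


ops = [
    ("a", ["x"]),        # a = f(x)
    ("b", ["a"]),        # b = g(a)
    ("c", ["a", "b"]),   # c = h(a, b); a and b are dead afterwards
    ("d", ["c"]),        # d = k(c)
]
buffer_of, count = assign_buffers(ops, graph_outputs={"d"})
print(buffer_of, count)  # {'a': 0, 'b': 1, 'c': 2, 'd': 1} 3
```

Four intermediate tensors fit in three buffers here. A production allocator would also account for tensor sizes, alignment, and in-place operations.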
The CPU transformer makes use of MKL-DNN, which produces optimized sequences of calls to highly optimized kernels. Optimizations provided by MKL-DNN are at a finer granularity than those provided by nGraph. The CPU transformer will also be used by other transformers for sub-graphs that use operations not supported by their backend.
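The fallback behavior can be sketched as a partitioning step. The example below assumes, purely for illustration, that the graph has already been linearized into a topologically ordered list of op names. The `partition` function, the `"device"` and `"cpu"` labels, and the op names are hypothetical.

```python
def partition(ops, supported):
    """Group an ordered op list into maximal runs per backend.

    ops: op-type strings in execution order.
    supported: set of op types the accelerator backend can execute.
    Returns a list of (backend, [ops]) pairs.
    """
    parts = []
    for op in ops:
        backend = "device" if op in supported else "cpu"
        if parts and parts[-1][0] == backend:
            parts[-1][1].append(op)      # extend the current run
        else:
            parts.append((backend, [op]))
    return parts


ops = ["convolution", "relu", "top_k", "dot", "softmax"]
parts = partition(ops, supported={"convolution", "relu", "dot", "softmax"})
for backend, run in parts:
    print(backend, run)
# device ['convolution', 'relu']
# cpu ['top_k']
# device ['dot', 'softmax']
```

Keeping each run as large as possible limits the number of transitions between backends, each of which may involve copying tensors between memories.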
Intel's NNP processor is tailored for deep learning workloads. Its transformer lets us make the fullest use of the hardware, falling back on the CPU transformer for unsupported operations.
The cuDNN [3] transformer dynamically generates code that links to the NVIDIA CUDA Deep Neural Network (cuDNN) library for common kernels, such as convolution or softmax. The nGraph library then compiles portions of the graph into LLVM IR representation, and uses the PTX-emitting backend of LLVM to generate the GPU assembly language PTX.
nGraph will natively support collective communication primitives (AllReduce, Gather, Broadcast), as well as point-to-point primitives as core graph ops. Transformers will generate the corresponding communication library calls. Transformers will support vanilla MPI, or provide optimized communication methods that impose restrictions. An example of such a restriction would be: operate only across a homogeneous cluster of a certain topology.
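For readers unfamiliar with the collective primitives named above, the following single-process sketch states their reference semantics over plain Python lists. It is not a communication library and does not use MPI. The function names and the list-of-lists representation of per-worker buffers are assumptions made for illustration.

```python
def all_reduce_sum(worker_buffers):
    """AllReduce (sum): every worker ends up with the elementwise sum."""
    length = len(worker_buffers[0])
    if any(len(b) != length for b in worker_buffers):
        raise ValueError("all workers must contribute equal-length buffers")
    total = [sum(vals) for vals in zip(*worker_buffers)]
    return [list(total) for _ in worker_buffers]


def broadcast(worker_buffers, root):
    """Broadcast: every worker receives a copy of the root's buffer."""
    return [list(worker_buffers[root]) for _ in worker_buffers]


def gather(worker_buffers, root):
    """Gather: the root receives all buffers; other workers receive nothing."""
    return [
        [list(b) for b in worker_buffers] if rank == root else None
        for rank in range(len(worker_buffers))
    ]


# Per-worker gradients in data-parallel training (made-up values).
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(all_reduce_sum(grads))      # [[9.0, 12.0], [9.0, 12.0], [9.0, 12.0]]
print(broadcast(grads, root=0))   # [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
```

Treating these operations as graph ops with well-defined semantics is what allows a transformer to choose between a generic implementation and one specialized for a particular cluster topology.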
