Frameworks for writing, compiling, and optimizing deep learning (DL) models have recently enabled progress in areas like computer vision and natural language processing. Extending these frameworks to accommodate the rapidly diversifying landscape of DL models and hardware platforms presents challenging tradeoffs between expressiveness, composability, and portability. We present Relay, a new intermediate representation (IR) and compiler framework for DL models. The functional, statically-typed Relay IR unifies and generalizes existing DL IRs and can express state-of-the-art models. Relay's expressive IR required careful design of the type system, automatic differentiation, and optimizations. Relay's extensible compiler can eliminate abstraction overhead and target new hardware platforms. The design insights from Relay can be applied to existing frameworks to develop IRs that support extension without compromising on expressivity, composability, and portability. Our evaluation demonstrates that the Relay prototype can already provide competitive performance for a broad class of models running on CPUs, GPUs, and FPGAs.
INTRODUCTION
Deep learning (DL) has radically transformed domains like computer vision and natural language processing (NLP) [Redmon et al. 2015; Young et al. 2017] . Inspired by these successes, researchers and companies are continually experimenting with increasingly sophisticated DL models and developing increasingly specialized hardware backends. DL frameworks for writing, optimizing, and compiling DL models reduce the complexity of these tasks, which in turn accelerates DL research and product development.
Popular DL frameworks offer different tradeoffs between expressivity, composability, and portability in the design of their intermediate representations (IRs) [Abadi et al. 2016; Bergstra et al. 2010; ...] [...] first-class functions, and references. In addition to improving expressivity, incorporating these features as language constructs allows optimizations to more readily compose.
• By representing DL models as functional programs, we reframe traditional DL framework problems as compiler problems. Backpropagation becomes a source code transformation, transforming an arbitrary Relay function into its gradient function; ad hoc shape inference becomes principled type inference; graph rewriting becomes program optimization; and the executor becomes (depending on what the context demands) an interpreter, virtual machine, or ahead-of-time compiler. Using this correspondence, we can adapt existing PL techniques to the DL framework domain.
• A notable example of this approach is Relay's type system (Section 3.3). Since operators have complicated semantics, shape inference is usually performed when shapes are fully concrete; however, at compile time, one does not have that luxury. We therefore extend a Hindley-Milner type system with type relations that encode shape constraints induced by operators. This allows Relay passes to reason about shape information at compile time.
• To provide portability, we define a platform-agnostic operator language and a compiler pass manager, which facilitates the development of passes that transform the IR to target new hardware backends.
We illustrate Relay's extension mechanisms through a series of case studies:
• We demonstrate that Relay subsumes the expressiveness of previous framework IRs by writing model importers for TensorFlow, PyTorch, MxNet, NNVM, and ONNX (Section 4.1).
• We provide a source-to-source AD algorithm that supports higher-order functions and higher-order derivatives, without the need for delimited continuations [Wang et al. 2018] (Section 4.2). To improve the efficiency of this pass, we compose it with an effect-aware partial evaluator (Section 4.3).
• Using our optimization infrastructure, we define a polymorphic quantization pass that allows users to easily specify new quantization schemes and automatically generate and optimize quantized operators, instead of requiring manual adaptations for each bit width (Section 4.5).
We evaluate Relay on several systems (x86 CPUs, ARM CPUs, NVIDIA GPUs, Xilinx FPGAs) and over diverse vision and NLP workloads to demonstrate that (1) Relay enables composability of graph-level optimizations, (2) Relay delivers performance on inference tasks competitive with state-of-the-art frameworks (TensorFlow, PyTorch, MxNet), and (3) Relay provides portability over difficult-to-compile-to hardware backends such as FPGAs.
Relay is an open source academic project. It has been deployed at a popular web service provider, a telecommunications and consumer electronics manufacturer, and a social media company, among other companies. It is used to deploy efficient machine learning to both commodity and custom hardware with minimal engineering effort.
BACKGROUND
In this section, we provide a brief background on contemporary deep learning, popular deep learning frameworks, and state-of-the-art models being developed.
Deep Learning
Deep learning (DL) has been used to achieve state-of-the-art results in applications ranging from computer vision to game playing to natural language processing. Deep learning provides a collection of techniques for learning functions that are difficult to program directly.
Jared Roesch, Steven Lyubomirsky, Marisa Kirisame, Josh Pollock, Logan Weber, Ziheng Jiang, Tianqi Chen, Thierry Moreau, and Zachary Tatlock
For example, suppose one wants to construct a function f (x) to extract building addresses from images of a street. One approach to writing f (x) explicitly might be to write programs to identify regions of the image that contain information pertinent to addresses (like street names or house numbers), partition those regions into individual characters, and identify each character. One may then write a program that chains together many such modules into a working system. It would require hundreds, if not thousands, of human-hours of work and dozens of heuristics to write any of those programs by hand, and there is no guarantee it will perform well. By contrast, deep learning has, with relatively little code, been used to approximate this entire recognition task with a single model. The learned system outperforms all hand-crafted solutions [Goodfellow et al. 2013] .
A programmer solving a problem with DL performs three steps. First, she specifies a parametric function F(θ, x) (often called a model, neural network, or simply network) that approximates the function computing the desired value. She then trains the model by applying an optimization algorithm to find a set of parameters θ that results in an acceptable approximation of the function on a collection of input-output pairs. Finally, she uses her learned function to produce outputs for new inputs in a process called inference.
In practice, DL engineers write procedural programs that manipulate tensors, multi-dimensional arrays of numbers. These restrictions allow DL practitioners to take advantage of statistics, linear algebra, and real analysis. In order to search for an assignment of parameters that produces a good approximation, we must first define what a "good" approximation of f is and then find an algorithm that optimizes this metric. The quality of an approximation is typically defined in terms of some ground truth to compare against F (θ, x). This ground truth is usually provided in the form of a training set, a set of input-output pairs that has either been manually collected or extracted from an existing data set. The function that scores the output of a network with respect to a ground-truth label is called the loss function, L : (input, output) → network → R ≥0 , which typically takes in an input-output pair and network and produces a non-negative real number. One method to optimize the network is to take the gradient of the loss function composed with the candidate network for some input-output pair. The gradient of a function points in the direction of steepest change, so subtracting some proportion of the gradient from the current parameters reduces the loss of the network. This process is known as stochastic gradient descent (SGD), and it is the basis for many deep learning optimization algorithms used today.
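More formally, training updates the model parameters using the following algorithm:
$$\theta_{i+1} := \theta_i - \varepsilon \, \nabla_{\theta} L\big((\mathit{input}, \mathit{output}),\ F(\theta_i, \cdot)\big)$$
for some small, positive ε. This update cycle is repeated in training until the error is acceptable.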
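As a concrete illustration of this update cycle, the following minimal sketch runs SGD on a two-parameter linear model with a squared-error loss. The model `F`, the functions `loss` and `grad_loss`, the synthetic training set, and the step size `eps` are all illustrative assumptions and are not taken from Relay.

```python
import random

def F(theta, x):
    """Hypothetical model: a line with slope theta[0] and intercept theta[1]."""
    return theta[0] * x + theta[1]

def loss(pair, theta):
    """Squared error of the model on one (input, output) pair; never negative."""
    x, y = pair
    return (F(theta, x) - y) ** 2

def grad_loss(pair, theta):
    """Gradient of `loss` with respect to theta, derived by hand."""
    x, y = pair
    err = F(theta, x) - y
    return [2.0 * err * x, 2.0 * err]

def sgd(training_set, theta, eps=0.05, steps=5000, seed=0):
    """Repeatedly subtract eps times the gradient at a randomly chosen pair."""
    rng = random.Random(seed)
    for _ in range(steps):
        pair = rng.choice(training_set)
        g = grad_loss(pair, theta)
        theta = [t - eps * gi for t, gi in zip(theta, g)]
    return theta

# Training set sampled from the target function f(x) = 3x + 1.
training_set = [(i / 10.0, 3.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
theta = sgd(training_set, [0.0, 0.0])
```

After training, `theta` should be close to the slope 3 and intercept 1 that generated the data, and `F(theta, x)` can then be used for inference on new inputs.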
The final values of the parameters are then used for inference. While training and inference were once performed on the same machine, it is more common today to train models on high-powered devices such as a fleet of cloud machines or a custom GPU farm, since training is very computationally intensive. The values learned from training can then be deployed for inference on less powerful systems, such as a single GPU, a mobile phone, or an FPGA.
Deep Learning Framework Design and Limitations
Popular deep learning frameworks began with designs optimized for different tradeoffs between expressivity, performance, and portability. In the early days of deep learning, users would program in general purpose languages like Python and utilize scientific computing libraries like NumPy, which provides low-level operators such as matrix multiplication. Operators, also called kernels, are dense linear algebra primitives like matrix multiplication, elementwise functions like tanh, and complex, domain-specific operations like image convolution. Operator execution dominates the execution time of a deep learning model: many operators are asymptotically slow and their input is often large. While a machine learning framework may be written as a library in a high-level language like Python or Swift, operators will typically be provided as opaque function calls to specialized implementations written in a lower-level language like C++.
In order to obtain better performance, researchers began utilizing specialized accelerators. To expose accelerators to end-users without needing to write in low-level languages, such as CUDA, researchers designed frameworks like Theano [Bergstra et al. 2010] . These frameworks represent computations using dataflow graphs (treating mathematical operations on data as nodes) and compile these graphs to deploy to accelerators like the GPU. "Computation graphs" provide a limited programming model, enabling efficient deployment. Computation graphs have since been adopted as the fundamental building block of modern machine learning libraries.
There are two dominant designs for computation graph-based frameworks. The first is the declarative design employed by TensorFlow. Such designs extend pure data-flow with ad hoc control operations to emulate the functionality of if and while. This approach is called define-then-run and employs static computation graphs. A framework using a define-then-run representation has access to the entire graph before execution, providing more opportunities to optimize the program as well as simplifying deployment since the program can be executed without the host language. However, the control flow structures are less expressive than those supported by the host language, frustrating researchers who wish to experiment with complex models. For example, TensorFlow supports loops, but the elaborated graph has little resemblance to the input program, requiring optimization authors to reason about the elaborated form instead of familiar loops. The encoding also requires ad hoc, special purpose operators such as NextIteration and the addition of special control-edges to the graph. There are no generic mechanisms for users to define new control flow combinators (e.g., fold) or data types.
The second approach is used by PyTorch, where the host language (e.g. Python) dynamically constructs a computation graph. This approach is called define-by-run and employs dynamic computation graphs. An arbitrary host program executes dynamically, generating a computation graph as a by-product, allowing for use of all host language features. However, by not encoding control constructs in its IR, a framework using define-by-run cannot reason about or optimize control flow structures. Not only does this leave performance on the table, but it prevents deployment to certain edge devices, since they may have control flow or memory requirements that cannot be guaranteed by a fully dynamic control plane.
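The limitation of define-by-run can be seen in a toy tracer. The sketch below is a hypothetical illustration, not PyTorch's actual mechanism: the `Traced` class, `trace` function, and `model` are invented names. The host-language `if` runs normally, but only the operations on the taken branch are recorded, so the resulting graph contains no trace of the branch itself.

```python
class Traced:
    """Wraps a concrete value and records each operation applied to it."""

    def __init__(self, value, graph, name):
        self.value = value
        self.graph = graph
        self.name = name

    def _record(self, op, constant, result):
        # Append one graph node: (result name, operator, input name, constant).
        name = f"t{len(self.graph)}"
        self.graph.append((name, op, self.name, constant))
        return Traced(result, self.graph, name)

    def __mul__(self, constant):
        return self._record("mul", constant, self.value * constant)

    def __add__(self, constant):
        return self._record("add", constant, self.value + constant)

def model(x):
    if x.value > 0:      # ordinary Python branch; it never reaches the graph
        return x * 2
    return x + 1

def trace(fn, input_value):
    """Run fn on a traced input and return the graph built as a by-product."""
    graph = []
    fn(Traced(input_value, graph, "x"))
    return graph

graph_pos = trace(model, 3.0)
graph_neg = trace(model, -3.0)
```

Each run yields a different single-node graph, and neither graph records that a conditional was involved, so an optimizer that sees only the graph cannot reason about the control flow.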
Dynamic Neural Networks
One area in which deep learning has made significant advances is natural language processing (NLP), such as finding keywords in a research paper, determining the sentiment of a tweet, or summarizing a news article. Reasoning about text requires context-sensitive analysis and data of non-fixed dimensions, unlike in many vision tasks. To allow for context-sensitive analysis, DL researchers have developed networks with persistent state, known as recurrent neural networks (RNNs). Like the simple networks described earlier, these networks have an input and an output; however, they take an additional input and produce an additional output known as the hidden state. Beginning with an initial hidden state and a list of inputs (for example, characters from a source text), one can produce a list of outputs (for example, a translation of the source text). Recurrent neural networks have found use not only in NLP, but also in speech recognition, music transcription, eSports, and other areas [Graves et al. 2013; Hochreiter and Schmidhuber 1997; OpenAI 2018] .
Unfortunately, since most machine learning frameworks rely on computation graphs, which cannot represent recursion, RNNs are usually finitely unrolled to a fixed depth. This may be acceptable if the depth can be determined statically and the loop unrolled ahead of time; however, if the depth depends on runtime values or complex control flow, unrolling must be performed dynamically.
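The difference between fixed-depth unrolling and a recursive definition can be sketched as follows. The scalar `cell` function and its weights are hypothetical stand-ins for a real recurrent cell; the point is only that the unrolled version is tied to one input length, while the recursive version takes its depth from the runtime input.

```python
import math

def cell(hidden, x, w_h=0.5, w_x=1.0):
    """Hypothetical scalar recurrent cell: returns (output, new hidden state)."""
    new_hidden = math.tanh(w_h * hidden + w_x * x)
    return new_hidden, new_hidden

def rnn_unrolled_3(hidden, xs):
    """Fixed-depth unrolling: only valid for inputs of length exactly 3."""
    assert len(xs) == 3
    o0, hidden = cell(hidden, xs[0])
    o1, hidden = cell(hidden, xs[1])
    o2, hidden = cell(hidden, xs[2])
    return [o0, o1, o2], hidden

def rnn_recursive(hidden, xs):
    """Recursive definition: depth is determined by the runtime length of xs."""
    if not xs:
        return [], hidden
    out, hidden = cell(hidden, xs[0])
    rest, hidden = rnn_recursive(hidden, xs[1:])
    return [out] + rest, hidden
```

Both functions agree on length-3 inputs, but only `rnn_recursive` accepts a sequence of any other length.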
Hardware Backends
If a programmer wants to experiment with a new hardware device, she must manually account for variations in hardware intrinsics, data types, and data layout. This is only possible if the device is supported by her framework of choice. Even with the addition of device-specific code, there is no guarantee performance will be acceptable, let alone optimal (or even that there will not be regression). Many existing IRs also do not support data-dependent control flow. If she cannot capture her model (e.g., an RNN) in the IR, she cannot deploy to hardware backends without requiring a host to drive execution. In order to effectively use these devices, engineers often redesign the model from scratch to better match the target platform's intrinsics and design.
DESIGN

Compiler Framework
The Relay pipeline can be split into three classic pieces: the frontend, where input formats are translated to Relay; the compiler, which type checks Relay ASTs, applies optimizations, and compiles operators; and the backend, where an execution mechanism is selected and available hardware accelerators are utilized.
3.1.1 Frontend. There are several ways to write a Relay program. A user can build an in-memory representation of a program in C++ or Python; parse one written in the Relay text format; or load one from the on-disk serialization format, similar in design to LLVM's bitcode. Models from popular frameworks, including TensorFlow, PyTorch, MxNet, Keras, and DarkNet, as well as interchange formats, such as ONNX, may be imported directly into Relay.
3.1.2 Compiler. Once a Relay abstract syntax tree (AST) is produced, the program is optimized by applying a series of Relay-to-Relay passes. Between each pass, Relay performs type inference and checking, rejecting malformed programs as well as populating shape and type information that passes can utilize. Relay optimizations consist of both traditional compiler optimizations and domain-specific optimizations. Traditional compiler optimizations include constant folding, common subexpression elimination, and dead code elimination. DL-specific optimizations include operator fusion, quantization, layout transformation, and accelerator-specific optimizations. We discuss a subset of these optimizations in greater detail in Sec. 4.
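To make the idea of an IR-to-IR pass concrete, here is a minimal sketch of constant folding over a toy expression AST. The `Const`, `Var`, and `Call` classes, the `OPS` table, and `run_passes` are hypothetical and far simpler than Relay's actual AST and pass manager; in particular, this sketch omits the type inference that Relay runs between passes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Const:
    value: float

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Call:
    op: str        # operator name, looked up in OPS
    args: tuple    # argument expressions

# Toy scalar operators standing in for tensor operators.
OPS = {"add": lambda a, b: a + b, "multiply": lambda a, b: a * b}

def constant_fold(expr):
    """Rewrite bottom-up, evaluating calls whose arguments are all constants."""
    if isinstance(expr, Call):
        args = tuple(constant_fold(a) for a in expr.args)
        if all(isinstance(a, Const) for a in args):
            return Const(OPS[expr.op](*(a.value for a in args)))
        return Call(expr.op, args)
    return expr

def run_passes(expr, passes):
    """Apply each pass in order; every pass maps an AST to an AST."""
    for p in passes:
        expr = p(expr)
    return expr

# add(%x, multiply(2, 3)) folds to add(%x, 6).
program = Call("add", (Var("x"), Call("multiply", (Const(2.0), Const(3.0)))))
folded = run_passes(program, [constant_fold])
```

Because every pass has the same AST-to-AST shape, passes can be reordered or composed freely, which is the property the pipeline described above relies on.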
Relay produces machine-specific code by decomposing the problem of code generation into multiple distinct phases. Since Relay is a high-level IR, it depends on a low-level code generator, such as TVM or Halide, to produce dense linear algebra kernels [Chen et al. 2018a; Ragan-Kelley et al. 2013]. We use TVM in our experiments. Low-level kernel compilers focus on generating highly efficient operators. The generated kernels have a fixed calling convention and do not handle allocation. Instead, they expect inputs and outputs to be preallocated. From an optimized AST, the compiler extracts a set of Relay operators, translates them to TVM expressions, and then compiles to available hardware targets. The resulting output is an object file that contains the compiled operators and a Relay program that invokes these primitives. In our prototype implementation, we are able to target CPU, GPU, iOS and Android mobile devices, custom accelerators, and FPGAs.
[Figure: excerpt of the Relay expression grammar, partially recovered: Expr e ::= %l | ... (call) | let %l (: τ)? = e; e (let) | e; e (let %_ = e; e) | %graph = e; e (graph let)]
3.1.3 Backends. After primitive operators are lowered, the remaining Relay program is the glue that ties together operator invocations, allocation, control-flow, recursion, and high-level data structures. There are multiple options for executing the combined full program: the Relay interpreter (with JIT compilation), the TVM graph runtime, and an experimental Relay ahead-of-time compiler (discussed in Sec. 4.7) that converts programs to C++.
IR
The Relay IR is a high-level, functional, differentiable language. One can understand Relay by starting from a subset of Relay that represents an idealized computation graph IR and incrementally growing to the full Relay IR. A computation graph, in its simplest presentation, is a directed acyclic graph with multiple inputs and a single output. The syntax of an equivalent computation graph is realized by a language with three rules: (1) variables, (2) function calls, and (3) operators; see Figure 1 for the corresponding rules.
3.2.1 Multi-Output. This subset lacks useful features that are present in IRs used in practice. For example, common operators such as split, which splits a tensor along a particular axis, require multiple outputs. In order to handle these programs, computation graph IRs have added primitive support for multiple outputs. Multiple outputs can be modeled as tuples, which can be added with just two rules (1) tuple formation and (2) tuple projection.
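The computation graph subset extended with tuples can be sketched as a small AST with an evaluator. The class names, the `OPS` table, and the list-based "tensors" below are hypothetical simplifications; for brevity, operator application is folded into a single `Call` node rather than separating function calls from operators.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Call:
    op: str        # operator name, looked up in OPS
    args: tuple    # argument expressions

@dataclass(frozen=True)
class TupleExpr:   # tuple formation
    fields: tuple

@dataclass(frozen=True)
class TupleGet:    # tuple projection
    tup: object
    index: int

# Toy operators over Python lists; `split` returns two outputs as a tuple.
OPS = {
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
    "split": lambda a: (a[: len(a) // 2], a[len(a) // 2:]),
}

def evaluate(expr, env):
    """Evaluate an expression given bindings for its free variables."""
    if isinstance(expr, Var):
        return env[expr.name]
    if isinstance(expr, Call):
        return OPS[expr.op](*[evaluate(a, env) for a in expr.args])
    if isinstance(expr, TupleExpr):
        return tuple(evaluate(f, env) for f in expr.fields)
    if isinstance(expr, TupleGet):
        return evaluate(expr.tup, env)[expr.index]
    raise TypeError(f"unknown expression: {expr!r}")

# add(split(%x).0, split(%x).1): the two outputs of split are projected apart.
halves = Call("split", (Var("x"),))
program = Call("add", (TupleGet(halves, 0), TupleGet(halves, 1)))
result = evaluate(program, {"x": [1, 2, 3, 4]})
```

Note that this tree-walking evaluator recomputes `split` once per use; it has no notion of sharing, which is the gap the let binding in the next subsection addresses.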
3.2.2 Let. By construction, computation graphs enjoy implicit sharing of subcomputations via multiple outgoing dependency edges. Implicit sharing is useful for both execution and analysis, because it enables users to uniquely identify subgraphs. Previous frameworks often obtain sharing by using a host language's name binding to construct a graph. General purpose programming languages, on the other hand, provide explicit sharing via binding constructs, such as let. In programs free of scope, ordering, and effects, implicit sharing and explicit sharing are semantically equivalent. However, in the presence of these three, implicit sharing does not adequately preserve the semantics of effects, since their ordering is not well-defined. Since user programs contain scope, ordering, and effects in practice, previous systems have been forced to provide workarounds. For example, TensorFlow's eager mode inserts dummy control edges in its generated graphs to impose an ordering on effects. The lack of lexical scope in traditional graphs complicates language features such as first-class functions and control-flow [Moses 1970; Sandewall 1971] . The lack of explicit scoping information also weakens the ability to provide precise versions of traditional analyses, such as liveness. The addition of a humble let binding, well studied in programming languages, enables explicit sharing and provides an elegant solution to the problems outlined above.
3.2.3 Control Flow. Neural networks increasingly rely on control flow, forcing frameworks based on computation graph IRs to support this construct; however, control flow extensions are generally ad hoc. Even in the presence of control flow-free models, looping constructs are necessary to implement optimization algorithms such as SGD. Furthermore, emerging architectures are beginning to make greater use of control flow, with many architectures exposing custom control combinators such as loops, maps, folds, and scans. The central challenge is a flexible and extensible encoding of control flow operations. The functional programming community has demonstrated that recursion and pattern matching are sufficient to implement arbitrary combinators for control flow and iteration. To support the definition of functional loops, we enrich Relay with two more language features: if and first-class recursive functions. The guard expression in Relay's if expression operates over rank-0 boolean tensors, which represent booleans.
3.2.4 First-Class Functions. A computation graph is a single expression from multiple inputs (i.e., its free variables) to multiple outputs. While it may be tempting to reinterpret a graph as a function, it lacks functional abstraction and named recursion. Adding the ability to name functions and pass them as first-class values dramatically increases Relay's expressivity, allowing it to encode generic higher-order functions and readily use techniques used in functional compilers like automatic deforestation. First-class functions also enable passes such as automatic differentiation, which we discuss in Section 4.2, and the simpler implementation of importers that map higher-level programs to the Relay IR.
3.2.5 Data Abstraction. Earlier, we extended the language with tuples to emulate behavior of existing IRs. Deep networks require additional data types like lists, trees, and graphs [Karpathy 2015; Liang et al. 2016; Tai et al. 2015]. Relay borrows a generic and principled way to extend a language with new data types: algebraic data types (ADTs). To support them we add (1) a type declaration mechanism and (2) pattern matching. One may question why Relay has if when it could be subsumed by match: if is still necessary because tensors are primitives, not ADTs.
The resulting language is a familiar strict functional language, resembling the core of languages like OCaml and SML. Our language makes domain-specific deviations from existing work, and we have provided a full listing of its syntax, operational semantics, and type rules in the appendix. A functional language provides a few notable advantages. Its pure fragment represents idealized computation graphs free from effects. This fragment can be easily optimized by end users who can reason about it as pure dataflow. For this reason, Relay is pure by default but exposes a limited form of mutation via ML-style references that we have primarily used for automatic differentiation (see Sec. 4.2).
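The claim that recursion, pattern matching, and algebraic data types suffice to define control combinators can be illustrated with a list ADT and a user-defined fold. This is a sketch in Python rather than Relay's own syntax; `Nil`, `Cons`, `foldl`, and `map_list` are hypothetical names, and `isinstance` checks stand in for pattern matching.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Nil:
    """Empty list constructor."""

@dataclass(frozen=True)
class Cons:
    """Non-empty list constructor: a head element and a tail list."""
    head: Any
    tail: Any

def from_python(xs):
    """Build the ADT list from a Python list, preserving order."""
    out = Nil()
    for x in reversed(xs):
        out = Cons(x, out)
    return out

def foldl(f, acc, xs):
    """Left fold defined only with recursion and a match on the constructor."""
    if isinstance(xs, Nil):
        return acc
    return foldl(f, f(acc, xs.head), xs.tail)

def map_list(f, xs):
    """map built from fold: accumulate in reverse, then reverse again."""
    rev = foldl(lambda acc, x: Cons(f(x), acc), Nil(), xs)
    return foldl(lambda acc, x: Cons(x, acc), Nil(), rev)

def to_python(xs):
    """Convert the ADT list back to a Python list."""
    return foldl(lambda acc, x: acc + [x], [], xs)

total = foldl(lambda acc, x: acc + x, 0, from_python([1, 2, 3, 4]))
doubled = to_python(map_list(lambda x: 2 * x, from_python([1, 2, 3])))
```

Because `foldl` is an ordinary function, new combinators such as `map_list` are defined in user code instead of being added as special-purpose graph operators.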
Relay is more expressive than many previous frameworks and this expressivity introduces new challenges. Previous essential functionality such as shape inference and automatic differentiation must be adapted for our new IR. How does one reason about the shapes of operators when the input is unknown? How does one backpropagate over pattern-matching, control, data types, and mutation? In the following subsection we demonstrate how one can adapt techniques from type inference and checking to Relay, and in Section 4 we examine how to adapt other necessary functionality such as automatic differentiation and operator fusion.
[Figure: ∆; Γ ⊢ e : τ (expression e has type τ in type context ∆ and variable context Γ). Examples of Relay's typing inference rules, namely the rules for function definitions and function calls, where ∆ is the environment for types and Γ is the environment for variables. These demonstrate that type relations must hold at each call site.]
Type System
Computation graph IRs rely on typing in the form of datatype and shape inference. Datatype and shape inference is the process of computing the concrete datatypes (e.g., float32, int32) and shapes (e.g., (10, 5), (100, 1, 32)) of all tensors in a computation graph. Deep learning frameworks and compilers use static shape information to perform allocation, check correctness, and facilitate optimization. Precise static shape information is also valuable for traditional loop optimizations, data layout transformations, tensorization, and optimizations that are necessary to map to hardware accelerators' unique ISAs. Shape inference is usually formulated as a simple analysis over the dataflow graph that propagates shape information. Shape inference looks remarkably similar to type inference. Unlike type inference, though, shape inference is separate from the type system and does not provide types for functions or data structures. Handling shape inference at compile time is desirable, because it allows optimizations to take advantage of this information even though certain shapes may be symbolic. Can shape information be encoded in static types? It is possible to model arbitrarily complex static properties, such as shape information, with dependent type theory, but such a design incurs significant user complexity. Relay's type system is designed to balance the desire for static tensor shapes without limiting the language's expressiveness. In this subsection we describe how to extend a polymorphic type system with shape information and type inference with shape inference.
3.3.1 Tensor Types. The primitive value in Relay is a tensor, which has type Tensor[s, bt] where s is a shape and bt is a base type. Elements of base type are floating point numbers and integers of specific bit widths and number of lanes. This design decision is inspired by LLVM, which supports arbitrary-width integer types. The parameterization by lanes helps represent vectorized data types, which are supported by many CPUs and hardware accelerators. To ensure Relay can offload tensor computation to devices with greatly varying architectures, Relay's kinding rules only permit tensors to contain base types, preventing, for example, tensors of closures.
The shape of a tensor is a tuple of integers describing the tensor's dimensions. In general, these dimensions may depend on arguments to an operator. A dimension may be a variable or arithmetic expression that indicates how the output shape of an operator depends on those of its inputs. Functions may be polymorphic over shapes, which results in shape constraints that must be solved during type inference. Sec. 3.3.3 describes the process. Relay also supports a special shape called Any, which is used to indicate that we do not have static shape information about a particular dimension.
3.3.2 Operators and Type Relations. A difference between general purpose programming models and those tailored to deep learning is the use of operators as the primitive unit of computation. The ability to add new operations to Relay requires a type system that can adapt to complex shape relationships between input and output types. Many operators have types that can be defined as functions of the input types. Unfortunately, some are not only functions, but also relations that specify constraints between input and output shapes. A key extension of Relay over traditional type systems is the addition of type relations to express these constraints. When developers add a new operator to Relay, they may constrain its type with existing relations or add their own. Function types (including those of operators) may include one or more type relations over an arbitrary subset of the argument types and the return type. The type checker enforces that these relationships hold at the call site. These relations may be viewed as a verification condition induced at a function call site, where the formula is a conjunction of the relations. For example, primitive operators are assigned types that are universally quantified over both the input and output types. We can then use a type relation to encode a constraint that must hold later when type checking observes specific input and output types. Type relations are opaque in the Relay IR: they are implemented in the meta-language and registered when defining an operator. However, they may be reused across different implementations. For example, we use a relation that describes the broadcasting rule for all elementwise operations.
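A broadcasting type relation of the kind mentioned above might look like the following sketch. It assumes NumPy-style broadcasting (dimensions are aligned from the right and must be equal or 1), which the text does not specify; `TensorType`, `broadcast_shape`, and `broadcast_rel` are hypothetical names and do not reflect Relay's registration API.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class TensorType:
    shape: Tuple[int, ...]
    dtype: str

def broadcast_shape(a, b):
    """Return the broadcast of shapes a and b, or None if they are incompatible.

    Assumes NumPy-style rules: align dimensions from the right; a missing
    dimension or a dimension of size 1 stretches to match the other.
    """
    out = []
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            return None
        out.append(max(da, db))
    return tuple(reversed(out))

def broadcast_rel(lhs, rhs, out):
    """Type relation: holds iff `out` is the broadcast of `lhs` and `rhs`."""
    if lhs.dtype != rhs.dtype:
        return False
    shape = broadcast_shape(lhs.shape, rhs.shape)
    return shape is not None and out == TensorType(shape, lhs.dtype)

f32 = "float32"
ok = broadcast_rel(TensorType((10, 5), f32), TensorType((5,), f32),
                   TensorType((10, 5), f32))
```

One relation of this form can be attached to every elementwise binary operator, which is the reuse the paragraph above describes.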
3.3.3 Type Inference. The most interesting parts of the type system are where shape computation occurs. We highlight a few examples of Relay's inference rules in Fig. 3 ; the full typing rules can be found in the appendix. In this subsection we focus on design decisions behind Relay's type system and the implementation of type inference.
To incorporate type relations into Relay's type system, we enrich a Hindley-Milner type inference algorithm with a constraint solver for type relations. Relay's inference algorithm has three steps: first it performs a pass over the AST generating types (potentially involving type variables) as well as populating the set of relations, then it solves the incurred constraints, and finally it assigns types to each expression in the AST. A type relation is implemented as a function in the meta-language and represents the symbolic relations between the input and output types of an object-language function.
When the type inference algorithm visits a function call site, the function's type relations are instantiated to the types at the call site and added to a queue of relations waiting to be solved. The relationships between the call's type variables and its relations are added to a bipartite undirected dependency graph where the two disjoint sets are type variables and type relations. Traditional unification constraints are represented using a modified union-find structure that integrates with the dependency graph.
Once the queue is populated, the algorithm will dequeue a relation and attempt to solve it. There are two cases when solving a type relation:
(1) If all the relation's type variables are concrete, we call the type relation function. If that function returns true, the constraint is discharged. Otherwise, type checking fails.
(2) If any type is fully or partially symbolic, the algorithm will propagate existing concrete type information via unification. All relations affected by new assignments to type variables (as determined by the dependency graph) are moved to the beginning of the queue. If the current type relation is now completely solved, we discard it to avoid unnecessarily visiting it again.
Our fine-grained dependence graph provides the transitive dependencies between relations and unification variables. The use of fine-grained dependencies enables our algorithm to only retry a minimal number of relations when we learn a new variable assignment. We run this to fixpoint or until the queue is empty. If the queue is not empty and no progress is made between iterations, then at least one variable is underconstrained and inference fails. Note that a type relation's implementation can compromise type soundness, as they are axiomatic descriptions of operations implemented outside of Relay. Luckily, in practice, the number of type relations needed to express most of Relay's operators is relatively small, and their implementations are generally straightforward and amenable to exhaustive testing.
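The queue-based solving loop described above can be sketched as follows. This is a heavily simplified illustration, not Relay's implementation: type variables hold whole shapes, there is no union-find structure, and each relation both checks and propagates by mutating its variables. The names `TypeVar`, `same_shape`, `dense`, and `solve` are hypothetical.

```python
from collections import deque

class TypeVar:
    """Unification variable for a shape; `value` stays None until it is solved."""
    def __init__(self, name, value=None):
        self.name = name
        self.value = value

def same_shape(a, b):
    """Relation: a and b have the same shape (e.g. an elementwise unary operator)."""
    if a.value is None:
        a.value = b.value            # propagate (a no-op if b is also unknown)
    elif b.value is None:
        b.value = a.value
    elif a.value != b.value:
        raise TypeError(f"{a.name} and {b.name} have different shapes")

def dense(data, weight, out):
    """Relation for a hypothetical dense operator: (n, k) and (m, k) give (n, m)."""
    if data.value is None or weight.value is None:
        return                       # not enough information to propagate yet
    (n, k), (m, k2) = data.value, weight.value
    if k != k2:
        raise TypeError("inner dimensions disagree")
    if out.value is None:
        out.value = (n, m)
    elif out.value != (n, m):
        raise TypeError("output shape violates the dense relation")

def solve(relations):
    """relations: list of (relation_function, [TypeVar, ...]) pairs."""
    deps = {}                        # variable name -> indices of relations using it
    for idx, (_, tvars) in enumerate(relations):
        for v in tvars:
            deps.setdefault(v.name, []).append(idx)
    queue = deque(range(len(relations)))
    stalled = 0                      # consecutive visits that made no progress
    while queue:
        idx = queue.popleft()
        fn, tvars = relations[idx]
        before = [v.value for v in tvars]
        fn(*tvars)                   # checks or propagates; raises on violation
        changed = [v for v, old in zip(tvars, before) if v.value != old]
        solved = all(v.value is not None for v in tvars)
        if not solved:
            queue.append(idx)        # keep unsolved relations for a later retry
        if changed or solved:
            stalled = 0
        else:
            stalled += 1
            if stalled >= len(queue):
                raise TypeError("underconstrained type variable; inference fails")
        for v in changed:            # move affected relations to the front
            for dep in deps[v.name]:
                if dep != idx and dep in queue:
                    queue.remove(dep)
                    queue.appendleft(dep)

# %h = dense(%x, %w); %y = elementwise(%h), with the relations queued out of order.
x, w = TypeVar("x", (4, 8)), TypeVar("w", (16, 8))
h, y = TypeVar("h"), TypeVar("y")
solve([(same_shape, [h, y]), (dense, [x, w, h])])
```

In this example the `same_shape` relation is visited first and cannot make progress; once `dense` assigns `h`, the dependency map moves `same_shape` to the front of the queue and `y` is solved on the next step.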
CASE STUDIES
In this section we use Relay's design to implement the following: a model importer from popular frameworks, a generic automatic differentiation algorithm, a partial evaluator, a generic operator fusion algorithm, an automatic quantization framework, a set of optimizations for deep learning accelerators, and an ahead-of-time compiler.
Supporting Existing Frameworks
We evaluated Relay's expressivity by translating common frameworks and interchange formats to the Relay IR. Models can be imported from all major frameworks including TensorFlow, PyTorch, MxNet, and Keras, as well as interchange formats such as ONNX. Many frameworks have static computation graph-based representations, which are straightforward to translate to Relay. Each operator is mapped to an operator of the appropriate type, with multiple outputs represented as tuples. A greater challenge is translating frameworks with a richer computation model such as TensorFlow. TensorFlow supports control flow and includes TensorArray, a write-once Tensor container. We can extract the loop structure out of the TensorFlow graph, converting it to a Relay loop, and transform the TensorArray into a Relay list. There are also new IRs adopting approaches similar to Relay (see Section 6). We believe it will be possible to translate these to Relay once they become stable.
Higher-Order, Higher-Order Automatic Differentiation
An important property of machine learning computations is differentiability. Optimization algorithms like stochastic gradient descent require computing gradients of functions (Section 2). Previous automatic differentiation (AD) techniques used on computation graphs cannot be directly applied to Relay due to new language features such as closures, recursion, and control flow. Furthermore, it is becoming increasingly important to compute not only first-order gradients of functions but potentially nth-order gradients [Chen et al. 2018b; Liu et al. 2018]. Relay requires an algorithm that operates as a source code transformation and supports higher-order functions and higher-order gradients. There is a large body of work on performing AD of programs, including several full-length papers. This section highlights the important differences between our approach and other recent work. Previous approaches like Lantern [Wang et al. 2018] (see Section 6) define a generic and powerful version of AD. Lantern uses delimited continuations to implement AD. Delimited continuations are an elegant solution but require language and compiler support, as well as garbage collection. Furthermore, delimited continuations are challenging to reason about when performing optimization.
Our AD algorithm is conceptually similar to Lantern's, with a few key differences. First, our algorithm is defined as a source code transformation. Given an expression, Relay produces a corresponding expression that computes its gradient. Figure 4 provides a denotation from Relay expression to Relay expression that defines our AD algorithm. Second, our algorithm eschews delimited continuations in favor of an approach using closures and references. Relay simply pairs each tensor value with a reference that tracks the partial derivative of the output with respect to that value. This form of reverse-mode AD is similar to how one would implement forward-mode AD. Relay lifts all tensor-typed values to a pair: an expression of some tensor type T becomes a tuple of (T, Ref<T>) where the second component contains the sensitivity variable needed to compute the partial derivative. For each gradient function generated, Relay allocates a single reference which stores the "backpropagator," a closure which propagates gradients from the output to the input. Each subcomputation affecting the gradient updates this closure; when it is finally executed, the built-up closure returns the final derivatives with respect to the arguments.
As described in Figure 4 , only computations involving tensors contribute to the gradient. For example, we support mutability for free because mutation does not affect the gradients. In this sense, our design is simple. All tracing needed to compute derivatives is done at run time, enabling support for higher order functions, higher order gradients, data-dependent control flow, and mutability without requiring changes to the algorithm. Finally, Relay exposes this transformation as an operator, allowing users to compute the gradient of a function f simply by writing grad(f).
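The closure-and-reference scheme can be illustrated with a small Python sketch. It is an analogy rather than Relay's source transformation: Python floats stand in for tensors, operator overloading stands in for the rewrite in Figure 4, and the names `Ref`, `Var`, and this `grad` are hypothetical. Each lifted value carries a reference to its sensitivity, and a single shared reference holds the backpropagator closure, which every operation extends.

```python
class Ref:
    """A mutable cell, standing in for a Relay reference."""

    def __init__(self, value):
        self.value = value


class Var:
    """A value paired with a reference to its sensitivity (gradient)."""

    def __init__(self, value, bp):
        self.value = value
        self.grad = Ref(0.0)
        self.bp = bp  # shared reference holding the backpropagator

    def _extend(self, step):
        # Wrap the current backpropagator: run this op's update first,
        # then the updates of all earlier operations.
        previous = self.bp.value

        def backprop():
            step()
            previous()

        self.bp.value = backprop

    def __add__(self, other):
        out = Var(self.value + other.value, self.bp)

        def step():
            self.grad.value += out.grad.value
            other.grad.value += out.grad.value

        self._extend(step)
        return out

    def __mul__(self, other):
        out = Var(self.value * other.value, self.bp)

        def step():
            self.grad.value += other.value * out.grad.value
            other.grad.value += self.value * out.grad.value

        self._extend(step)
        return out


def grad(f):
    """Return a function computing (f(args), gradients w.r.t. args)."""

    def grad_f(*args):
        bp = Ref(lambda: None)  # the empty backpropagator
        lifted = [Var(a, bp) for a in args]
        out = f(*lifted)
        out.grad.value = 1.0  # seed d(out)/d(out)
        bp.value()  # run the built-up closure
        return out.value, tuple(v.grad.value for v in lifted)

    return grad_f


def twice(g):
    """A higher-order function: apply g two times."""
    return lambda v: g(g(v))


value, grads = grad(lambda x, y: x * y + x)(3.0, 4.0)
value4, grads4 = grad(lambda x: twice(lambda v: v * v)(x))(2.0)
```

Because the tracing happens while the function runs, the same code handles higher-order functions such as `twice` and ordinary Python control flow without any change to `grad`.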
Many other variants of AD, including algorithms with different complexity bounds (e.g., forward-mode AD), exist. Forward-mode AD is useful for computing the Hessian vector product, which is necessary for techniques like differentiable architecture search (DARTS) [Liu et al. 2018]. Because our AD algorithm is just another Relay pass, it is possible for users to implement and experiment with different AD techniques without changing the system. To this end, we also implemented a forward-mode AD algorithm using the traditional method of dual numbers [Baydin et al. 2015]. Both forward-mode and reverse-mode AD are higher-order and extensible: they support closures, abstract data types, control flow, and recursion. Although we have not investigated composing forward and reverse modes, it is possible to mix gradient functions because they are regular Relay functions. Because our algorithm enjoys a closure property, we can perform AD over the composition of the gradient functions.
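For comparison, the dual-number method behind forward-mode AD fits in a few lines. The sketch below is a generic textbook formulation in Python, not Relay's pass; the names `Dual` and `derivative` are illustrative, and scalars again stand in for tensors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A dual number: a primal value together with its tangent."""

    primal: float
    tangent: float

    def __add__(self, other):
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    def __mul__(self, other):
        # Product rule: (a * b)' = a' * b + a * b'
        return Dual(
            self.primal * other.primal,
            self.tangent * other.primal + self.primal * other.tangent,
        )


def derivative(f):
    """Differentiate a one-argument function by seeding the tangent with 1."""
    return lambda x: f(Dual(x, 1.0)).tangent


slope = derivative(lambda x: x * x * x + x)(2.0)
```

Unlike the reverse-mode sketch, no closure is accumulated: the derivative is carried alongside the value in a single forward evaluation, which is why the cost scales with the number of inputs rather than the number of outputs.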
Partial Evaluator
In order to handle differentiating the full IR, our AD algorithm makes use of closures and references. However, many of the programs are effectively first-order and do not require allocating references or a backpropagator closure. It is essential that we remove unnecessary uses of closures and references, as they inhibit optimizations like operator fusion. Previous approaches have used staging to manually phase computation, but this requires modifications to the language itself. A partial evaluator (PE) allows the use of high-level abstractions without limiting code that could in practice be compiled to a particular target. The benefits of partial evaluation extend not only to code generated by AD but to all of Relay. Relay's partial evaluator works by defining an interpreter where the value domain is partially static values. The partially static domain represents simple values, such as constant tensors, as themselves. The representations of aggregate values mirror their structure, enabling values which are a mixture of static and dynamic. This makes the partial evaluator more powerful than a constant-folding pass. The appendix presents an implementation of PE.
There are two important features of our partial evaluator: managing effectful computations and handling references. In order to handle effects, we keep the generated program in A-normal form to ensure effects are properly ordered and to avoid the duplication of effectful computations. The partial evaluator supports references by simulating the store at partial evaluation time. The explicit store is threaded throughout execution and is managed to achieve flow sensitivity. After evaluation we construct a new program with static subcomputations evaluated away. The reconstructed program contains all original expressions, as well as evaluated expressions, because interleaving dead-code elimination (DCE) is non-trivial. Afterwards, we separately apply DCE. The result of this entire process is illustrated in Figure 5 .
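The idea of a partially static value domain can be shown on a toy expression language. The Python sketch below is illustrative only and makes simplifying assumptions that Relay's partial evaluator does not: the language is pure, so there is no store, no A-normal form, and `let` bindings are simply inlined. The tuple-based expression encoding and the function name `partial_eval` are hypothetical.

```python
def partial_eval(expr, env=None):
    """Evaluate the static parts of expr and return a residual expression.

    Expressions are tuples: ("const", v), ("var", name), ("add", a, b),
    ("tuple", e1, ..., en), ("proj", e, index), ("let", name, value, body).
    """
    env = env or {}
    tag = expr[0]
    if tag == "const":
        return expr
    if tag == "var":
        # Unbound variables are dynamic and stay in the residual program.
        return env.get(expr[1], expr)
    if tag == "add":
        a = partial_eval(expr[1], env)
        b = partial_eval(expr[2], env)
        if a[0] == "const" and b[0] == "const":
            return ("const", a[1] + b[1])
        return ("add", a, b)
    if tag == "tuple":
        # The aggregate is static even when some members are dynamic.
        return ("tuple",) + tuple(partial_eval(e, env) for e in expr[1:])
    if tag == "proj":
        target = partial_eval(expr[1], env)
        if target[0] == "tuple":
            return target[1 + expr[2]]  # resolved at partial evaluation time
        return ("proj", target, expr[2])
    if tag == "let":
        _, name, value, body = expr
        # Inlining is only safe here because this toy language has no effects.
        return partial_eval(body, {**env, name: partial_eval(value, env)})
    raise ValueError(f"unknown expression tag: {tag}")


program = (
    "let",
    "p",
    ("tuple", ("var", "x"), ("add", ("const", 1), ("const", 2))),
    ("add", ("proj", ("var", "p"), 1), ("proj", ("var", "p"), 0)),
)
residual = partial_eval(program)
```

A constant-folding pass alone would fold `1 + 2` but leave the tuple and both projections in place, since the tuple contains the dynamic variable `x`. Tracking the tuple's structure lets the projections disappear entirely.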
Operator Fusion
Operator fusion is an indispensable optimization in deep learning compilers. Operators are realized as sets of loop nests that may consume and produce a number of tensors. Performing loop fusion enables better sharing of computation and reduction in allocations, memory consumption, and the number of intermediates that must be manifested in memory. Many frameworks perform operator fusion, but common approaches have several limitations. In one approach operator fusion works over fixed pairs of operators, requiring implementations that are fused and tuned by hand for each pair. An alternative approach is to perform limited code generation for a specific class of operators such as elementwise operations, which requires a code generation template as in TensorFlow XLA. These designs do not provide user-defined operations the same optimizations as vendor-provided operators. For example, if a fused implementation does not exist in CuDNN, a GPU operator will remain unfused. By contrast, Relay's algorithm leverages our low-level operator IR to perform fusion over an arbitrary sequence of operators, with arbitrary shapes and data types. Importantly, user-defined operators are no different from our standard operators, both of which are defined using TVM. We can generate new fused operations for operators that are eligible for fusion. Relay is not just limited to pairwise fusion; it can fuse arbitrary chains of operators including ones with multiple outputs and non-linear consumer/producer patterns.
Fig. 5. Example of running the compiler pass pipeline for AD on the identity function. First, we run the base AD pass on the original function (described in Section 4.2). Then, we run the partial evaluator, which primarily optimizes away the reads and calls in %x2 and %x3 in post-AD. Since it conservatively determines whether a subexpression is effectful, it generates many bindings which are dead code. At this point, we run the dead code elimination pass to crunch the code back down.
The fusion algorithm works in two passes, described below.
4.4.1 Extraction. First, Relay identifies subexpressions that contain sequences of operator invocations that can be fused. Relay then factors a sequence of operator invocations into a function that it marks as primitive. Relay identifies operations eligible for fusion by constructing a directed acyclic graph (DAG) representing data flow between operators. It then builds a post-dominator tree from the DAG. The tree construction is straightforward due to the lack of cycles. Relay groups subexpressions and their dependencies into equivalence classes by their immediate post-dominator. Relay then constructs a final expression from each equivalence class, collects the expression's free variables, converts it to a function, and marks it as primitive.
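The post-dominator step can be sketched as follows. This Python fragment is a simplified illustration, not Relay's fusion pass: it computes immediate post-dominators on a data-flow DAG by taking the lowest common ancestor of each node's consumers, then groups nodes by that result. It assumes a virtual sink node, ignores which operator kinds are actually fusible, and the names `immediate_post_dominators` and `SINK` are hypothetical.

```python
from collections import defaultdict

SINK = "<sink>"  # virtual node that post-dominates every graph output


def immediate_post_dominators(consumers, order):
    """consumers: node -> nodes that read its output.
    order: all nodes in topological order (producers before consumers)."""
    ipdom = {}
    depth = {SINK: 0}

    def lowest_common_ancestor(a, b):
        # Walk the deeper node upward until both paths meet.
        while a != b:
            if depth[a] < depth[b]:
                a, b = b, a
            a = ipdom[a]
        return a

    # Visit consumers before producers so their tree positions are known.
    for node in reversed(order):
        users = consumers.get(node, [])
        parent = users[0] if users else SINK
        for user in users[1:]:
            parent = lowest_common_ancestor(parent, user)
        ipdom[node] = parent
        depth[node] = depth[parent] + 1
    return ipdom


# A diamond: conv feeds relu and sigmoid, whose results are added.
order = ["conv", "relu", "sigmoid", "add"]
consumers = {"conv": ["relu", "sigmoid"], "relu": ["add"], "sigmoid": ["add"]}
ipdom = immediate_post_dominators(consumers, order)

groups = defaultdict(list)
for node in order:
    groups[ipdom[node]].append(node)
```

In the diamond, every path from `conv` to the output passes through `add`, so `conv`, `relu`, and `sigmoid` all share `add` as their immediate post-dominator and land in one candidate group.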
4.4.2 Lowering. In the second step, the compiler converts a primitive function into low-level code that can then be compiled for all supported platforms. It achieves this by combining the computation of each operation into a new TVM expression representing the fused operator. Relay then computes a master schedule based on the set of operations being fused. By combining the master schedule with the fused computation, Relay is able to produce an optimized version of the operator for any platform supported by TVM. A challenging case handled by our fusion algorithm is correctly fusing operator graphs with diamond-shaped branches.
Generic Quantization
The memory and data types required by deep learning introduce difficulties when deploying neural networks to resource-limited devices, such as mobile, IoT, and other edge devices. An emerging area in deep learning is performing training and inference on non-standard numeric types to improve throughput and memory usage. For example, a single neural network may have more than one million floating-point values as its parameters. The sheer quantity of parameters and their datatypes may limit the ability to execute these networks on hardware accelerators. Accelerators often support fixed-point or other non-standard datatypes at lower precision. In order to target these devices, Relay must map the computation to the appropriate domain. State-of-the-art work on quantization suggests that there exist a number of tradeoffs between different quantization techniques, with the best often determined by platform and model type [Krishnamoorthi 2018]. Current deep learning frameworks support a limited number of quantization schemes and options because quantization requires framework support in the form of custom platform-specific operators. Importantly, there are many different choices of quantization mechanisms. Each type of quantization has different running time and accuracy properties depending on the model as well as the target hardware. Existing frameworks manually choose a fixed quantized data format, which might be suboptimal. Instead, Relay includes a generic, compiler-based quantization flow that supports a diverse set of quantization mechanisms and automatically generates code for each of them. Relay's generalizable and flexible quantization workflow can support customization in both standard devices and acceleration schema and address various constraints across different hardware platforms. The pipeline that we designed can compress and accelerate neural networks with low-precision quantization to enable running deep learning models on edge devices.
The generic quantization flow proceeds in three steps: annotation, calibration, and realization. We can apply this pass to convolution-like operators which have a quantized schedule available. Figure 8 shows a graphical visualization for this.
Annotate. Annotation rewrites the graph and inserts simulated quantization operations according to the rewrite function of each operator. The simulated quantize operation simulates the rounding and saturation error of quantizing from float to, for example, 8-bit integer. However, it is computed with a 32-bit float data type, which is convenient for the calibration pass and debugging. See the definition of simulated quantize, as in Figure 8. We annotate the inputs to this operation with an operator simQ, which simulates the effect of quantization (for example, from a 32-bit floating point value to an 8-bit integer value). simQ has a set of parameters that must then be calibrated in order to correctly quantize the graph, namely the bits, the scale, and the range. Finally, after the algorithm has selected appropriate settings for these parameters, it applies realization, which transforms the simulated quantization operator into real low-precision operations.
Calibrate. The simulated quantized operations have a set of parameters which must be calibrated in order to correctly quantize the graph to achieve the minimal decrease in accuracy.

$$Q(x, r, \mathit{bit}, \mathit{sign}) = \mathrm{cast}\left(\mathrm{clip}\left(\mathrm{round}\left(\frac{x}{r} \cdot 2^{\mathit{bit} - \mathit{sign}}\right)\right), \mathrm{int8}\right) \tag{1}$$

Fig. 6. The quantization operation.

$$\mathrm{simQ}(\mathit{bits}, \mathit{sign}, \mathit{range}) = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{r} \cdot 2^{\mathit{bit} - \mathit{sign}}\right)\right) \cdot \frac{r}{2^{\mathit{bit} - \mathit{sign}}} \tag{2}$$

Fig. 7. The simulated quantization operation.

Fig. 9. An example of overloading the annotation function for 2-d convolution. In this example we treat both input, and the weights as unsigned integers, applying rounding to the input, and stochastic rounding to the weights.
Realize. The realization pass transforms the simulated quantized graph (which uses 32-bit floats) into a real low-precision computation graph. The simulated quantized operator is transformed into several fine-grained operations like multiplication and addition.
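The quantization and simulated quantization operations of Figures 6 and 7 can be written out directly. The Python sketch below works on scalars rather than tensors and makes two assumptions that the text does not state: the clip range is taken to be the representable range of a `bit`-wide integer (signed when `sign` is 1, unsigned when it is 0), and rounding uses Python's built-in `round`, which rounds halves to even. The function names are illustrative.

```python
def clip(value, low, high):
    """Saturate value to the closed interval [low, high]."""
    return max(low, min(high, value))


def quantize(x, r, bit=8, sign=1):
    """Map a real x with range r to an integer, as in Equation (1)."""
    scale = 2 ** (bit - sign)
    # Assumed clip range: the values a bit-wide integer can hold.
    low, high = (-scale, scale - 1) if sign else (0, scale - 1)
    return int(clip(round(x / r * scale), low, high))


def sim_quantize(x, r, bit=8, sign=1):
    """Quantize and map back to a float, as in Equation (2)."""
    scale = 2 ** (bit - sign)
    return quantize(x, r, bit, sign) * r / scale


exact = sim_quantize(0.5, 1.0)      # representable, no error
rounded = sim_quantize(0.3, 1.0)    # rounding error
saturated = sim_quantize(2.0, 1.0)  # saturation error
```

The simulated operation stays in floating point but reproduces both error sources of the real one, which is what makes it usable during calibration: the calibrator can compare `x` against `sim_quantize(x, r)` for candidate values of `r` without leaving the float domain.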
Developers can customize quantization with very little code. For example, the quantization annotation function may be overloaded for any operation, making use of signed or unsigned integers, or different rounding strategies, such as floor, ceiling, or even stochastic rounding (Figure 9) .
To demonstrate the effectiveness of our generic quantization, we use Relay to explore different choices of input and accumulation bits. The results are shown in Table 2. We find that no single strategy fits all use cases. For certain network architectures such as MobileNet and ResNet, 16-bit to 32-bit quantization provides good accuracy, but 8-bit to 16-bit provides the best speedup, assuming the input does not overflow. The current framework demonstrates the importance of applying various quantization schemes based on networks and hardware platforms.
Additional Optimizations
This subsection focuses on a subset of optimizations necessary to compile Relay to deep learning accelerators, as well as additional optimizations. Deep learning accelerators have restricted computing models and cannot directly execute full Relay programs. For example many do not support executing unbounded loops, and require some computation to be scheduled on a general-purpose host device such as the CPU.
FoldScaleAxis is an optimization that removes scaling operations that occur before or after convolution-like operators. It does this by moving the scaling across other operators until the scaling is directly applied to a weight. The scaling factor can then be directly folded into the weights of the network. This optimization is required for certain accelerators, such as VTA [Moreau et al. 2018] , that lack scalar multipliers. In order to target VTA we must eliminate all scalar operations.
CombineParallelConv2d is a specialized optimization that fuses multiple convolutions that share the same input. The goal of this pass is to produce a larger kernel for the GPU, as each kernel launch on the GPU has overhead. It was designed with the Inception network [Szegedy et al. 2015] in mind, which contains blocks of convolutions that share the same input.
The entire CombineParallelConv2d pass, including documentation and tests, required fewer than 350 lines of code and was contributed by a non-Relay-affiliated undergraduate student in their first contribution to our codebase.
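The transformation that CombineParallelConv2d performs can be illustrated with dense layers standing in for convolutions. In this hedged Python sketch, operators that share an input are merged by concatenating their weights along the output axis, run as one larger operator, and their outputs are split apart afterward. The helper names are hypothetical and the code does not reflect the pass's actual implementation.

```python
def dense(weight, x):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def combine_parallel(weights):
    """Concatenate weights along the output axis; remember each size."""
    combined = [row for weight in weights for row in weight]
    sizes = [len(weight) for weight in weights]
    return combined, sizes


def split_outputs(outputs, sizes):
    """Slice the combined output back into one piece per original op."""
    parts, start = [], 0
    for size in sizes:
        parts.append(outputs[start:start + size])
        start += size
    return parts


x = [3, 4]
branch_a = [[1, 0], [0, 1]]  # two output channels
branch_b = [[2, 2]]          # one output channel

separate = [dense(branch_a, x), dense(branch_b, x)]
combined_weight, sizes = combine_parallel([branch_a, branch_b])
merged = split_outputs(dense(combined_weight, x), sizes)
```

The merged version launches one operator instead of two, which is the source of the benefit on a GPU, where each kernel launch carries a fixed overhead.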
Ahead-of-Time Compiler
Finally, we implemented an ahead-of-time (AoT) compiler for Relay to demonstrate the pass system's flexibility and make use of traditional compiler optimizations. Relay's default execution mechanism is a simple recursive AST traversal-based interpreter that uses JIT compilation to execute operators. The interpreter is useful for testing and debugging, but is not optimized for efficient execution.
The AoT compiler converts a Relay program to C++ in several steps. First, it applies several standard Relay passes, such as fusion, to produce optimized binaries for the operator calls in the program. The Relay program is converted to A-normal form and translated to C++. Operator calls are replaced by invocations of the binaries produced in the first step. The only Relay AST construct that does not have a direct analogue in C++ is the pattern matching expression, which is compiled to C++ continuations.
EVALUATION
We evaluate Relay's ability to support a wide variety of models without sacrificing composability, portability, or performance. In particular, our evaluation is composed of three parts:
(1) Relay enables composable optimizations: Relay supports composing program transformations into multiple optimization tiers.
(2) Relay provides competitive performance: Despite increasing expressiveness, Relay's performance is competitive with the state of the art on popular models.
(3) Relay handles challenging backends: Relay can compile models to execute efficiently on a variety of backends, such as FPGA accelerators, which require quantization, layout optimizations, and bit-packing transformations.
We evaluated the following vision models: Deep Q-Network (DQN), a DNN that achieved state-of-the-art performance on 49 Atari games in 2015; MobileNet, a DNN designed for image recognition on mobile and embedded devices; ResNet-18, a DNN for image recognition that achieved state-of-the-art performance on ImageNet detection tasks in 2015; VGG-16 (named for the Visual Geometry Group at Oxford), a CNN used for image recognition tasks [He et al. 2015; Howard et al. 2017; Mnih et al. 2013; Simonyan and Zisserman 2014].
We evaluated the following NLP models: CharRNN, a generator character-level RNN from a PyTorch tutorial; TreeLSTM, a generalization of LSTMs to tree-structured network topologies; RNN, GRU, and LSTM, a selection of models from the Gluon model zoo [Gluon Team 2019; Robertson 2017; Tai et al. 2015] .
Experimental Methodology
Because we only evaluate inference in this paper, we frequently make use of random inputs to models when measuring performance. There were two exceptions where we evaluated on real data because it was readily available: CharRNN and TreeLSTM.
Our vision experiments from Section 5.2 and Section 5.3 were run on a machine with an AMD Ryzen Threadripper 1950X 16-Core CPU, an NVidia 1080 Ti GPU, and 64 GB of RAM. Our NLP experiments from Section 5.3 were run on a machine with an AMD Ryzen Threadripper 1950X 16-Core CPU, an NVidia Titan-V GPU, and 64 GB of RAM. Our low-power vision experiments from Section 5.4 were run on multiple edge-class ARM development boards: a RaspberryPi 3, a Firefly RK3399, and an Ultra-96 FPGA platform.
We evaluated Relay's handling of accelerators on VTA [Moreau et al. 2018], the versatile open-source deep learning accelerator. We implemented a VTA design with a 16 × 16 matrix-vector 8-bit tensor core clocked at 333MHz on the Ultra-96 platform.
In terms of software, we used Cuda version 10.0, CuDNN version 7.5.0, TVM commit cefe07e2a, MxNet version 1.4.0, PyTorch version 1.0.1post2, and TensorFlow version 1.13.1.
The Relay vision experiments utilized aggressively tuned TVM schedules on the GTX 1080 Ti GPU, improving performance significantly.
Fig. 10. Speedup from increasing the number of graph transformations in Relay (-O1, 2, 3), relative to no optimizations at all (-O0). We show that, by composing passes, we can monotonically improve performance on vision benchmarks running on the NVIDIA GTX 1080 Ti.
Relay Enables Composable Optimizations
We demonstrate that Relay can facilitate composable optimizations, by evaluating vision workloads under incremental optimization levels, denoted -On:
• -O0 does not apply any program transformation passes.
• -O1 applies an operator fusion pass.
• -O2 additionally applies constant folding, using Relay's interpreter to evaluate away operations on constants.
• -O3 additionally applies four more passes: (1) FoldScaleAxis, which folds scaling operations into the axis options of other operators, (2) AlterOpLayout, which alters operator layouts for better cache performance, (3) CanonicalizeOps, which canonicalizes the "bias add" operator in terms of expanding dimensions and broadcasting for further analysis, (4) CommonSubexpElim, which lifts common subexpressions.
Figure 10 shows mean inference speedup relative to -O0 as Relay applies optimizations more aggressively. Average performance improves by up to 2× when all optimizations are applied. Most networks benefit greatly from operator fusion. Nature-DQN [Mnih et al. 2013] has simple operators, which don't benefit from optimizations such as layout transform, explaining why its performance doesn't improve beyond -O1. ResNet-18 [He et al. 2015] and VGG-16 [Simonyan and Zisserman 2014] are two dense convolutional neural networks which benefit from -O3 optimizations. These networks contain dense conv2d operators that benefit from the AlterOpLayout pass. Overall, these results show that Relay lets us compose optimizations in a way that is beneficial to diverse workloads.
Fig. 11. Inference slowdown of popular frameworks relative to Relay on vision benchmarks running on NVIDIA GTX 1080 Ti GPUs. Relay provides performance competitive to the state of the art. We ran 1000 trials for each model and used the AoT compiler.
Relay Provides Competitive Performance
An age-old story in the compilers literature is that increasing expressivity impacts the global performance of the system. We set out to build zero-cost abstractions for Relay, governed by Stroustrup's principle, "What you don't use, you don't pay for" [Stroustrup 2004]. We demonstrate that we can achieve competitive performance on both CPUs and GPUs on a wide set of CNNs that are well supported by existing frameworks. We evaluated inference time for two classes of workloads: computer vision and natural language processing. We compared Relay (using our AoT compiler) to NNVM, TensorFlow, TensorFlow-XLA (Accelerated Linear Algebra), PyTorch, and MxNet. We ran the vision and NLP workloads on GTX 1080 Ti and Titan-V GPUs, respectively.
Vision Evaluation. Figure 11 compares Relay against state-of-the-art frameworks running vision workloads on a GTX 1080 Ti GPU. We ran each model with batch size 1, a common setting in inference tasks. Relay achieves performance on par with NNVM, an existing deep learning graph compiler in use at Amazon. Relay outperforms TensorFlow, TensorFlow-XLA, MxNet and PyTorch on every benchmark. Relay's ability to do aggressive optimizations like operator fusion on long chains of operations, generating hardware-specific implementations, enables it to outperform existing frameworks that don't perform inter-operator optimizations.
Fig. 12. Inference slowdown relative to Relay on NLP benchmarks running on NVIDIA Titan-V GPUs. NLP workloads feature control flow, which makes them more challenging to optimize. Relay provides performance competitive to state of the art (up to 2.4× speedup over MxNet on GRU). We ran 1000 trials for each model, except for CharRNN, on which we used 100 trials.
NLP Evaluation. Figure 12 compares Relay against state-of-the-art NLP models on a Titan-V GPU. Implementations of the NLP models were not available in all frameworks; we used MxNet baselines for RNN, GRU, and LSTM and PyTorch for CharRNN and TreeLSTM. Relay performs better than MxNet on recursive models because the MxNet baselines are implemented in Python using MxNet's looping constructs. PyTorch instead uses handwritten and heavily optimized C implementations of the recursive network cells. Due to this we perform slightly worse than PyTorch. It is interesting to note that our pure Relay implementation performs competitively against the hand-optimized version.
Fig. 13. Inference time (ms) of vision DNNs on low-power platforms using different data types. Relay allows us to reduce inference time on power-constrained devices by easily substituting float32 multiplications with int8 multiplications and int16 or int32 accumulations (denoted as int8/int16 and int8/int32 respectively). We used 1000 trials for each model.
Relay Handles Challenging Backends
Relay can handle challenging scenarios: consider edge inference where energy is a first order constraint due to thermal limitations or limited battery life. One option is to apply more aggressive quantization: instead of performing expensive arithmetic in the floating point domain, simpler and narrower fixed point data is used. Another option is hardware acceleration: instead of evaluating compute-intensive operations on the CPU, we can offload to a specialized accelerator.
Quantized Inference on ARM CPUs and GPUs. We evaluate the effects of quantized inference applied by Relay on vision workloads running on the Raspberry Pi 3 and Firefly RK3399 ARM-based platforms. Figure 13 shows the effects of different levels of quantization applied to low-power devices (the details of how quantization is implemented and how it affects classification accuracy are described in Section 4.5). The numbers show that as we opt for a more aggressive quantization scheme such as int8/int16 (i.e., 8-bit multiplication and 16-bit accumulation), we achieve much improved performance.
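The tradeoff between accumulator widths can be made concrete with a small simulation. The Python sketch below models an int8 dot product whose accumulator is 16 or 32 bits wide. It assumes two's-complement wraparound on overflow, which is an illustrative choice (real hardware may saturate instead), and the function names are hypothetical.

```python
def wrap(value, bits):
    """Reduce value to a signed two's-complement integer of the given width."""
    half = 1 << (bits - 1)
    return (value + half) % (1 << bits) - half


def dot_int8(a, b, acc_bits):
    """Dot product of int8 vectors with an acc_bits-wide accumulator."""
    acc = 0
    for x, y in zip(a, b):
        if not (-128 <= x <= 127 and -128 <= y <= 127):
            raise ValueError("inputs must fit in int8")
        acc = wrap(acc + x * y, acc_bits)
    return acc


a = [127, 127, 127, 127]
b = [127, 127, 127, 127]
narrow = dot_int8(a, b, acc_bits=16)  # overflows the int16 accumulator
wide = dot_int8(a, b, acc_bits=32)    # the exact result
```

With only four large products the 16-bit accumulator already produces a wrong value, while the 32-bit accumulator is exact. This is why the narrower scheme is faster but only safe when the inputs are known not to overflow.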
Targeting Deep Learning Accelerators on FPGAs. We evaluated inference time on five models including MobileNet-G [Howard et al. 2017], a grouped variant of the MobileNet architecture; ResNet-18, ResNet-34, and ResNet-50 [He et al. 2015]; and Deep Convolutional Generative Adversarial Networks [Radford et al. 2015], a generative DNN used in unsupervised learning. Overall, Relay helps us efficiently offload deep learning operators onto specialized accelerators like VTA. Our results in Figure 14 show that we can achieve between 2.5× and 11.7× reduction in single-batch inference latency by offloading critical operators to the FPGA accelerator. These experiments demonstrate Relay's ability to target current and future deep learning architectures:
(1) Heterogeneous FPGA/CPU offloading: Relay lets us define the rules for offloading specific operators to the FPGA-based accelerator.
(2) Push-button quantization: Relay can take a fp32 model and convert its parameters to int8 in order to enable inference on specialized accelerators.
(3) Accelerator-friendly data packing: Relay reorganizes data so it can be effortlessly consumed by a specialized TPU-like accelerator [Jouppi et al. 2017].
RELATED WORK
In this section we will focus on three areas of related work: frameworks with computation graphbased IRs, libraries for low-level code generation of tensor functions, and prior work in programming languages.
Frameworks. Computation graph representations for neural networks tend to be categorized as either static or dynamic graphs, which are referred to as define-and-run and define-by-run in the literature, respectively.
Technology companies such as Facebook and Google have large development teams dedicated to deep learning. Facebook is developing an ML stack comprised of many projects including PyTorch, Glow [Rotem et al. 2018] , and Caffe2 [Caffe2 . The Glow compiler [Rotem et al. 2018 ] is similar to NNVM and is intended to be a compiler for high-level computation graphs, for hardware accelerators. XLA [XLA Team 2017] , within Google's TensorFlow, is similar to the complete Relay stack. It provides a low-level IR for TensorFlow. TensorFlow [Abadi et al. 2016] , which is the most popular DL framework [Hale 2018] , supports static graphs. TensorFlow represents a neural network using a dataflow graph of primitive operators extended with restricted control edges. Previous work has explored the semantics of dataflow graph representation [Abadi et al. 2017 ]. TensorFlow's representation is sufficient for many state-of-the-art models, is easily ported to heterogeneous hardware back-ends, and allows for reverse-mode automatic differentiation [Abadi et al. 2016; Baydin et al. 2015] .
Unfortunately, static graph-based programming models have some limitations. In particular, unmodified TensorFlow cannot support building models where the shape of the computation graph is dependent on the input. TensorFlow Fold addresses this particular problem [Looks et al. 2017a]. Modifying frameworks in this manner is a considerable engineering effort and does not scale.
Dynamic frameworks such as PyTorch [Paszke et al. 2017 ], Gluon [glu 2018], Chainer [Tokui et al. 2015] , and TensorFlow eager-mode [Shankar and Dobson 2017] are attempts to alleviate the challenges of static graph representations. In PyTorch the Python interpreter is responsible for control flow, constructing dataflow graphs as a side effect when it reaches tensor operations. These graphs enable automatic differentiation and execution on external devices. However, dynamic frameworks must re-optimize when the graph topology changes, costing CPU cycles and incurring communication overhead between the host machine and accelerators.
Low-Level Code Generation. Relay does not directly approach the problem of generating efficient low-level code for tensor operations, instead focusing on optimizing the code between operators and providing higher-level abstractions. For producing efficiently compiled operators, Relay relies on the TVM [Chen et al. 2018a] compiler stack. Recent research on the TVM stack is focused on producing efficient operators (dense linear algebra kernels), such as generalized matrix multiplication (GEMM) and convolutions. Relay could use other high-performance compilers for kernels such as Halide [Ragan-Kelley et al. 2013], from which TVM derived its IR and optimization model. Tensor Comprehensions (TC) shares common goals with the TVM framework, but achieves its goal through different techniques, such as polyhedral compilation rather than algorithmic schedules. TC could replace TVM in Relay's compilation.
Programming Languages Techniques in DL. Relay is not the first attempt to apply programming language techniques to machine learning or related problems such as automatic differentiation. Lantern [Wang et al. 2018] is a deep learning DSL in Scala that uses lightweight modular staging (LMS) to lower code into C++. LMS takes a graph as input from the user and converts it to an AST representation, similar to Relay's graph mode. LMS removes unnecessary abstractions. Going one step further than Lantern, Relay supports accelerators.
Flux.jl [Innes 2018b] is a DL library written in Julia [JuliaLang], a numerical computing language. Like Relay, Flux.jl adds shapes to the type system; however, unlike Relay, it uses a mixture of compile-time inference and runtime checks to enforce shape constraints [Innes et al. 2017] and cannot perform platform-specific optimizations. Unlike Flux.jl, which is tightly coupled with Julia, Relay is language agnostic and decouples frontend and IR considerations. Relay's features, such as type inference and deployment to multiple back-ends, can be easily reused by frameworks in arbitrary languages.
Additionally, previous work in higher-order differentiation from the PL community has informed Relay's design. In particular, we draw inspiration from various implementations of automatic differentiation [Baydin et al. 2015; Elliott 2009; Kmett et al. 2008; Pearlmutter and Siskind 2008; ThoughtWorks Inc. 2018a,b; Wang and Pothen 2017], with particular attention to techniques that can compute higher-order gradients of higher-order programs. Zygote.jl [Innes 2018a], like Relay, uses source code transformations to implement automatic differentiation.
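To make the notion of higher-order gradients of higher-order programs concrete, the following is a minimal sketch using forward-mode dual numbers. This is a swapped-in technique chosen for brevity: it is not Relay's algorithm, which is reverse-mode and implemented as a transformation over the Relay AST. The names `Dual`, `lift`, and `derivative` are hypothetical. Because `derivative` takes a function and returns a function, it can be applied to its own output to obtain a second derivative.

```python
class Dual:
    """A value paired with its derivative (tangent). Hypothetical helper type."""

    def __init__(self, primal, tangent):
        self.primal = primal
        self.tangent = tangent

    def __add__(self, other):
        other = lift(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = lift(other)
        # Product rule: (u * v)' = u * v' + u' * v
        return Dual(
            self.primal * other.primal,
            self.primal * other.tangent + self.tangent * other.primal,
        )

    __rmul__ = __mul__


def lift(v):
    # Treat plain numbers as constants, i.e. with zero tangent.
    return v if isinstance(v, Dual) else Dual(v, 0.0)


def derivative(f):
    # Higher-order function: maps f to a function computing f'.
    def df(x):
        return lift(f(Dual(x, 1.0))).tangent

    return df


def cube(x):
    return x * x * x
```

Under these assumptions, `derivative(cube)` computes 3x^2 and `derivative(derivative(cube))` computes 6x. This naive nesting is only a sketch: it does not tag perturbations, so it can give wrong answers when a differentiated function closes over a variable from an enclosing differentiation.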
A APPENDIX
The appendix is focused on providing complete references to the grammar, operational semantics, type rules, and partial evaluation for Relay, which are too large to include in the main text.
[Figure residue: Relay grammar (REAL r ::= ℝ; NAT n ::= ℕ; NAME name ::= ...) and operational semantics rules (e.g., Semantics-If-True).]
[Rule fragment: match(s) { | C_1(a_1, ..., a_n) => e_1 | ... | C_i(a_1, ..., a_n) => E | ... | C_m(a_1, ..., a_n) => e_m } ⇒ Γ; Δ″ : V]
Also omitted, for simplicity, are the effects of the type arguments in a function call; any such effects would be implemented by relations on operators and thus would be opaque to all other constructs in Relay other than resulting in a different value returned by an operator call. Note that since automatic differentiation is implemented as a macro over a Relay AST, the gradient expression has no semantics of its own beyond expanding the macro.
Reference types are only generated internally by reverse-mode automatic differentiation and cannot be given in front-end user code. Relations cannot be defined in front-end user code either, and instead must be implemented and registered with operations. For simplicity, the rule for ADT definitions assumes that each ADT constructor takes the same n arguments (whereas each constructor may take any number of arguments, so long as they are all of kind Type) and the rule for function types assumes that each relation R takes the same n arguments, whereas a relation may take any subset of the set of all type arguments, argument types, and the output type as arguments (so long as they are all of kind Type). Note that the kind ADT corresponds to an ADT name (implemented as a global type variable); any ADT instance must instantiate all type parameters in the ADT definition, hence the use of a type call for giving a type to ADT instances (that is, an ADT definition is treated as a type-level function).
[Rule fragment: | C_1 T_{m+1}, ..., T_{m+k} (a_1, ..., a_n) => e_1 ... | C_m T_{m+1}, ..., T_{m+k} (a_1, ..., a_n) => e_m } : T′]
Fig. 14. Rules for deriving types of expressions and definitions. The unit type, (), is syntactic sugar for a product type with zero members.
For simplicity, the arguments to ADT constructors have been omitted in the descriptions of the ADTs and each constructor is assumed to take the same n argument types. Similarly, the names of the match parameters in the rule for match expressions are assumed to be the same in each branch; Relay also supports more sophisticated pattern-matching in match expressions, but these rules omit it for simplicity. Type arguments and relations are omitted in the rule for gradient, as the present implementation of AD does not support them. Note that ADT constructors are given ordinary function types and can thus be called according to the same rules as any other function.
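The last point, that ADT constructors behave as ordinary functions and that match selects a clause by constructor, can be sketched in Python. This is an illustrative model only, not Relay's implementation: `ConstructorValue`, `make_constructor`, `match`, and the list constructors `Cons` and `Nil` are hypothetical names, and the sketch covers only flat constructor patterns, not the more sophisticated pattern-matching mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstructorValue:
    """Runtime value of an ADT instance: a constructor tag plus its fields."""

    tag: str
    fields: tuple


def make_constructor(tag, arity):
    # A constructor is an ordinary function from its arguments to a tagged value.
    def construct(*args):
        if len(args) != arity:
            raise TypeError(f"{tag} expects {arity} arguments, got {len(args)}")
        return ConstructorValue(tag, args)

    return construct


# Hypothetical list ADT with two constructors.
Cons = make_constructor("Cons", 2)
Nil = make_constructor("Nil", 0)


def match(value, clauses):
    # Try clauses in order; bind the constructor's fields to the clause body.
    for tag, body in clauses:
        if value.tag == tag:
            return body(*value.fields)
    raise ValueError(f"no clause matched constructor {value.tag}")


def length(lst):
    return match(
        lst,
        [
            ("Nil", lambda: 0),
            ("Cons", lambda head, tail: 1 + length(tail)),
        ],
    )
```

Because `Cons` is a plain function in this sketch, it can be passed to higher-order functions like any other, for example `list(map(lambda h: Cons(h, Nil()), [1, 2]))`.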
