FusionStitching: Deep Fusion and Code Generation for Tensorflow
  Computations on GPUs by Long, Guoping et al.
arXiv:1811.05213v1 [cs.DC] 13 Nov 2018
FUSIONSTITCHING: DEEP FUSION AND CODE GENERATION FOR
TENSORFLOW COMPUTATIONS ON GPUS
Guoping Long 1 Jun Yang 1 Kai Zhu 1 Wei Lin 1
ABSTRACT
In recent years, there has been a surge of machine learning applications in industry. Many of them are based on popular
AI frameworks such as Tensorflow, Torch, Caffe, or MxNet, and are empowered by accelerator platforms such
as GPUs. One important challenge of running Tensorflow computations on GPUs is the fine granularity problem:
the FLOPS of individual ops are far from enough to fully exploit the computing power of the underlying
accelerators. The XLA framework provides a solid foundation to explore this problem further. In this paper, we
propose FusionStitching, a novel, comprehensive op fusion and code generation system that stitches computations
into large GPU kernels. Experimental results on four public models and two of our large in-house applications
show a further 55% (geometric mean) reduction in GPU kernel launches compared to the XLA fusion baseline.
This increases the E2E performance of both of our latency-critical in-house applications by up to 20%.
1 INTRODUCTION
In recent years, industry has seen a boom of machine learning
applications in a diverse range of scenarios, including
images, speech/audio, NLP, CTR prediction, and search
and recommender systems built on commodity graphs
at the scale of billions or even trillions. Such workloads
are generally regular in computation, and benefit a lot
from modern high performance accelerators like GPUs or
TPUs. In addition, many such models are based on
popular AI frameworks like Tensorflow (Abadi et al., 2015),
Caffe (Jia et al., 2014), Torch (tor, 2015), CNTK (Yu et al.,
2014), or MxNet (Chen et al., 2015). The challenge is how
to optimize such workloads to achieve as much performance
as possible on modern hardware.
There are roughly two categories of computations in AI
workloads. One is enabled by optimized vendor libraries,
in particular MatMuls or 2D/3D convolutions and their variants.
The other category includes foundational tensor operators,
elementwise computations, memory layout transformations,
and other workload specific ones. In order to
understand the relative importance of MatMul/Convolution
computations on various models, we collected data from
53,470 models on PAI (Platform for Artificial Intelligence)
at Alibaba. Depending on application domains and hardware
platforms, their share ranges from 40% to 70%.
Therefore, computations other than MatMul/Conv deserve
serious investigation to achieve decent performance. This
work focuses on the computation efficiency of this category on
GPU platforms.

1 Alibaba Inc. Correspondence to: Guoping Long
<guopinglong.lgp@alibaba-inc.com>.
Copyright 2019 by the authors

Figure 1. Memory Footprint Distribution of Most Popular Ops
(x-axis: memory footprint, log2; y-axis: op percentile; series:
Mul, Add, Sigmoid, Conv2D, MatMul, Reduce)
A well known challenge of Tensorflow computations is op
granularity. Figure 1 shows the cumulative percentile
distribution of memory footprints of the six most frequent
computing ops of the 53,470 models collected from PAI. Note
that the reduce (orange) line denotes the collective statistics
of four ops, namely mean, sum, min, and max, while
other lines represent individual ops. The x-axis shows the
memory IO footprint size (in number of floats) on a
logarithmic scale (base 2); larger is better. As can
be seen, while memory footprints of MatMul/Conv2D
are generally larger than elementwise/reduce ones, most op
instances have small memory footprints. Therefore,
optimizations are vital to fully leverage computing resources
and memory bandwidth.
One way to solve this problem is to replace those
fine grained ops with a pre-implemented coarse grained
version (PAI, 2018). However, this approach is not scalable
and is incapable of adapting to diversified and fast evolving
workload characteristics. The more principled solution
is compiler driven op fusion, tuning and code generation.
One possibility is to separate the schedule specification
from its implementation. By offloading the optimization
plan specification to the user, the runtime focuses
on details of optimization realization. Prior explorations
have shown promising results in some application domains,
such as Halide on image processing (Kelley et al., 2018),
and loopy (Klöckner, 2014) and TVM (Chen et al., 2018) on
array and machine learning computations. To make this layered
approach work in practice, we believe performance
insight into the target workload domain is essential. With
such insights, and given a properly defined workload model
and target platform, it is possible to go a step further by
allowing the user to specify computation only, rather than
implementation schedules, as is the case in Tensorflow.
In Tensorflow, the fast evolving XLA framework provides
a sound foundation to explore this problem further. XLA’s
approach of partitioning a Tensorflow graph into compilable
clusters, and transforming them into concise and
compact HloModules, opens up broad possibilities to fuse,
transform, and optimize computation kernels for GPUs.
The current XLA op fusion algorithm, either GpuInstructionFusion
or MultiOutputFusion, relies on a set of static
ShouldFuse rules in order to produce supposedly profitable
larger kernels. While these rules discern fusion opportunities
in many cases, they are often compromised by exceptions,
such as expensive elementwise ops, column reductions,
batched matmuls, or memory layout transposes. In
addition, at the code generation phase, XLA requires all
ops in the fused computation to fit into a single parallel loop
emitter, and leverages the elemental IR emitter to compose
computations of previous ops into the root loop body. In
this approach, all ops in the fused computation in essence
must share the same implementation schedule. We call this
approach thread composition, as shown in Figure 2(a).
This work proposes FusionStitching, a deep fusion and
code generation system. One key feature of our system is
block composition at the codegen phase, as shown in
Figure 2(b). To support this, we propose another IR emitter,
IrEmitterStitched, to stitch multiple computations together.
In theory, we allow each computation to have its own
parallel loop emitter, and use on chip shared memory
(scratchpad) as an intermediary between producing and consuming
computations. In practice, however, due to the scarcity of
shared memory space, we leverage thread composition
together with block composition.

Figure 2. Computation Composition: (a) Thread Composition,
(b) Block Composition (labels: Shared Memory; four Computation
boxes; Parallel Loop Schedule)
This opens up further optimization tradeoffs and fusion
opportunities. At the op fusion phase, thanks to the codegen
capability enabled by block composition, we can further
relax the constraints imposed by XLA, thereby enabling
much larger kernel granularity. In addition, to cope with the
much bigger implementation space of larger kernels, a
systematic approach is desirable to specify, optimize and tune
kernel implementations. Specifically, we make the following
contributions:
- We propose a novel deep fusion algorithm based on a
layered node structure along the span (critical path)
of the graph. Using critical path reduction as the driving
heuristic, we consider not only producer/consumer
fusion opportunities, but also fine granularity ops that
occur in the same layer, in order to enlarge kernel
granularity and reduce GPU launch overheads.
- We propose a comprehensive mechanism to spec-
ify implementation schedule space, resolve schedule
constraints, tune the search space and generate the
final implementation plan, represented as a set of
launch/schedule parameters.
- We propose IrEmitterStitched, another ir emitter to
support block level computation composition. The
core part of this ir emitter is a shared memory plan-
ning algorithm which orchestrates the code generation
of individual ops within the fused computation.
This paper is organized as follows. Section 2 discusses
motivation and overview of FusionStitching. Section 3
presents our deep fusion algorithm. Section 4 discusses
schedule specification, tuning and determination of opti-
mized implementation plans. Section 5 presents details of
code generation and the shared memory management al-
gorithm. Section 6 shows experimental results. Section 7
discusses related works and Section 8 concludes this work.
Figure 3. A Motivating Example
2 MOTIVATION AND SYSTEM OVERVIEW
2.1 The Motivating Example
We consider the fusion and code generation problem of gen-
eral sub-graphs that consist of four types of ops: (1) Ele-
mentwise; (2) Shape modulation ops, such as Reshape, Bit-
cast, Transpose, etc; (3) Reduction; (4) BatchMatMul. We
include BatchMatMul because in some of our critical pro-
duction models, such ops usually involve workload specific
shapes, and cuBLAS kernels do not deliver satisfactory per-
formance.
Figure 3 shows a motivating example. We arrange ops in a
layered structure (denoted as circled numbers), with layer 9
being input ops (top), and layer 0 being the output
(BatchMatMul (Dot.1) in this case). In complex graphs, such a
layered structure proves to be very useful for fusion decision
making. Black arrows (from left to right) show data
dependencies. Next to each op, annotations
show important schedule/code generation attributes. The
(SplitDim, Sword) pairs are schedule parameters used for
tuning the implementation space (Section 4). One use of
this pair of parameters is to decide Blocks, the number
of thread blocks (CTAs) used for computing this op. The
Shared attributes associated with reduce, exponential, and
divide are related to on chip shared memory management.
Together with the Size attribute, ALLOC or SHARE means
that we need to allocate space for the current op, or reuse a
buffer allocated for a previous op, respectively.
Our system provides the capability to fuse and generate op-
timized code for the entire graph. Whether it’s beneficial
to fuse dot depends on workloads. In some of our inhouse
workloads, the batched dot shape is too marginal to get any
benefit calling cuBLAS, and this pattern happens to be the
core part of an inner loop body. In this case, fusing every-
thing proves to be very useful. In general, we leave the
decision of whether to fuse BatchMatMul to the user.
2.2 FusionStitching: The System Overview
The system overview is shown in Figure 4. On the high
level, the system takes a HloModule as the input, passes
three stages of processing (op fusion, schedule planning
and code generation), and finally generates the LLVM IR.
In the computation fusion stage, we first perform a
Work/Span (critical path) analysis (Guy, 1996), and assign
a layer number (as shown in Figure 3) to each op according
to its depth in the span. Then, starting from the root (such
as Dot.1 in Figure 3), we fuse ops iteratively across different
span layers, as long as the fusion decision passes the
schedule consistency check. The fusion process iterates until
no fusion opportunity is available. Then the transformed
HloModule is passed on to schedule planning.
The implementation space for a large fused computation
can be huge. Schedule planning searches a domain driven,
well defined schedule space for optimized implementa-
tions. It takes a fusion plan as input, and generates op-
timized schedule parameters, shared memory usage plans
and launch dimensions for the following code generation
phase. It also provides performance heuristics regarding
current fusion plan as feedback information to Schedule-
ConsistencyChecker. The scheduling process involves four
important submodules: schedule generation, performance
library, tuning and shared memory planning, which will be
presented in detail in section 4 and section 5.
Based on schedule parameters, the code generation pass
finally allocates shared memory, sets the kernel launch
dimensions, and emits LLVM IR code for each op according to its
tuned parameters. Note that the use of shared memory as
an intermediary among ops is important to achieve decent
performance. One example is shown in Figure 3. Without
shared memory, expensive ops like exponential and divide
can only be composed through thread composition. In this
case, the computation of these ops is nested into the inner
loop of the root dot op, causing notable performance
degradation due to duplicated computation. With shared
memory, threads within a thread block can cooperate
differently for different ops, which opens up more optimization
opportunities (Section 5).

Figure 4. The System Overview (labels: Computation Fusion,
Schedule Planning, Code Generation; Work/Span Analysis, Deep
Fusion, Schedule Consistency Checker, Schedule Generation,
Performance Tuning, Performance Library, Shared Memory Planning,
Shared Memory Allocation, Launch Dimensions, Output Manager;
HloModule, LLVM IR)
3 DEEP FUSION
3.1 The Work/Span Analysis
Work/Span analysis is a handy tool for analyzing the parallel work
and the critical path of computation graphs (Guy, 1996). In
our implementation, we assign a number, the span, to each
instruction of the HloModule. The root instruction
has zero span. For any other instruction, its span equals
the maximum span of its users plus one. Work/Span analysis
is also useful for profiling entire Tensorflow graphs. However,
standard Work/Span analysis works only when
the graph is free of dependency loops. It is not uncommon
for practical Tensorflow graphs to include large, possibly
nested while loops. In this case, we perform a preprocessing
step to partition all nodes into multiple subgraphs,
each belonging to a separate frame context. We then perform
Work/Span analysis for each frame context independently.

After running this analysis, the maximum span assigned
is the length of the critical path. Instructions with the same
span are on the same layer (as shown in Figure 3), and there
are no data dependencies among them. Next we present our
fusion algorithm, which leverages this information to enlarge
the kernel granularity by effectively reducing the span of
the computation.
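The span assignment described above can be sketched in a few lines of Python. This is only an illustration, not the system's implementation: the graph is a hypothetical dictionary mapping each instruction name to the names of its users, the function names are invented, and any instruction without users is assumed to be the root.

```python
from collections import defaultdict


def compute_spans(users):
    """Assign a span to every instruction.

    `users` maps an instruction name to the list of instructions that
    consume its result. An instruction without users is treated as the
    root (span 0); this is an assumption of the sketch.
    """
    spans = {}

    def span(node):
        if node not in spans:
            node_users = users[node]
            # Root has span 0; otherwise max span of the users plus one.
            spans[node] = 1 + max(span(u) for u in node_users) if node_users else 0
        return spans[node]

    for node in users:
        span(node)
    return spans


def group_layers(spans):
    """Group instructions that share a span into one layer."""
    by_span = defaultdict(list)
    for node, s in spans.items():
        by_span[s].append(node)
    return dict(by_span)


# A tiny softmax-like graph: input -> exp -> {reduce, divide}, reduce -> divide.
users = {
    "input": ["exp"],
    "exp": ["reduce", "divide"],
    "reduce": ["divide"],
    "divide": [],
}
spans = compute_spans(users)
layers = group_layers(spans)
critical_path_length = max(spans.values())
```

In this toy graph, exp lands on layer 2 because its farthest user chain (reduce, then divide) has length two, even though it also feeds divide directly.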
3.2 The Fusion Algorithm
Through Work/Span analysis, we partition all instructions
within a module into layers (as shown in
Figure 3), where instructions in each layer have the same span.
Today most AI models rely on library calls to perform
MatMul/Conv. Since we do not fuse across library calls, we
are interested in computation subgraphs that lie
between any two consecutive library call layers (LC-layers).
The basic intuition of our fusion algorithm is to fuse as
many instructions as possible, subject to various fusion
constraints, in the subgraph of computations between two
LC-layers.

Starting from a given LC-layer, up to the next LC-layer
(the roof), for each layer (denoted as the root layer in the
discussion below), we repeat the following two steps. First, we
perform an intra-layer ElementwiseFusion at the root layer,
resulting in a set of fused computations. Second, for each
fusion instruction (fusion_root) in the root layer, we use
Algorithm 1 to perform subgraph fusion up to the next roof.
ElementwiseFusion targets intra-layer nodes without
producer/consumer relationships. The primary targets are
small weight accumulation layers, which occur frequently
in training graphs. For a large number of such fine grained
(e.g. < 10us) kernels, fusing them together can substantially
reduce launch overheads. The exact number of fused
computations to generate depends on two factors. One is
schedule compatibility. In practice, elementwise instructions
within a layer naturally fall into a few groups according
to output shapes. We will discuss schedule
planning further in the next section. The other factor is the fused
memory footprint. We use a tunable threshold parameter to
control the fusion granularity, in order to avoid extra large
elementwise computations with too many outputs.
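As a rough illustration of the two factors above, the following Python sketch groups same-layer elementwise instructions by output shape and then caps each group by a footprint threshold. The function name, the (name, shape) input format, and the use of output element count as the footprint measure are all assumptions made for this example.

```python
from collections import defaultdict
from math import prod


def elementwise_groups(instrs, max_footprint):
    """Partition same-layer elementwise instructions into fusion groups.

    `instrs` is a list of (name, output_shape) pairs. Instructions are
    first grouped by output shape (a stand-in for schedule
    compatibility), then each group is split so that the total number
    of output elements stays within `max_footprint` (a stand-in for the
    tunable footprint threshold).
    """
    by_shape = defaultdict(list)
    for name, shape in instrs:
        by_shape[shape].append(name)

    groups = []
    for shape, names in by_shape.items():
        per_group = max(1, max_footprint // prod(shape))
        for i in range(0, len(names), per_group):
            groups.append(names[i:i + per_group])
    return groups


instrs = [("a", (4,)), ("b", (4,)), ("c", (4,)), ("d", (8,))]
groups = elementwise_groups(instrs, max_footprint=8)
```

With a threshold of 8 elements, at most two of the 4-element instructions fit into one fused computation, so the three of them split into two groups.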
One implementation of subgraph fusion, starting from
fusion_root, up to the next roof, is shown in Algorithm 1.
The map hlo_span is the result of the Work/Span analysis,
recording the span of all instructions. The algorithm
traverses instructions layerwise, starting from the next layer,
up to the roof. During this traversal, instructions are either
fused (put in the fused set) or given up (put in the giveup set).

Algorithm 1 Subgraph Fusion Algorithm
  Input: fusion_root, roof, hlo_span
  Initialize curr_span = hlo_span[fusion_root]
  Initialize empty set fused
  Initialize empty set giveup
  for l = curr_span + 1 to roof − 1 do
    Initialize hlos to be all instructions with span == l
    for hlo : hlos do
      if SchdConsistent(fusion_root, hlo, fused, giveup) then
        Fuse(fusion_root, hlo)
        fused.insert(hlo)
      else
        giveup.insert(hlo)
      end if
    end for
  end for
The procedure SchdConsistent decides whether or not to
fuse an instruction hlo with the fusion_root. First, it checks
if hlo has a user in the giveup set. If so, hlo is not fused, in
order to avoid potential cyclic dependencies. Second,
it checks if hlo has a user in the fused set. If not, hlo is
not fused, because we consider only producer/consumer fusion
here, and leave the other case to ElementwiseFusion, as
discussed above. Finally, it checks if it is possible to resolve
an optimized schedule for the fused computation, and rejects
the fusion if not. We will discuss schedule planning and
optimization in more detail in Section 4.
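The layerwise traversal and the three checks of SchdConsistent can be sketched as follows. This is a simplified model rather than the actual XLA pass: the graph is a hypothetical users dictionary, the schedule check is a caller-supplied predicate `schedule_ok`, and the fused set is seeded with the fusion root so that its direct producers pass the "has a fused user" test (Algorithm 1 starts from an empty set, so this seeding is an assumption of the sketch).

```python
def schd_consistent(hlo, users, fused, giveup, schedule_ok):
    """Return True if `hlo` may be fused into the current computation."""
    hlo_users = users[hlo]
    # 1. A user that was given up would create a cyclic dependency.
    if any(u in giveup for u in hlo_users):
        return False
    # 2. Only producer/consumer fusion: some user must already be fused.
    if not any(u in fused for u in hlo_users):
        return False
    # 3. The enlarged computation must still admit a valid schedule.
    return schedule_ok(fused | {hlo})


def subgraph_fusion(fusion_root, roof, hlo_span, users, schedule_ok):
    """Fuse producers layer by layer, from the root's layer up to `roof`."""
    fused = {fusion_root}  # assumption: the root counts as already fused
    giveup = set()
    for layer in range(hlo_span[fusion_root] + 1, roof):
        for hlo in sorted(h for h, s in hlo_span.items() if s == layer):
            if schd_consistent(hlo, users, fused, giveup, schedule_ok):
                fused.add(hlo)
            else:
                giveup.add(hlo)
    return fused, giveup


users = {
    "divide": [],
    "exp": ["divide"],
    "reduce": ["divide"],
    "sub": ["exp", "reduce"],
    "param": ["sub"],
}
hlo_span = {"divide": 0, "exp": 1, "reduce": 1, "sub": 2, "param": 3}

# Toy schedule check: pretend no valid schedule exists once "reduce" joins.
fused, giveup = subgraph_fusion(
    "divide", roof=3, hlo_span=hlo_span, users=users,
    schedule_ok=lambda group: "reduce" not in group,
)
```

Here sub is given up even though its user exp was fused, because its other user reduce is in the giveup set; this is the first check at work.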
4 SCHEDULE PLANNING
4.1 Schedule Specification
In a fused computation, each instruction has an output
shape, which defines the total work space. The implementation
space for each instruction can be huge, and it is
prohibitively expensive to exhaust the entire space in order to
get the most optimized kernel. Before we discuss our tradeoff
on this issue, we reiterate the two major objectives of op
fusion: (1) reduce the memory footprint of the fused
computation; (2) reduce the number of kernel launches caused by
the fine grained nature of many Tensorflow ops. In practice,
most fused computations in our workloads are memory
intensive, elementwise computations. Thus our design
rationale is to facilitate the composition of numerous
instructions within a kernel in order to take full advantage
of hardware resources, rather than pursuing extreme
performance of individual ops.

Figure 5. An Example of Schedule Space for Reduce
(left: Row schedule with split_dim = 2, min_reduce_dim = 3,
max_reduce_dim = 4; right: Column schedule with min_reduce_dim = 2,
max_reduce_dim = 3, split_dim = 4)

  Row Schedule:
    int blocks = sword;
    for (int dim = 0; dim < split_dim; ++dim) {
      blocks *= shape.dimensions(dim);
    }

  Column Schedule:
    int blocks = sword;
    for (int dim = split_dim+1; dim < num_dims; ++dim) {
      blocks *= shape.dimensions(dim);
    }
For each instruction, we define three parameters on the
output shape (the work space) to fully specify an
implementation schedule: split_dim, sword, and sched_type. The
idea is to split the work space into multiple data chunks,
where each thread block (CTA) works on a chunk. Here,
split_dim denotes the dimension along which we split the work
space, sword denotes how we partition the dimension
split_dim, and sched_type can be either Row or Column.

As an example, Figure 5 shows a Row schedule (left) and
a Column schedule (right) for Reduce, both reducing input
tensors of 7 dims to output tensors of 5 dims (for the
left Row schedule, dimensions 3 and 4 are reduced, while for
the right Column schedule, dimensions 2 and 3 are reduced;
these correspond to the gray cells in the figure). blocks
denotes the number of GPU thread blocks, that is, the number
of data chunks after partitioning the entire work space. In a
Row schedule, the dims to the left of (more significant than)
split_dim contribute to blocks. The C code excerpt in Figure 5
shows in detail how to calculate blocks from split_dim, sword,
and the shape dims.
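The blocks calculation can be restated in Python as below. The sketch assumes, following our reading of the C excerpt in Figure 5, that a Row schedule multiplies sword by the dims to the left of split_dim and a Column schedule multiplies it by the dims to the right; the function name and the tuple representation of shapes are illustrative only.

```python
def num_blocks(shape, split_dim, sword, sched_type):
    """Number of thread blocks implied by a schedule.

    shape      -- tuple of dimension sizes of the output shape
    split_dim  -- index of the dimension that is split
    sword      -- number of pieces split_dim is partitioned into
    sched_type -- "Row" or "Column"
    """
    if sched_type == "Row":
        dims = range(0, split_dim)               # dims left of split_dim
    elif sched_type == "Column":
        dims = range(split_dim + 1, len(shape))  # dims right of split_dim
    else:
        raise ValueError(f"unknown sched_type: {sched_type!r}")

    blocks = sword
    for dim in dims:
        blocks *= shape[dim]
    return blocks


shape = (2, 3, 4, 5, 6)
row_blocks = num_blocks(shape, split_dim=2, sword=2, sched_type="Row")
col_blocks = num_blocks(shape, split_dim=2, sword=2, sched_type="Column")
```

For the shape (2, 3, 4, 5, 6) split at dimension 2 with sword = 2, the Row schedule yields 2 * 2 * 3 blocks and the Column schedule yields 2 * 5 * 6 blocks.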
Let num_dims be the number of dims of a tensor shape.
split_dim must be an integer in the range [0, num_dims). Let
K be the size of dimension split_dim. sword must be
a divisor of K. sched_type can be either Row or Column.
Given an hlo instruction, the Cartesian product of the legal value
sets of split_dim, sword, and sched_type defines the
entire schedule space.
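To make the size of this space concrete, the following sketch enumerates it for a given output shape. It is a direct reading of the three legality rules above; the function names and the tuple encoding of a schedule are assumptions for illustration.

```python
def divisors(k):
    """All positive divisors of k, in increasing order."""
    return [d for d in range(1, k + 1) if k % d == 0]


def schedule_space(shape):
    """Enumerate every legal (split_dim, sword, sched_type) triple."""
    space = []
    for split_dim, size in enumerate(shape):   # split_dim in [0, num_dims)
        for sword in divisors(size):           # sword divides K = shape[split_dim]
            for sched_type in ("Row", "Column"):
                space.append((split_dim, sword, sched_type))
    return space


space = schedule_space((4, 6))
```

A shape of (4, 6) has 3 divisors for the first dim and 4 for the second, so the space holds (3 + 4) * 2 = 14 schedules, which illustrates why per-op schedule spaces stay small.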
The size of the schedule space of a single op depends on
its output shape dims, but is usually small in practice. This
is important for compilation speed. Our schedule specification
has a relatively small search space and is not designed to
exhaust all implementation possibilities. Yet, together
with tuning, it enables the most important kernel optimizations
we want for GPUs, while allowing simple
and concise code generation implementations, thanks to the
computation regularity of Tensorflow ops.
4.2 Schedule Constraints and Propagation
The output shape of the fused computation is the same as
that of its root (output) instruction. Given a valid schedule
of the root instruction, we must decide whether it is
satisfiable by all other instructions of the computation. We use
an algorithm conceptually similar to Work/Span analysis to
resolve schedule constraints for all other instructions. Note
that for each instruction, the valid schedule is defined on its
output shape. If the schedule constraints are satisfiable, we
back propagate the schedule to its input shape(s),
which correspond to the output shape(s) of its operand(s),
and repeat. Otherwise, the schedule initiated from the root
instruction is not satisfiable for the computation.

Table 1. Schedule Constraints Propagation Rules

  Op           Rules
  -----------  ------------------------------------------------
  Elementwise  Pass Row, Column
  Transpose    split_dim <= min_trans_dim: Pass Row
               split_dim >= max_trans_dim: Pass Column
  Reduce       split_dim <= min_reduce_dim: Pass Row
               split_dim >= max_reduce_dim: Pass Column
  BatchDot     split_dim < num_dims − 2: Pass Row
  Reshape      Reshape transforms split_dim and sword;
               Pass Row, Column
  Broadcast    Broadcast transforms split_dim and sword;
               Pass Row, Column
Figure 5 shows an example of schedule constraints. For
Reduce, we require all reduction dims to be in the same
thread block in order to balance codegen simplicity
and kernel efficiency. In this case, if split_dim <
min_reduce_dim, only sched_type = Row is meaningful.
The case for Transpose is similar. Other constraints
include the divisibility requirement imposed by sword.

Table 1 summarizes the rules of different ops for schedule
constraints propagation. For Elementwise, we propagate the
schedule, either Row or Column, back to its operand(s).
For Reduce, we only propagate a Row schedule if split_dim <
min_reduce_dim. If split_dim >= min_reduce_dim, we
adjust split_dim and sword according to the reduce dims.
We propagate a Row or Column schedule if split_dim =
min_reduce_dim, and only a Column schedule if
split_dim > min_reduce_dim. The case for Transpose is
similar. For BatchDot, only Row schedules are propagated
and split_dim must be a batch dim (< num_dims − 2);
otherwise the schedule is not satisfiable.
Reshape and Broadcast modulate shapes. Therefore,
we first transform the output split_dim and sword to the
input split_dim and sword, according to the schedule
specification shown in Figure 5. Then we propagate the schedule,
either Row or Column.
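The satisfiability part of these rules can be written as a small predicate. The sketch below follows the conditions of Table 1 literally and leaves out the split_dim/sword adjustment for Reduce and the shape transformation for Reshape and Broadcast; the function name and argument layout are hypothetical.

```python
def can_propagate(op, split_dim, sched_type, dims=(), num_dims=0):
    """Check whether a schedule on an op's output can pass to its operands.

    op         -- "Elementwise", "Transpose", "Reduce" or "BatchDot"
    split_dim  -- split dimension of the schedule
    sched_type -- "Row" or "Column"
    dims       -- transposed or reduced dimensions (Transpose/Reduce only)
    num_dims   -- rank of the output shape (BatchDot only)
    """
    if sched_type not in ("Row", "Column"):
        raise ValueError(f"unknown sched_type: {sched_type!r}")
    if op == "Elementwise":
        return True
    if op in ("Transpose", "Reduce"):
        if sched_type == "Row":
            return split_dim <= min(dims)
        return split_dim >= max(dims)
    if op == "BatchDot":
        # Only Row schedules that split a batch dimension are accepted.
        return sched_type == "Row" and split_dim < num_dims - 2
    raise ValueError(f"no rule sketched for op: {op!r}")


row_ok = can_propagate("Reduce", 2, "Row", dims=(3, 4))
row_bad = can_propagate("Reduce", 4, "Row", dims=(3, 4))
col_ok = can_propagate("Reduce", 4, "Column", dims=(2, 3))
```

The three calls mirror the two schedules of Figure 5: a Row schedule split at dimension 2 passes a reduction over dims 3 and 4, and a Column schedule split at dimension 4 passes a reduction over dims 2 and 3.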
4.3 Schedule Tuning
There is always a valid Row schedule for any fused
computation, with split_dim = 0 and sword = 1. In this case,
we only use one thread block for all instructions. However,
in practice this leads to underutilization
of GPU resources. Together with the performance library
(discussed below), schedule tuning iterates over all candidate
schedules of the root to look for the most efficient one.
We use this optimized schedule to direct code generation.
If the fused computation has one single root, we iterate over
all its candidate schedules. For each schedule, we test if it
is satisfiable. If true, we lookup the performance library
w.r.t. the schedule, and sum up the kernel execution time
of all ops of the computation. The schedule with the best
performance is chosen for code generation.
If there are multiple roots, we use a two-stage approach to
speed up exploration of the search space. In the first stage,
we iterate over all roots. For each root, we compute two
sets: one is the set of valid blocks values (shown in Figure 5),
the other is the set of valid schedules corresponding to those
blocks values. Once the blocks sets for all roots are available, we
intersect these sets to obtain all candidate
blocks values that are satisfiable by all roots. This reduces the
performance tuning space that needs to be explored next.
The second stage starts from the resultant blocks set agreed
by all roots. We iterate over all schedules corresponding
to the blocks set. For each schedule, we accumulate the
kernel execution for each root, and sum them up to obtain
the whole performance metric for the computation. The
schedule with the best performance is our chosen target.
In our implementation we perform two additional optimizations.
First, in evaluating the performance of individual instructions,
we sometimes ignore computationally trivial
ops, such as Reshape, Broadcast, small Transpose ops, etc.
Such ops can be inlined via thread composition (similar to
ElementalIREmitter in XLA) with negligible performance
loss. Yet if we keep them, their strict modulation of shapes
sometimes rejects highly optimized schedules. Bypassing
them can make those optimized schedules satisfiable.
The second optimization is further pruning of the search
space if there are multiple roots. During the second stage
of schedule evaluation, we always keep the best performing
schedule achieved so far. If, during the evaluation process
of some schedule, the execution time accumulated has al-
ready exceeded that of the total latency of the best schedule,
we simply skip the process and continue to explore the next
schedule.
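The tuning loop with early pruning, together with the first-stage intersection of blocks sets, can be sketched as follows. Everything here is illustrative: `lookup(op, schedule)` stands in for a performance library query, schedules are opaque labels, and the cost numbers are invented.

```python
def common_blocks(per_root_blocks):
    """Stage one: blocks values that every root can satisfy."""
    return set.intersection(*per_root_blocks)


def tune(candidates, ops, lookup):
    """Stage two: pick the schedule with the smallest accumulated time.

    Evaluation of a schedule is abandoned as soon as its running total
    reaches the best total found so far.
    """
    best_schedule, best_time = None, float("inf")
    for schedule in candidates:
        total = 0.0
        for op in ops:
            total += lookup(op, schedule)
            if total >= best_time:
                break  # cannot beat the best schedule; prune
        else:
            best_schedule, best_time = schedule, total
    return best_schedule, best_time


# Invented per-op kernel times (microseconds) under three schedules.
costs = {
    ("exp", "A"): 5, ("reduce", "A"): 4,
    ("exp", "B"): 2, ("reduce", "B"): 3,
    ("exp", "C"): 6, ("reduce", "C"): 1,
}
best = tune(["A", "B", "C"], ["exp", "reduce"], lambda op, s: costs[(op, s)])
shared = common_blocks([{1, 2, 4, 8}, {2, 4, 6}, {2, 4, 16}])
```

Schedule C is pruned after its first op, since 6 already exceeds the total of 5 achieved by schedule B.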
4.4 The Performance Library
The performance library is a key-value store that contains
kernel performance data for various types of instructions
under different implementation schedules. Common
features in a key include opcode, shape, split_dim,
sword, sched_type and thread block size. The thread block
size is an integer in [1, 1024] and must be a multiple of the
GPU warp size (32). There are also op-specific features.
For instance, Reduce (or Transpose) has an additional feature,
reduce_warps (or trans_warps), denoting how many
GPU warps in the thread block are used to perform the
reduction (or transpose) loop.
We keep the performance library in permanent storage for
repeated use. At system initialization, the library is
loaded into memory. During the tuning process, the library
module takes schedule keys as lookup requests. If the
key exists in the library, the result is returned immediately.
Otherwise, the module constructs a CUDA C kernel from
the key, then compiles and executes it on the GPU. We use the
nvprof tool to collect the kernel execution time and insert
the new key-value pair into the library for future use.
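The lookup-or-measure behaviour can be modelled with a small in-memory cache. In this sketch the key fields follow the common features listed above, while the class, the `measure` callback (a placeholder for kernel construction, compilation and profiling) and the fake timing formula are assumptions; persistence to storage is omitted.

```python
from collections import namedtuple

# Common key features; op-specific ones such as reduce_warps are left out.
ScheduleKey = namedtuple(
    "ScheduleKey",
    ["opcode", "shape", "split_dim", "sword", "sched_type", "block_size"],
)


class PerformanceLibrary:
    """Key-value store of kernel times with measure-on-miss."""

    def __init__(self, measure, table=None):
        self._measure = measure          # called only when a key is missing
        self._table = dict(table or {})
        self.misses = 0

    def lookup(self, key):
        if key not in self._table:
            self.misses += 1
            self._table[key] = self._measure(key)
        return self._table[key]


def fake_measure(key):
    # Stand-in for generating, compiling and profiling a kernel.
    return 10.0 / key.sword


library = PerformanceLibrary(fake_measure)
key = ScheduleKey("exp", (64, 128), 0, 4, "Row", 256)
first = library.lookup(key)
second = library.lookup(key)
```

The second lookup returns the cached value without calling the measurement callback again, which is the reuse the warmup discussion relies on.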
When a key misses the library, kernel generation and
performance collection appear to be costly operations during
JIT compilation. This is true in the initial warmup phase.
Later on we observe a high degree of data reuse in our
workloads. In addition, as discussed before, most kernels only
take several to tens of microseconds to execute. Nevertheless,
it should be possible to build a learned model to predict
a performance metric from the features in the key and
return the predicted value to the tuning process immediately,
thus shortening the critical path by performing kernel
generation, compilation and execution asynchronously. We
leave this as future work.
The goal of fusion is to pack the computations of all
instructions into a single kernel. In schedule tuning, we use
the accumulated performance of individual ops to estimate
the performance of the kernel of the entire fused computation.
This approach is not meant to predict the exact execution
time of the fused kernel, but works well in reaching an
optimized set of parameters to effectively direct code
generation.
Based on the concise specification of the schedule space,
effective schedule space exploration and a performance library
driven tuning mechanism, FusionStitching can efficiently
enumerate a huge number of fusion possibilities, thus opening
up much more opportunity for the subsequent code generation
phase, as illustrated in Section 6.
5 CODE GENERATION
5.1 Shared Memory Planning
The on chip shared memory is essential to facilitate thread
block composition of numerous compute expensive ops.
This is important to achieve relatively large and optimized
kernels. To perform shared memory planning, we first iden-
tify candidate ops which may need shared memory, then
prioritize shared memory usage to most critical ops when
space is not enough, and facilitate space sharing among ops
on the data flow.
5.1.1 Size Requirements Analysis
Size requirements analysis identifies all ops that may use
shared memory. In the example computation shown in
Figure 3, ops in green boxes have shared memory require-
ments. There are several cases to note.
One is direct allocation. For Reduce or BatchDot, if it is not
the root instruction, we must allocate shared memory for its
intermediate results, allowing consumer ops to use separate
parallel loop emitters to generate code.
Other cases are related to expensive elementwise ops, such
as Exp, Divide, Log, etc. In general, if such an instruction
has multiple users, we may want to allocate shared memory
to buffer its results in order to achieve as much computation
reuse as possible. Note that this also holds for inexpensive
ops. This is a performance consideration only. Hence,
if size requirements have reached the limit, we give up
shared memory usage in a proper order in these cases and
recompute those elementwise ops, which preserves correctness.
For an expensive elementwise op, sometimes even if it has
only one user, we must use shared memory in order to
achieve acceptable performance. One example is shown
in Figure 3. The Divide.1 is followed by Bitcast.1, which is
then followed by a BatchMatMul (Dot.1). Due to high de-
gree of data reuse in Dot.1, shared memory is important for
performance here. To address this issue, we analyze data
flow in this case in order to identify all such expensive ops.
5.1.2 Size Shrinking
Size shrinking is applied when the size requirement of the fused
computation exceeds the shared memory limit. One main reason this
happens is when blocks is small, in which case each thread block
needs to process a large chunk of data.
The basic idea is to trade shared memory space for recomputation.
To reduce size requirements, we start from inexpensive elementwise
ops with multiple users, then expensive elementwise ops with
multiple users, and finally expensive ops with transitive uses by
BatchMatMul. Even within this order, there may still be multiple
candidates to choose from. In this case, we prioritize the one that
is closest to the root instruction in the span of the graph.
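The shrinking order can be sketched as a greedy loop. This is a minimal illustration, not the actual implementation: the tuple layout, the reason labels and the byte sizes are hypothetical, and the 20KB limit is the per-kernel limit reported in Section 6.5. Direct allocations for Reduce and BatchMatMul are never dropped in this sketch, since only elementwise ops are listed as shrinking candidates.

```python
# Lower value = given up earlier (hypothetical labels for the three classes).
PRIORITY = {
    "cheap-multi-user": 0,
    "expensive-multi-user": 1,
    "expensive-feeds-dot": 2,
}


def shrink(requests, limit):
    """Drop shared memory requests until the total fits within `limit`.

    requests: list of (name, reason, span_to_root, nbytes) tuples.
    Returns (kept requests, names of ops to recompute, whether the rest fits).
    """
    kept = list(requests)
    # Order candidates by class, then by distance to the root instruction.
    droppable = sorted(
        (r for r in requests if r[1] in PRIORITY),
        key=lambda r: (PRIORITY[r[1]], r[2]),
    )
    recomputed = []
    for req in droppable:
        if sum(r[3] for r in kept) <= limit:
            break
        kept.remove(req)            # trade this buffer for recomputation
        recomputed.append(req[0])
    fits = sum(r[3] for r in kept) <= limit
    return kept, recomputed, fits


requests = [
    ("Reduce.1", "direct", 3, 4096),
    ("Add.1", "cheap-multi-user", 2, 8192),
    ("Exp.1", "expensive-multi-user", 4, 8192),
    ("Divide.1", "expensive-feeds-dot", 1, 8192),
]
kept, recomputed, fits = shrink(requests, limit=20 * 1024)
```

When `fits` is still false after all candidates are dropped, the sketch has nothing more to give up; this corresponds to the situation where a feedback signal to the fusion module is needed.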
Size shrinking is a best effort approach to reduce shared memory
usage. If, after shrinking and space sharing analysis (discussed
below), there is still not enough space, a feedback signal is sent
back to ScheduleConsistencyChecker in the fusion module to trigger
other fusion decisions. In practice, this happens only on large
fused computations where schedule planning fails to produce an
optimized schedule. Thus this feedback provides an effective
mechanism to control fusion granularity.
Algorithm 2 IrEmitterStitched
Input: hlo, root, shared, schedule, generators
if !root && !shared.count(hlo) && !dot && !reduce then
    return ElementalIrEmitter(hlo)
end if
StitchedEmitter(hlo, schedule)
if shared.count(hlo) then
    EmitWriteSharedArray
end if
if root then
    EmitWriteOutputArray
else
    EmitGenerator(generators, hlo)
end if
5.1.3 Space Sharing
Space sharing is an effective technique to reuse shared memory
space. In the example in Figure 3, Reduce.2 reuses the shared space
allocated for Reduce.1, and Divide.1 reuses the shared space
allocated for Exponential.1.
To facilitate sharing, we first build a dominance tree (Cooper et
al.) starting from the root instruction. Then we perform another
round of data flow analysis using the dominance tree to realize
space sharing. In the case shown in Figure 3, the shared space
allocated for Reduce.1 can be shared after Exponential.1, and can
be reused by Reduce.2 because Reduce.2 dominates Reduce.1.
Similarly, Divide.1 dominates and reuses the buffer allocated for
Exponential.1.
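The dominance test behind space sharing can be illustrated with a small Python sketch. It uses a simple iterative set-based dominator computation rather than the dominance tree algorithm of Cooper et al., and the graph below is a hypothetical softmax-like example, not a transcription of Figure 3. The names dominators and can_reuse are assumptions made for illustration. Edges run from each op to its operands, so the traversal starts at the root instruction.

```python
def dominators(operands, root):
    """Map each op to the set of ops that dominate it, walking from `root`.

    operands: dict mapping an op name to the list of ops it reads.
    """
    nodes = set(operands)
    for ops in operands.values():
        nodes.update(ops)
    # Predecessors in the root-to-operand traversal are the users of an op.
    users = {n: [] for n in nodes}
    for user, ops in operands.items():
        for op in ops:
            users[op].append(user)
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == root:
                continue
            preds = users[n]
            new = (set.intersection(*(dom[p] for p in preds)) | {n}) if preds else {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def can_reuse(dom, later, earlier):
    """`later` may take over `earlier`'s buffer if it dominates `earlier`."""
    return later != earlier and later in dom[earlier]


# Hypothetical graph: each op lists the ops it reads.
graph = {
    "Dot": ["Bitcast"],
    "Bitcast": ["Divide"],
    "Divide": ["Exp", "ReduceSum"],
    "ReduceSum": ["Exp"],
    "Exp": ["Sub"],
    "Sub": ["Input", "ReduceMax"],
    "ReduceMax": ["Input"],
}
dom = dominators(graph, "Dot")
```

In this hypothetical graph, Divide dominates Exp because every path from the root to Exp passes through Divide, so the buffer of Exp is a reuse candidate for Divide. ReduceSum does not dominate Exp, because Divide also reads Exp directly.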
5.2 Code Generation
The schedule and shared memory planning set up the foundation for
codegen. We build our work on the HLO visitor framework available
in XLA. The GpuElementalIrEmitter in XLA implements thread
composition of computations. Algorithm 2 sketches the basic idea of
our block composition procedure, IrEmitterStitched.
Table 2. Benchmarks
Name     Category    Description
LR       Training    Logistic Regression
W2V      Training    Word2Vector
RNN      Training    Recurrent Neural Network
BiRNN    Training    Bidirectional RNN
Speech   Training    Speech Recognition
NMT      Inference   Neural Machine Translation
IrEmitterStitched takes several inputs. hlo is the target
instruction to emit code for. schedule and shared are the outputs
of schedule planning and shared memory planning, respectively. root
tells whether hlo is the root instruction. generators is similar to
the generators map in XLA; the difference lies in shared memory
handling. If hlo is not the output instruction, is neither
BatchMatMul nor Reduce, and does not use shared memory, we fall
back to ElementalIrEmitter in XLA; otherwise StitchedEmitter is
called to emit code based on an optimized schedule. When this is
done, we store computation results to shared memory if required by
calling EmitWriteSharedArray. If root is true, which means hlo is
an output of the computation, code is emitted to write results to
global memory via EmitWriteOutputArray. If root is false, we insert
an entry into the generators map for hlo, in order to support
further composition of hlo with other instructions. In the
implementation, we encapsulate codegen logic related to computation
results, including shared/global memory handling, into an
OutputManager object, as shown in Figure 4.
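The control flow of Algorithm 2 can be mirrored in a few lines of Python. This is only a sketch of the dispatch logic: instead of emitting LLVM IR it records the names of the emitter calls, the schedule argument is omitted, and the function signature, the kind strings and the value stored in generators are hypothetical.

```python
def ir_emitter_stitched(hlo, kind, is_root, shared, generators):
    """Return the list of emitter steps taken for instruction `hlo`.

    shared: set of instruction names that were assigned shared memory.
    generators: dict updated in place for non-root stitched instructions.
    """
    needs_stitching = is_root or hlo in shared or kind in ("dot", "reduce")
    if not needs_stitching:
        # Plain elementwise op: fall back to thread composition.
        return [f"ElementalIrEmitter({hlo})"]

    steps = [f"StitchedEmitter({hlo})"]
    if hlo in shared:
        steps.append(f"EmitWriteSharedArray({hlo})")
    if is_root:
        steps.append(f"EmitWriteOutputArray({hlo})")
    else:
        # Register a generator so that later instructions can read the result.
        generators[hlo] = f"read result of {hlo}"
        steps.append(f"EmitGenerator({hlo})")
    return steps


generators = {}
shared = {"Reduce.1"}
elementwise_steps = ir_emitter_stitched("Add.1", "add", False, shared, generators)
reduce_steps = ir_emitter_stitched("Reduce.1", "reduce", False, shared, generators)
root_steps = ir_emitter_stitched("Dot.1", "dot", True, shared, generators)
```

In this sketch only the non-root Reduce ends up in the generators map, which matches the description that generator entries exist to support further composition with other instructions.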
6 EXPERIMENTAL EVALUATION
6.1 Experimental Setup
We implemented FusionStitching on Tensorflow 1.7. Experimental
results are collected on a Pascal GPU, with 3584 cores and 64KB
shared memory per SM. Table 2 summarizes our benchmarks, ranging
from small and medium public models to large inhouse applications
in our production environments. LR, W2V, RNN and BiRNN are public
models (aymericdamien), all with default configurations. Speech is
an inhouse speech application, trained on voice samples collected
from millions of consumer side portable audio systems.
NMT is an inhouse variant of neural machine translation based on
the attention mechanism (Vaswani et al., 2017; Xiong et al., 2018).
There are two use cases. One is offline translation of descriptions
of billions of commodities from one language to another. In this
case, batch processing is available to maximize efficiency. The
other use case is realtime, online communication between sellers
and buyers. In this case, batch size is small, and latency is
critical. In both cases, every millisecond of performance
improvement is of significance in practice. There is a strong
incentive to optimize as much as possible beyond MatMul/Conv.
Our evaluation baseline is the XLA implementation of fusion and
code generation. It is important to note that XLA has already done
excellent work on common elementwise and producer/consumer
patterns. With FusionStitching, we are interested in how much
additional improvement is possible for these workloads.
6.2 Fusion Potential Analysis
Optimization targets of FusionStitching are subgraphs of ops
excluding library calls, of which only cuDNN and cuBLAS are
relevant in our case. Figure 6 shows the execution breakdown
between MatMul/Conv and other ops for all benchmarks. As can be
seen, the potentially fusable component (the top portion) takes
20% to 50% of execution time. Large, dense MatMul/Conv ops are
friendly to GPUs, but are computationally costly. In practice we
tend to use less expensive ops that still give acceptable accuracy.
In addition, some MatMul/Conv ops have particular sizes for which
calling vendor libraries brings only marginal performance gain.
Deep fusion and efficient code generation are critical for
performance in these cases.
Figure 6. Execution Breakdown (x-axis: LR, W2V, RNN, BiRNN, Speech, NMT; y-axis: execution breakdown, 0 to 100)
6.3 The Fusion Ratio
One important goal of fusion is to enlarge granularity, and thus
reduce the number of GPU kernels launched. We measure the ratio
between the number of kernels of FusionStitching and that of the
baseline (excluding library call kernels). We use nvprof to collect
detailed kernel information. The result is shown in Figure 7.
The fusion results depend on workloads. For most of them, the
fusion ratio is less than 0.5. This means FusionStitching can
further reduce the number of kernels to less than half that of the
baseline. W2V has the highest fusion ratio (0.82), because the core
computation pattern in this case is friendly to XLA, with limited
room left for further fusion. FusionStitching performs best on
Speech (0.25). In this case, there are complex interaction patterns
among reduce, transpose, concat, and elementwise ops.
FusionStitching handles them gracefully.
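As a small worked example of the metric, the Python sketch below computes a fusion ratio from kernel counts and a geometric mean over several ratios. The kernel counts are hypothetical placeholders, not measurements from this paper, and the function names are assumptions; real counts would come from profiler output with library call kernels filtered out.

```python
import math


def fusion_ratio(stitched_kernels, baseline_kernels):
    """Kernels launched with FusionStitching divided by kernels with the baseline."""
    return stitched_kernels / baseline_kernels


def geometric_mean(values):
    """Geometric mean of positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical counts: 200 baseline kernels reduced to 50.
ratio = fusion_ratio(50, 200)        # 0.25, i.e. a 75% reduction in launches
mean_ratio = geometric_mean([0.25, 1.0])
```

A ratio of 0.25 corresponds to a 75% reduction in kernel launches, and a ratio of 1.0 means no additional fusion over the baseline.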
6.4 Performance Speedup
Figure 7. Fusion Ratio (x-axis: LR, W2V, RNN, BiRNN, Speech, NMT; y-axis: fusion ratio, 0.0 to 0.8)
The ultimate goal of fusion and code generation is performance.
Figure 8 shows results for all workloads. We report three numbers
for each benchmark. The FusionSpeedup (left) measures performance
improvement of the fusable portion only (the top portion in
Figure 6, in contrast to the MatMul/Conv portion). We use
FusableRatio to denote the execution time ratio of the fusable
portion. The end to end (E2E) speedup (right) measures performance
speedup of the whole network. The predicted E2E (middle) predicts
the actual E2E speedup using the following formula:
$$\text{Predicted E2E} = 1 + \mathit{FusableRatio} \times \left(1 - \frac{1}{\mathit{FusionSpeedup}}\right)$$
The FusionSpeedup ranges from 1.15 (W2V) to 3.5 (Speech). The
average speedup (geometric mean) is 1.74. This speedup roughly
corresponds to the reciprocal of the fusion ratio. The reason is
that, in most fusion cases in these workloads, ops are generally
fine grained and memory intensive. Fusing them together effectively
reduces launch overheads and memory footprints. This motivates us
to introduce the above empirical formula to predict E2E speedup. As
can be seen, predicted speedups are close to measured E2E speedup
numbers.
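The empirical prediction is easy to evaluate numerically. The Python sketch below implements the formula as stated in the text; the function names and the input values are illustrative assumptions, not measurements. For comparison only, it also evaluates Amdahl's law on the same inputs, which is a different and generally more optimistic estimate.

```python
def predicted_e2e_speedup(fusable_ratio, fusion_speedup):
    """Empirical formula from the text: 1 + FusableRatio * (1 - 1 / FusionSpeedup)."""
    return 1.0 + fusable_ratio * (1.0 - 1.0 / fusion_speedup)


def amdahl_speedup(fusable_ratio, fusion_speedup):
    """Amdahl's law, shown only for comparison with the empirical formula."""
    return 1.0 / ((1.0 - fusable_ratio) + fusable_ratio / fusion_speedup)


# Illustrative inputs: half of the time is fusable and that part runs 2x faster.
predicted = predicted_e2e_speedup(0.5, 2.0)   # 1 + 0.5 * 0.5
amdahl = amdahl_speedup(0.5, 2.0)             # 1 / 0.75
```

With these inputs the empirical formula gives 1.25 while Amdahl's law gives about 1.33. Both reduce to 1.0 when FusionSpeedup is 1, that is, when fusion brings no improvement.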
Besides FusionSpeedup (a measure of capability), FusableRatio (a
measure of potential) also has a strong impact on E2E performance
speedups. The E2E speedup from FusionStitching varies across
workloads, ranging from 5% to 20%, with a geometric mean of 13%.
6.5 Shared Memory Analysis
In FusionStitching, on-chip shared memory is essential to composing
numerous ops with different parallel loop emitters together.
Table 3 summarizes shared memory usage behaviors. The Average
column shows how much shared memory (in bytes) is allocated per
kernel on average. The Max column shows the maximum space (in
bytes) allocated. We set an upper limit (currently 20KB) for shared
memory usage of a kernel. Once the requested size exceeds this
limit, the shrinking process is triggered. The #Shrink column shows
how many kernels have triggered the shrinking process. Finally, the
Shared Ratio column shows, on average, the fraction of the total
space allocated for a kernel that is shared by multiple ops.
Figure 8. Performance Speedup (x-axis: LR, W2V, RNN, BiRNN, Speech, NMT; y-axis: speedup, 0.0 to 3.5; series: Fusion Speedup, Predicted E2E Speedup, E2E Speedup)
Table 3. Shared Memory Statistics
Workload   Average (bytes)   Max (bytes)   #Shrink   Shared Ratio
LR         64                128           0         0.
W2V        96                288           0         0.
RNN        1344              3072          0         0.
BiRNN      1376              5120          0         0.
Speech     9504              16416         3         0.01
NMT        256               10432         0         0.17
Different workloads exhibit very different shared memory behaviors.
LR, W2V, RNN and BiRNN have relatively simple producer/consumer
patterns, and neither size shrinking nor sharing happens. Speech
has large requirements for shared memory. This is in part due to
large computation granularity. In addition, shape modulation ops
(such as transpose) sometimes result in large thread block sizes in
schedule planning, increasing shared memory requirements. While
little (1%) of the allocated space is shared in Speech, this number
is 17% for NMT, indicating a certain degree of reuse of computation
results in the graph, as illustrated in Figure 3. The pattern in
this figure is one of the computationally intensive subgraphs of
NMT.
7 RELATED WORK
GPU kernel fusion, inspired by classical loop optimizations (Ding &
Kennedy, 2004; Kennedy & Allen, 2002), is known to boost
performance in other application domains. In the database domain,
KernelWeaver (Wu et al., 2012) proposed transformations to fuse
execution of multiple operators into a single kernel. This work
provided support for both thread and block (CTA) composition of
operators, yet with little support for tuning of implementation
schedules. In the HPC domain, Wahib & Maruyama (2014) formulated
GPU kernel fusion as a combinatorial search problem, and searched
the solution space for an optimized fused kernel. Our work targets
Tensorflow computation graphs, and proposes dedicated fusion,
tuning and code generation to achieve high performance.
The parametric representation of the implementation schedule is
inspired by Halide (Kelley et al., 2018) and TVM (Chen et al.,
2018; tvm). However, instead of relying on users to specify
schedule details, we propose a compact and efficient schedule
specification and tuning framework for Tensorflow graphs.
Experimental results show decent performance gain on the fusable
portion of the graph. The layered span graph, used in our fusion
algorithm, is inspired by Work/Span (Guy, 1996) analysis of
parallel computation DAGs and the layered dependence graph
representation (Ma et al., 2016) of stencil kernels.
In our work, we do not fuse dense DNN layers, and instead leverage
vendor libraries for their performance. However, there are recent
advances on code generation of fast DNN kernels. Anderson & Gregg
(2018) proposed a solution for selecting fast kernel
implementations in the global context by formulating it as a PBQP
problem. Boda (Moskewicz et al., 2017) is a code generator that
generates code for CNN layers on mobile platforms. Latte (Truong et
al., 2016) is a DSL system for DNNs allowing users to specify,
synthesize and optimize code for NN layers. SLINGEN (Spampinato et
al., 2018) is another DSL system which takes mathematical
specifications and generates optimized C functions for linear
algebra operators with small input sizes. These works are relevant
but complementary to our work.
8 CONCLUSION AND FUTURE WORK
In this paper we propose FusionStitching, a deep fusion
and code generation system based on the XLA compila-
tion framework for Tensorflow computations. Our system
features a critical path analysis to drive fusion decisions,
a novel domain specific schedule specification and tuning
mechanism for kernels, and a shared memory optimization
technique to enable composition of large kernels. Experi-
mental results show notable reduction of GPU kernels, and
reasonable E2E performance speedups on our benchmarks.
In practical workloads, many DNN layers only have small
to medium sizes. With recent advances on DNN kernel gen-
eration, especially on powerful hardware with mixed pre-
cision functionality, it would be interesting to fuse DNN
layers as well and solve a global optimization problem.
REFERENCES
Tvm: Open deep learning compiler stack. URL
https://github.com/dmlc/tvm.
Torch nn, 2015. URL https://github.com/torch/nn.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro,
C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat,
S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y.,
Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D.,
Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M.,
Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P.,
Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden,
P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow:
Large-scale machine learning on heterogeneous systems,
2015. URL http://tensorflow.org/. Software available
from tensorflow.org.
Anderson, A. and Gregg, D. Optimal dnn primitive selection with
partitioned boolean quadratic programming. In Proceedings
of the 2018 International Symposium on Code Generation and
Optimization, Vienna, Austria, 2018.
aymericdamien. Tensorflow-examples. URL
https://github.com/aymericdamien.
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao,
T., Xu, B., Zhang, C., and Zhang, Z. Mxnet: A flexible and
efficient machine learning library for heterogeneous distributed
systems. CoRR, abs/1512.01274, 2015.
Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H.,
Cowan, M., Wang, L., Hu, Y., Ceze, L., Guestrin, C., and
Krishnamurthy, A. Tvm: An automated end-to-end optimizing
compiler for deep learning. In Proceedings of Operating
Systems Design and Implementation (OSDI), 2018.
Cooper, K. D., Harvey, T. J., and Kennedy, K. A simple, fast
dominance algorithm.
Ding, C. and Kennedy, K. Improving effective bandwidth through
compiler enhancement of global cache reuse. Journal of
Parallel and Distributed Computing, 64:108–134, 2004.
Guy, B. Programming parallel algorithms. Communications of
the ACM, Volume 39 Issue 3:85–97, 1996.
Jia, Y. Q., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Gir-
shick, R., Guadarrama, G., and Darrell, T. Caffe: Convolu-
tional architecture for fast feature embedding. pp. 675–678,
2014.
Kelley, J. R., Adams, A., Sharlet, D., Barnes, C., Paris, S., Levoy,
M., Amarasinghe, S., and Durand, F. Halide: decoupling algo-
rithms from schedules for high-performance image processing.
Communications of the ACM, Volume 61 Issue 1, 2018.
Kennedy, K. and Allen, J. R. Optimizing Compilers for Modern
Architectures: A Dependence-based Approach. Morgan Kauf-
mann Publishers Inc., San Francisco, CA, USA, 2002. ISBN
1-55860-286-0.
Klöckner, A. Loo.py: transformation-based code generation
for gpus and cpus. CoRR, abs/1405.7470, 2014. URL
http://arxiv.org/abs/1405.7470.
Ma, W. J., Gao, K., and Long, G. P. Highly optimized code gener-
ation for stencil codes with computation reuse for gpus. Jour-
nal of Computer Science and Technology, Volume 31 Issue 6:
1262–1274, 2016.
Moskewicz, M. W., Jannesari, A., and Keutzer, K. Boda: A
holistic approach for implementing neural network computa-
tions. In Proceedings of the Computing Frontiers Conference,
CF’17, pp. 53–62, New York, NY, USA, 2017. ACM. ISBN
978-1-4503-4487-6. doi: 10.1145/3075564.3077382. URL
http://doi.acm.org/10.1145/3075564.3077382.
PAI. Bringing tvm into tensorflow for optimizing
neural machine translation on gpu. 2018. URL
https://tvm.ai/2018/03/23/nmt-transformer-optimize.html.
Spampinato, D. G., Traver, D. F., Bientinesi, P., and Püschel, M.
Program generation for small-scale linear algebra applications.
In Proceedings of the 2018 International Symposium on Code
Generation and Optimization, Vienna, Austria, 2018.
Truong, L., Barik, R., Totoni, E., Liu, H., Markley, C., Fox,
A., and Shpeisman, T. Latte: A language, compiler, and
runtime for elegant and efficient deep neural networks. In
Proceedings of the 37th ACM SIGPLAN Conference on
Programming Language Design and Implementation, PLDI
’16, pp. 209–223, New York, NY, USA, 2016. ACM. ISBN
978-1-4503-4261-2. doi: 10.1145/2908080.2908105. URL
http://doi.acm.org/10.1145/2908080.2908105.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Atten-
tion is all you need. CoRR, abs/1706.03762, 2017. URL
http://arxiv.org/abs/1706.03762.
Wahib, M. and Maruyama, N. Scalable kernel fusion for memory-
bound gpu applications. In Proceedings of SC’14, New Or-
leans, LA, USA, 2014.
Wu, H. C., Diamos, G., Cadambi, S., and Yalamanchili, S. Kernel
weaver: Automatically fusing database primitives for efficient
gpu computation. In Proceedings of 45th Annual IEEE/ACM
International Symposium on Microarchitecture, Vancouver,
BC, Canada, 2012.
Xiong, D. Y., Li, J. H., Branco, A., Kuang, S. H., and Luo,
W. H. Attention focusing for neural machine translation by
bridging source and target embeddings. In Proceedings of the
56th Annual Meeting of the Association for Computational
Linguistics, ACL 2018, Melbourne, Australia, July 15-20,
2018, Volume 1: Long Papers, pp. 1767–1776, 2018. URL
https://aclanthology.info/papers/P18-1164/p18-1164.
Yu, D., Eversole, A., Seltzer, M., Yao, K., Kuchaiev, O., Zhang,
Y., Seide, F., Huang, Z. H., Guenter, B., Wang, H. M., Droppo,
J., Zweig, G., Rossbach, C., Gao, J., Stolcke, A., Currey, J.,
Slaney, M., Chen, G. G., Agarwal, A., Basoglu, C., Padmilac,
M., Kamenev, A., Ivanov, V., Cypher, S., Parthasarathi, M.,
Mitra, B., Peng, B. L., and Huang, X. D. An introduction to
computational networks and the computational network toolkit.
Technical report, 2014.
