FusionStitching: Boosting Memory Intensive Computations for Deep
  Learning Workloads by Zheng, Zhen et al.
Zhen Zheng, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu,
Wenyi Zhao, Lansong Diao, Jun Yang, Wei Lin
Alibaba Group
{james.zz, pengzhan.zpz, guopinglong.lgp, feiwen.zfw, tashuang.zk,
kevin.zwy, lansong.dls, muzhuo.yj, weilin.lw}@alibaba-inc.com
Abstract
We show in this work that memory intensive computations
can result in severe performance problems due to off-chip
memory access and CPU-GPU context switch overheads in a
wide range of deep learning models. For this problem, current
just-in-time kernel fusion and code generation techniques
have limitations, such as kernel schedule incompatibilities
and rough fusion plan exploration strategies. We propose
FusionStitching, a Deep Learning compiler capable of fusing
memory intensive operators, with varied data dependencies
and non-homogeneous parallelism, into large GPU kernels
to reduce global memory access and operation scheduling
overhead automatically. FusionStitching explores large fusion
spaces to decide optimal fusion plans with considerations of
memory access costs, kernel calls and resource usage constraints.
We thoroughly study the schemes to stitch operators
together for complex scenarios. FusionStitching efficiently tunes
the stitching scheme just-in-time with a domain-specific
cost model. Experimental results show that FusionStitching
can reach up to 2.78× speedup compared to TensorFlow and
the current state of the art. Besides these experimental
results, we integrated our approach into a compiler
product and deployed it onto a production cluster for AI
workloads with thousands of GPUs. The system has been
in operation for more than 4 months and saves 7,000 GPU
hours on average for approximately 30,000 tasks per month.
Keywords: deep learning, kernel fusion, code generation
1 Introduction
Recent years have witnessed a surge of industry scale appli-
cations of deep learning models, ranging from images/videos,
text/NLP, to billion scale search and recommendation systems[50].
Such workloads are typically expressed as computation graphs
and mapped to hardware through domain-specific frameworks
(TensorFlow[2], PyTorch[1], MXNet[11], etc.). Despite
the flexibility and expressiveness of modern execution frameworks,
it remains challenging to transform high-level computation
graphs into efficient kernels that maximize the execution
efficiency of the underlying hardware.
Preprint. Under review, 2020
Much current research focuses on dense tensor
computations (GEMM and convolution)[6, 12, 34, 39, 47],
as dense computations dominate the execution time of many
DNN workloads (like CNN[20, 38, 42]). However, recent
advances in deep learning have produced many
novel model structures in which memory intensive patterns
occupy a large proportion of time. (In this paper, we refer
to GEMM and convolution as compute intensive ops, and to
other operators as memory intensive ops, such as element-wise[43],
transpose[45] and reduction[44].) In addition, the
amount of memory intensive operators in modern machine
learning models can be very large, causing notable GPU
kernel launch and framework scheduling overhead. Table 2
contains the collected metrics of various models with Ten-
sorFlow implementation. The execution time of memory
intensive ops can be more than that of compute intensive
ops in some cases, and the kernel calls can be up to 10,406.
For these workloads, optimizing compute intensive ops alone
is inadequate to unlock the full performance potential.
Existing hand-crafted computation libraries, such as
cuDNN/cuBLAS, handle compute intensive ops because the patterns
of these ops are usually stable. However, it is not feasible to
build libraries for flexible and fast-changing memory intensive
patterns. Some code generation frameworks, like TVM[12],
mainly focus on tuning compute intensive ops and do not
address memory intensive ones specifically.
A common approach to address memory intensive pat-
terns is computation fusion, a technique to fuse multiple
ops into a single kernel to reduce off-chip memory accesses.
Prior works have explored the basic idea in AI workloads[8,
12, 28], database[51], image processing[4, 33, 34], and HPC
applications[27, 49]. However, how to efficiently fuse kernels
with unpredictable, varied dependencies and non-homogeneous
parallelism just-in-time (JIT) is still an open problem.
Existing JIT kernel fusion techniques use simple, straightforward
strategies to explore fusion possibilities and thus
lose optimization potential. For memory intensive ops, the
rapidly evolving AI models introduce diverse and complex
combination patterns. Existing works lack the ability to
fuse and optimize complex patterns with irregular dimension
changes (due to varied shapes and layouts of tensors)
and complex inter-thread dependencies (like reduction). This
arXiv:2009.10924v1 [cs.DC] 23 Sep 2020
limitation is imposed by the potential incompatibilities between
the to-be-fused kernels and by the simple fusion pattern
searching strategies of existing techniques.
We propose FusionStitching, a JIT optimization framework
that systematically performs fusion space exploration
and efficiently generates high-performance kernel code.
FusionStitching addresses limitations of current approaches
by composing a large set of ops with diverse and complex
patterns into one GPU kernel. This effectively reduces
off-chip memory accesses and context switch overhead.
FusionStitching addresses two main challenges of aggressive
JIT fusion.
The first challenge is to find the optimal fusion plan given
a complex op graph, that is, to decide which ops should be fused
together. Note that naive composition of multiple computations
may cause notable performance slowdown, as different
portions of the kernel may have conflicting memory layout,
parallelization or on-chip resource requirements[57]. A deep
learning computation graph usually induces a huge search space
of operation combinations. It is not feasible to evaluate
all possible combinations, whose number is up to O(2^V),
where V is the number of ops in the computation graph. We
formulate fusion plan searching as an approximate dynamic
programming problem with complexity of O(V + E),
where E is the number of edges in the graph. The approximate
dynamic programming process produces a limited set
of promising fusion patterns. With a light-weight cost model,
FusionStitching generates the overall fusion plan from these
fusion patterns.
The second challenge is how to generate efficient GPU
kernels just-in-time given an unpredictable, complex fusion plan.
FusionStitching can generate complex fusion patterns consisting
of a broad range of memory intensive ops with various
dimensions, layout characteristics and dependence relationships.
Fusing these computations all together while fully
leveraging computing and memory resources is non-trivial.
We systematically summarize 4 types of op composition
schemes covering the main patterns for memory intensive
ops in machine learning workloads, including independent,
intra-thread, intra-warp and intra-block dependence scenarios.
FusionStitching enumerates schedules for ops in the
target fusion patterns, and tries various launch dimension
and op stitching scheme settings automatically. With a finely
designed cost model, FusionStitching selects the schedule and
parallelism configurations with the best estimated performance
and emits the final GPU executable code.
We use a two-layer cost model in FusionStitching. A light-weight
cost model is designed for fusion pattern exploration,
which faces a large search space, and a more accurate cost
model is applied for code generation, which requires accurate
performance estimation.
We implement FusionStitching in TensorFlow as an alternative
to XLA[28], the state-of-the-art compilation optimization
framework for memory intensive ops. All the optimizations
are opaque to users. FusionStitching supports JIT optimization
for both training and inference.
We evaluate FusionStitching on a set of common machine
learning models, ranging from natural language processing,
OCR, speech recognition to searching and recommendation
models. FusionStitching achieves up to 2.78× speedup com-
pared with TensorFlow and XLA.
In summary, this work makes the following contributions:
• It thoroughly studies kernel composition schemes for
memory intensive ops in machine learning workloads,
and proposes an approach to generate efficient GPU
kernels just-in-time given very complex op composi-
tion patterns.
• It proposes a JIT compilation technique to explore
good fusion plans in the huge search space of op com-
position given a complex op graph, along with a two-
layer cost model of GPU kernels.
• It provides an industry level realization that is opaque
to users and evaluates it with various modern machine
learning models on a production cluster.
2 Motivation
Severe Context Switch Overhead We profile a variety of
machine learning workloads and find context switch over-
head between CPU and GPU a severe problem. Table 2 shows
the breakdown information of a wide range of machine learn-
ing workloads. The count of GPU kernel calls for TensorFlow
implementation can be up to 10,406, which causes severe
kernel launch overhead. Even with XLA, the kernel calls can
still be up to 6,842. As a result, the scheduling and pre-/post-processing
time on CPU introduced by the machine learning
framework dominates the execution time for some models
(like BERT inference, DIEN, ASR and CRNN).
Large Portion of Memory-intensive Ops It is necessary
to pay attention to memory intensive ops specifically. As
shown in Table 2, the execution time of memory intensive
ops can be up to 40% of the overall time for some models. The
global memory traffic time usually occupies a large portion
of the overall execution time of memory intensive ops.
By fusing operations aggressively, with carefully designed
code generation approach, we are able to reduce the CPU-
GPU context switch overhead and leverage the high speed
on-chip memory to transfer intermediate data between ops.
3 Overview
We explore the basic idea with NVIDIA GPU in this paper.
To prevent the combinatorial explosion when searching
for the optimal solution, we divide the optimization process
into two stages: fusion exploration and code generation. The
two stages are conceptually independent but highly related.
[Figure 1 diagram: Computation Graph → Fusion Explorer (sub-plan searching, DP process, cost model I; composes candidate patterns into a fusion plan) → Code Generator (enumerate schedules, dimension setting & stitching decision, kernel estimation, cost model II) → IR Emitter → Kernel Code]
Figure 1. FusionStitching Overview.
As shown in Figure 1, FusionStitching consists of a fusion
explorer and a code generator. The fusion explorer generates possible
fusion candidates and selects the best fusion plan. A
fusion plan describes how operators are grouped together;
each group will eventually be mapped to a single GPU
kernel. The code generator generates a GPU kernel for each group
produced by the fusion explorer. We pre-define a set of schedules
explicitly describing the behaviors of each kind of operator.
FusionStitching enumerates all combinations of the schedules
for ops in a fusion pattern, along with different parallelism
dimensions and data transferring settings. With a cost model
that estimates the performance of each schedule and dimension
setting, FusionStitching selects the best configuration of
the fusion pattern and generates GPU kernel code.
We design a two-level cost model in FusionStitching. The fusion
explorer needs to search in a huge search space and
applies the level-I cost model (Section 4.4), which is fast but less
accurate. Note that fusion exploration does not need fine-grained
estimation of each individual op because the performance
effect of merging different ops together matters more. The code
generator operates on merged GPU kernels and needs more
accurate performance estimation, and thus we apply the level-II
cost model (Section 5.4), which is more accurate but slower.
Reducing global memory transactions is a key factor for operation
fusion. FusionStitching leverages four types of kernel
composition schemes (Section 5.1) to stitch ops together. The four
schemes can support most memory intensive operators.
4 Fusion Exploration
We formulate fusion exploration as a subgraph searching
problem (Section 4.1). Each subgraph is a candidate fusion pattern.
We propose an approximate dynamic programming approach
to search for a set of promising fusion patterns (Section 4.2). The
1) candidate sub-plans (top-3 patterns)
     for vertex 5: {5}, {5,2}, {5,2,1}
     for vertex 6: {6}, {6,3}, {6,3,1}
     for vertex 7: {7}, {7,4}, {7,4,1}
2) divide groups of vertex-8's consumers: {5,6}, {7}
3) candidate patterns considering group {5,6}:
     {8}
     {8,5}, {8,5,2}, {8,5,2,1}
     {8,6}, {8,6,3}, {8,6,3,1}
     {8,5,6}, {8,5,6,3}, {8,5,6,3,1}
     {8,5,2,6}, {8,5,2,6,3}, {8,5,2,6,3,1}
     {8,5,2,1,6}, {8,5,2,1,6,3}
   get top-3: {8,5,2,1,6,3}, {8,5,2,1}, {8,6,3,1}
4) candidate patterns considering group {7}:
     {8}, {8,7}, {8,7,4}, {8,7,4,1}
   get top-3: {8,7,4,1}, {8,7,4}, {8,7}
5) merge 3) and 4) and get overall top-3:
   {8,5,2,1,6,3,7,4}, {8,5,2,1,7,4}, {8,6,3,1,7,4}
[Figure 2 graph: vertices 0 to 9, drawn in rows 0; 1; 2 3 4; 5 6 7; 8; 9]
Figure 2. Fusion exploration case to generate candidate sub-
plans for vertex-8.
searching process is guided by a light-weight domain-specific
cost model (Section 4.4). FusionStitching finally generates the overall
fusion plan by greedily selecting fusion patterns (Section 4.3).
4.1 Fusion Problem Definition
For a computation graph $G = (V, E)$, where $V$ and $E$ are the
sets of vertices and edges respectively, we define a fusion
pattern $P_i = (V_i, E_i)$ as a subgraph of $G$, with $V_i \subseteq V$ and
$E_i \subseteq E$. A fusion plan is a set of disjoint fusion patterns
$S = \{P_0, \cdots, P_{k-1}\}$. We define $f(P_i)$ as the score function of
$P_i$: the higher the performance, the larger $f(P_i)$. The
goal of the computation fusion problem is to find the fusion plan $S$
that maximizes $\sum_{i=0}^{k-1} f(P_i)$.
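The definition above can be made concrete with a minimal Python sketch. It assumes that a fusion pattern is identified by its vertex set alone (its edges are taken to be those induced from G), and `toy_f` is a made-up score used only for illustration, not the cost model of Section 4.4.

```python
from typing import Callable, FrozenSet, Iterable

Pattern = FrozenSet[int]  # a fusion pattern, identified by its vertex ids


def is_valid_plan(plan: Iterable[Pattern]) -> bool:
    """A fusion plan is valid when its patterns are pairwise disjoint."""
    seen = set()
    for pattern in plan:
        if seen & pattern:
            return False
        seen |= pattern
    return True


def plan_score(plan: Iterable[Pattern], f: Callable[[Pattern], float]) -> float:
    """Objective of the fusion problem: the sum of f(P_i) over the plan."""
    return sum(f(p) for p in plan)


def toy_f(pattern: Pattern) -> float:
    """Hypothetical score: one unit of gain per op merged into the kernel."""
    return float(len(pattern) - 1)


plan_a = [frozenset({8, 5, 2}), frozenset({6, 3})]
plan_b = [frozenset({8, 5}), frozenset({5, 2})]  # both contain vertex 5
```

Here `plan_a` is a valid plan, while `plan_b` is rejected because two of its patterns share a vertex.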
4.2 Explore Fusion Patterns
A brute-force way to enumerate all fusion patterns has a complexity
of up to O(2^V). To efficiently navigate the large search space
and find the optimal fusion pattern, we propose
an exploration algorithm based on approximate dynamic
programming with complexity of O(V + E).
The basic idea of fusion exploration is that we generate
candidate sub-plans for each vertex in the graph, and then select
and compose the final fusion plan from these sub-plans. The
candidate sub-plans for vertex Vi are a set of fusion patterns
whose producer node is Vi. We only explore the several top
patterns in the candidate sub-plans for each vertex according to
score function f. We describe how we generate candidate
sub-plans for each vertex in this section and describe how to
compose the final fusion plan in Section 4.3.
Given a computation graph G(V, E), we first obtain a topologically
sorted list of its vertices. We generate candidate sub-plans for vertices in
post-order, from the last vertex to the first vertex.
We describe the approach with the example shown in Figure 2,
where V0 is the root vertex that produces the output
[Figure 3 graph: the graph of Figure 2 (vertices 0 to 9) with an added virtual vertex h]
Figure 3. Fusion merge of independent patterns with remote
vertices.
of the whole graph. Assume that candidate sub-plans for vertices
before V8 have already been generated, and each candidate
sub-plan contains the top 3 patterns. V8 has 3 consumers and we
will generate its candidate sub-plans from its consumers’ information.
A naive approach is to combine V8 with all possible
combinations of all patterns in its consumers’ candidate sub-plans,
and select the top 3 patterns among the combinations.
However, the number of combinations will be huge when the number of
consumers and the top-k setting are large. Instead, we design an
approximate divide-and-conquer process to find the top 3 patterns
with limited complexity, which we call group reduction.
We first divide the consumers of V8 into several groups,
find candidate sub-plans for V8 considering these groups
one by one, and finally compose the final candidate sub-plans by
reducing the results of all groups. We assume the group
size in this case is 2 (group {V5, V6} and group {V7}). For group
{V5, V6}, we enumerate all possible combinations of patterns
in the candidate sub-plans of V5 and V6. Specifically, there are 7
possible combinations between V5 and V6, including the empty
set. We append V8 to each of the 7 combinations and select
the top 3 patterns as the temporary candidates associated
with group {V5, V6}. We get another top 3 patterns considering
group {V7} and select the final top 3 patterns as the candidate
sub-plans for V8 from the above 6 temporary patterns. Note that
we evaluate top patterns according to score function f.
The group reduction process above is applied recursively if the
number of consumers of a vertex is very large. Algorithm 1 shows
the formalized process of group reduction.
After generating candidate sub-plans for all vertices
in the graph, we add one more step to merge independent
patterns into one pattern. As shown in Figure 3, we add
a virtual vertex h as the producer for all vertices and apply
group reduction. We finally get the candidate sub-plans
of Vh, which include the fusion of independent patterns
with remote vertices. Remote vertex fusion helps to
reduce the number of generated kernels and thus the context switch
overhead between CPU and GPU.
Algorithm 1 Fusion Exploration Algorithm
Input: Computation Graph G(V, E)
Output: A set of valid fusion patterns S
procedure Explore(G)
    D ← InitializeBuffer()
    S ← {}
    I ← TopologicalSort(G) // Sorted indices, visited in post-order
    for i in I do
        C ← GetConsumerIndex(G, i)
        D[i] ← GroupReduction(D.select(C), C, i)
        S ← S ∪ D[i]
    end for
    return S
end procedure

procedure GroupReduction(D, C, i)
    if n(C) = 1 then
        return D[0]
    end if
    D∗ ← InitializeBuffer()
    C∗ ← []
    G ← Group(C)
    for j in 0 · · · n(G) − 1 do
        Y ← {{i}}
        for m in G[j] do
            Y ← Y × (D[m] ∪ ∅)
        end for
        Y ← Sort(Y, f) // Sort Y and select first k items
        D∗[j] ← Y.first(k)
        C∗.append(j)
    end for
    return GroupReduction(D∗, C∗, i)
end procedure
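The group reduction step can be sketched in Python as follows. This is an illustrative reading of the procedure under stated assumptions, not the authors' implementation: patterns are plain vertex sets, the score `f` is supplied by the caller (here simply `len`, a stand-in for the level-I cost model), ties are broken deterministically by vertex ids, and validity checks such as cycle detection are omitted.

```python
from itertools import product

def top_k(patterns, f, k):
    """Deduplicate, then keep the k highest-scoring patterns."""
    unique = set(patterns)
    return sorted(unique, key=lambda p: (-f(p), sorted(p)))[:k]


def group_reduction(vertex, consumer_plans, f, k=3, group_size=2):
    """Approximate top-k patterns whose producer is `vertex`.

    consumer_plans holds one candidate list (sub-plan) per consumer.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2 to make progress")
    base = frozenset({vertex})
    if not consumer_plans:
        return [base]
    # Split the consumers' sub-plans into groups of at most group_size.
    groups = [consumer_plans[i:i + group_size]
              for i in range(0, len(consumer_plans), group_size)]
    partial = []
    for group in groups:
        # Each consumer contributes one of its patterns, or nothing.
        choices = [list(plans) + [frozenset()] for plans in group]
        combos = [base.union(*combo) for combo in product(*choices)]
        partial.append(top_k(combos, f, k))
    if len(partial) == 1:
        return partial[0]
    # Reduce the per-group results the same way until one list remains.
    return group_reduction(vertex, partial, f, k, group_size)


# Sub-plans of the three consumers of vertex 8, as listed in Figure 2.
figure2_plans = [
    [frozenset({5}), frozenset({5, 2}), frozenset({5, 2, 1})],
    [frozenset({6}), frozenset({6, 3}), frozenset({6, 3, 1})],
    [frozenset({7}), frozenset({7, 4}), frozenset({7, 4, 1})],
]
result = group_reduction(8, figure2_plans, f=len)
```

With the size-based stand-in score, the best pattern found for vertex 8 covers vertices 1 to 8, which matches the first pattern in step 5 of Figure 2; the remaining two entries depend on the score function and need not match the figure.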
[Figure 4 diagram: nodes A, B and C before fusion; fused node AC and node B after fusion]
Figure 4. Cyclic Dependence: A cyclic dependence occurs
after fusing nodes A and C together
A constraint on fusion patterns is that no cyclic dependence
is allowed. Figure 4 shows an example in which a cyclic
dependence occurs after fusion. FusionStitching discards patterns
with cyclic dependences during the searching process.
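One way to test this constraint is to contract the pattern into a single node and check whether the resulting graph is still acyclic. The sketch below does this with Kahn's algorithm; the function name and the edge-list representation are assumptions made for illustration, and the edges in the example are a guess at the situation drawn in Figure 4.

```python
from collections import defaultdict, deque

def creates_cycle(edges, pattern):
    """Return True if contracting `pattern` into one node makes the DAG cyclic.

    edges is a list of (producer, consumer) pairs; pattern is a set of vertices.
    """
    fused = object()  # unique stand-in for the fused node

    def rep(v):
        return fused if v in pattern else v

    succ = defaultdict(set)
    indeg = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        a, b = rep(src), rep(dst)
        nodes.update((a, b))
        if a != b and b not in succ[a]:
            succ[a].add(b)
            indeg[b] += 1
    # Kahn's algorithm: the graph is acyclic iff every node can be removed.
    ready = deque(n for n in nodes if indeg[n] == 0)
    removed = 0
    while ready:
        n = ready.popleft()
        removed += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return removed != len(nodes)


# Assumed edges for the Figure 4 case: A feeds B and C, and B feeds C.
example_edges = [("A", "B"), ("B", "C"), ("A", "C")]
```

Fusing A and C here yields a node that both produces for and consumes from B, so the pattern would be discarded, while fusing A and B is safe.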
4.3 Generate Overall Fusion Plan
All patterns in the candidate sub-plans of all vertices form a
new set E. FusionStitching greedily selects disjoint fusion
patterns from E to form the overall fusion plan.
Specifically, FusionStitching maintains a set S to store the
final fusion plan. FusionStitching repeatedly selects the fusion pattern in
E that has the highest f score and is disjoint from the patterns already in
S. The resulting S is the final fusion plan. The
remaining vertices not included in S will not be fused.
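The greedy selection can be sketched in a few lines of Python. The sketch assumes patterns are vertex sets and that single-vertex patterns are skipped because they fuse nothing; both the tie-breaking rule and the size-based score in the example are illustrative choices rather than details given in the text.

```python
def greedy_plan(candidates, f):
    """Pick disjoint patterns in decreasing order of score f."""
    plan, used = [], set()
    ordered = sorted(set(candidates), key=lambda p: (-f(p), sorted(p)))
    for pattern in ordered:
        # Skip trivial patterns and anything overlapping a chosen pattern.
        if len(pattern) > 1 and not (used & pattern):
            plan.append(pattern)
            used |= pattern
    return plan


candidate_set = [
    frozenset({8, 5, 2, 1}),
    frozenset({8, 6, 3}),
    frozenset({6, 3}),
    frozenset({7, 4}),
]
chosen = greedy_plan(candidate_set, f=len)
```

In this example {8,6,3} is rejected because vertex 8 is already covered by the higher-scoring {8,5,2,1}, after which {6,3} and {7,4} are both accepted.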
4.4 Fusion Pattern Evaluator: Level-I Cost Model
FusionStitching applies the level-I cost model to form the score
function f. One insight is that we only need to estimate the
performance gain or loss of forming a fusion pattern, and
do not require an accurate estimate of the overall execution
time. With this insight, the score function f represents only the
performance gain or loss of a fusion pattern.
There are three main factors in the level-I cost model: reduced
memory access latency, reduced CPU-GPU context switch
overhead, and the performance penalty of kernel fusion. The
score function f combines these three parts:
$$f = T_{reduced\_mem} + T_{reduced\_calls} - T_{penalty} \tag{1}$$
We estimate reduced memory access latency (T_reduced_mem)
with two factors. The first is the amount of memory traffic
between the operators to be fused. The second is the change
of memory type used to store the intermediate values between
operations. We get the memory traffic amount from
the shapes of the input and output tensors. We build a regression
model to predict the reduced memory access latency when
the memory type changes from global memory to registers or
shared memory, given the memory traffic amount. The
regression model is based on latency data we collected offline
on various GPU vendors with various amounts of data traffic.
The parallel setting of GPU kernels affects memory access
behavior. As the parallel setting is not determined until code
generation, we assign each operator a preferred configura-
tion according to its tensor shape for the cost estimation.
We estimate T_reduced_calls by multiplying the number of eliminated
kernels by a fixed value representing the average CPU-GPU
context switch time we collected.
The performance penalty (T_penalty) mainly comes from
resource constraints on multiprocessors and poor compatibility
between operators. Shared memory is the main resource
constraint that affects performance. We use the maximum
shared memory usage in or between any operators
within a fusion pattern to stand for the overall shared memory
usage in the level-I cost model. We get the upper bound of
the maximum number of concurrent threads from the shared
memory usage of each fusion pattern. When merging operators
reduces the parallelism, we estimate the performance
penalty according to the parallelism and total instruction
count. Moreover, once a fusion pattern exceeds the shared
memory limit, the pattern is marked as invalid. Poor
compatibility between operators arises when their operating
dimensions differ. When an operator on a small tensor is
fused with another operator on a large tensor, there will be
[Figure 5 diagram: ops A and B stitched under Kernel Packing (gmem), Thread Composition (reg), Warp Composition (warp shuffle, reg) and Block Composition (shmem)]
Figure 5. The four types of composition schemes when
stitching ops A and B together. Gmem means global memory,
and different squares indicate different memory addresses. Reg
means register and Shmem means shared memory.
wasted threads if the overall parallelism is accommodated
to the larger tensor. We estimate this kind of performance
penalty according to the reduced parallelism and the total
instruction count, like the effect of shared memory.
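Equation 1 can be illustrated with a small Python sketch. Everything numeric here is hypothetical: the paper fits a regression model from offline measurements, whereas this sketch substitutes a fixed per-byte saving, a fixed context switch cost, and a penalty passed in by the caller.

```python
from dataclasses import dataclass

# Hypothetical constants, stand-ins for values measured offline.
US_PER_BYTE_SAVED = 1e-5   # latency saved per byte kept out of global memory
CONTEXT_SWITCH_US = 5.0    # average CPU-GPU context switch time


@dataclass
class PatternStats:
    intermediate_bytes: int  # traffic between fused ops, from tensor shapes
    num_ops: int             # ops in the pattern, one kernel each before fusion
    penalty_us: float        # estimated slowdown from reduced parallelism


def level1_score(stats: PatternStats) -> float:
    """f = T_reduced_mem + T_reduced_calls - T_penalty (Equation 1)."""
    reduced_mem = US_PER_BYTE_SAVED * stats.intermediate_bytes
    reduced_calls = CONTEXT_SWITCH_US * (stats.num_ops - 1)
    return reduced_mem + reduced_calls - stats.penalty_us


example_score = level1_score(PatternStats(1_000_000, 3, 4.0))
```

Under these made-up constants, fusing three ops that exchange 1 MB of intermediate data saves about 10 µs of memory latency and 10 µs of launch overhead, and the 4 µs penalty leaves a score of about 16.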
5 Code Generation
The code generator takes a fusion pattern as input, and produces
a GPU kernel that implements the fused operators. It is non-trivial
to fuse multiple ops into one high performance GPU
kernel due to various dependence scenarios and parallelism
incompatibilities.
The combination patterns of memory intensive operators
in machine learning workloads are numerous, but the basic kinds of memory
intensive ops are limited (basic element-wise, reduce, broadcast,
scatter/gather, etc.). We pre-define a set of schedules for
each kind of memory intensive op, taking into account
various dimension configurations and memory resource
requirements. The remaining problem of code generation is how to
stitch different operators into one kernel and which schedule
each individual op applies.
We first systematically investigate four kernel composition
schemes (Section 5.1) that cover common execution patterns
for memory intensive operations, and study the stitching decision
given two ops with specific schedules (Section 5.2). We use an
automatic generation solution based on performance modeling
(Section 5.4) to find good schedules and generate code for the
fusion pattern (Section 5.3). The basic idea of code generation for
fused operators is to maximize the overall parallelism and
use high performance memory as much as possible.
5.1 Kernel Composition Schemes
We study four kernel composition schemes, which
capture the main behaviors of common memory intensive ops.
Different schemes indicate different data dependence and
parallel behaviors of the kernels to fuse, ranging from no dependence
to complex cross-thread dependence, and from
uniform parallelism to non-homogeneous computations.
Figure 5 illustrates the four kinds of composition schemes.
Kernel Packing packs computations of ops with no data
dependence. This scheme is instrumental in reducing context
switch overhead of kernel launch and framework scheduling.
It also reduces loop control overheads in instruction level. To
reduce control flow overhead, we perform aggressive loop
fusion [5, 19] to merge as many element-wise ops as possible
into a single loop structure when these ops have the same
parallelism dimension.
Thread Composition fuses data dependent ops within
a local thread context. Intermediate results are transferred
via registers. For modern GPUs with large register files, this
enables composition of many element-wise ops into large
fused kernels to reduce the overhead of global memory access
and CPU-GPU context switch.
Warp Composition fuses operators that have inter-thread
communication at the warp level, which usually occurs in
some special reduction patterns. This scheme employs warp
shuffles to communicate between threads within a warp. A
common case is warp reduction, which applies warp shuffles
to do the reduction and leverages registers to transfer intermediate
results to dependent element-wise ops. It can be applied
to common deep learning building blocks such as softmax,
batchnorm, layernorm structures and their variants.
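As an illustration of the communication pattern behind warp reduction, the following Python sketch simulates a shuffle-down tree reduction over the 32 lanes of a warp. It is a CPU-side model only. The choice of a shuffle-down tree (the pattern supported by CUDA's `__shfl_down_sync`) is an assumption, since the text does not say which shuffle variant the generated kernels use.

```python
WARP_SIZE = 32

def warp_reduce_sum(lane_values):
    """Simulate a shuffle-down tree reduction; returns the value in lane 0."""
    vals = list(lane_values)
    if len(vals) != WARP_SIZE:
        raise ValueError("expected one value per lane")
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane reads the register of lane (i + offset). Lanes whose
        # source is out of range read their own value instead.
        shifted = [vals[i + offset] if i + offset < WARP_SIZE else vals[i]
                   for i in range(WARP_SIZE)]
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]  # after five steps lane 0 holds the full sum
```

Because the partial sums stay in per-lane registers, a dependent element-wise op in the same thread can consume the result without going through shared or global memory, which is the point of this scheme.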
Block Composition unlocks the potential to enable com-
posing non-homogeneous computations into large fused
kernels, as long as these computations can communicate
within block level. It makes use of shared memory to trans-
fer intermediate results. This is a flexible scheme as it allows
different ops to execute in independent schedules in the
fused kernel. This scheme is essential to compose a broad
range of op kinds with various parallelism characteristics
and dependence relationships efficiently.
As far as we know, FusionStitching is the first project to
thoroughly study all of the above composition techniques
for just-in-time compilation of memory intensive operations
in machine learning workloads. Some previous works explored
static thread and block compositions in database[51],
image processing[4, 33, 34], and HPC applications[27, 49].
The TensorFlow XLA[28] framework implements kernel packing
and thread composition. We do not stitch ops that involve
inter-block communication, as it requires global memory
level synchronization and introduces high overhead.
An extra benefit of transferring intermediate data through
on-chip memory is that, it reduces the requirement of GPU
global memory allocation for intermediate data buffering.
This allows users to support large models and large batch-
size training. We leave this as a future study.
5.2 Stitching Decision
We choose the optimal composition scheme when stitching
two operators according to the data dependence relationship.
For independent ops, we do kernel packing and do loop fu-
sion if possible, as described in section 5.1. For one-on-one
[Figure 6 diagram: a chain of Add, Mul, Reduce, Broadcast, Add and Reduce ops, with edges labeled as thread, warp or block composition]
Figure 6. A case of stitching decision. One-on-one depen-
dence between element-wise ops (Add, Mul and Broadcast
in this case) results in thread composition. Intra-warp level
reduce results in warp composition. Intra-block level reduce
results in block composition.
dependence, we apply thread composition to transfer intermediate
data through registers. Some special dependences can
be transformed into one-on-one dependence to leverage thread
composition, such as the broadcast op.
If there is inter-thread dependence, we will apply either
warp composition or block composition. We first analyze the
dependence characteristics of the two ops to check whether
all inter-thread communication can be done within warp
level. If it is possible, we apply warp composition for the
two ops. Otherwise, we apply block composition and transfer
intermediate data through shared memory.
Figure 6 shows an example with different stitching schemes.
It applies warp composition for the first reduction, which involves
only intra-warp communication with its producer.
The second reduction involves block-level communication
and uses block composition. Other stitchings use thread composition.
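The decision rule described in this section can be summarized as a small Python function. The enum names and the boolean `fits_in_warp` flag are hypothetical; in the actual system this information would come from analyzing the schedules of the two ops.

```python
from enum import Enum

class Dependence(Enum):
    NONE = "independent"
    ONE_TO_ONE = "one-on-one"      # includes cases reducible to it, e.g. broadcast
    INTER_THREAD = "inter-thread"  # e.g. reductions


class Scheme(Enum):
    KERNEL_PACKING = "kernel packing"
    THREAD = "thread composition"
    WARP = "warp composition"
    BLOCK = "block composition"


def choose_scheme(dep: Dependence, fits_in_warp: bool = False) -> Scheme:
    """Pick a composition scheme for two ops from their dependence type."""
    if dep is Dependence.NONE:
        return Scheme.KERNEL_PACKING
    if dep is Dependence.ONE_TO_ONE:
        return Scheme.THREAD
    # Inter-thread dependence: stay in registers if one warp suffices,
    # otherwise fall back to shared memory within the block.
    return Scheme.WARP if fits_in_warp else Scheme.BLOCK


first_reduce = choose_scheme(Dependence.INTER_THREAD, fits_in_warp=True)
second_reduce = choose_scheme(Dependence.INTER_THREAD, fits_in_warp=False)
elementwise = choose_scheme(Dependence.ONE_TO_ONE)
```

The three calls mirror the cases in the Figure 6 example: an intra-warp reduction, a block-level reduction, and a one-on-one element-wise edge.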
Different schedules of adjacent ops may result in different
requirements for stitching schemes. FusionStitching decides
the stitching scheme along with schedule selection, which
we describe in Sec. 5.3.
5.3 Kernel Generation
As we mentioned before, we pre-define a set of schedules for
each kind of operator. The various combinations of schedules
for different ops may result in different intermediate data
transfer schemes. Within each combination of schedules, we
then separate the ops into a number of groups. In each group,
the ops with different enumerated schedules are feasible to
work under a unique dominant schedule. Meanwhile, different
schedule combinations and dominant op selections result
in different register and shared memory usage, which affects
the parallelism degree. Heavy usage of registers causes low
occupancy or even register spilling. We cannot determine
operators’ schedules independently, but require a global view
of the fusion pattern.
FusionStitching enumerates all schedule combinations of
ops and dominant op selections in the target fusion pattern.
It then estimates these combinations according to level-II
cost model (Sec. 5.4) and selects the composition with the
highest estimated performance. To prune the search space,
FusionStitching first identifies which ops fall into one-on-one
dependence, including relationships that can be transformed
into one-on-one dependence (like broadcast). Ops with
one-on-one dependence always use the same schedule.
After the pruning process, the enumeration space is small
enough to meet the requirements of just-in-time compilation.
For each enumerated schedule combination that needs to
be estimated, FusionStitching tunes the launch dimensions
and estimates the performance of every trial. FusionStitching
makes the stitching decisions during the tuning trials. It discards
the combinations and launch dimension settings that cannot
be handled by the four composition schemes in Section 5.1.
5.4 Code Generation Evaluator: Level-II Cost Model
The level-II cost model requires a more accurate estimation of
kernel performance than the level-I cost model, at the cost
of slower execution of the model. Fortunately, as
the search space of code generation is not very large, we
can tolerate a relatively slow cost model. For the large-granularity
fusion in FusionStitching, we use a latency model rather
than a throughput model to estimate kernel performance.
This is because the output kernel of FusionStitching is still far
from the throughput performance limit, as it composes many
non-homogeneous computations, even though we have optimized
it a lot. Thus it is hard for a throughput model to estimate
the kernel performance accurately.
We build the level-II cost model as in equation 2, where L
is the estimated execution cycles of a fused kernel.
$$
\begin{aligned}
L &= N_{wave} \times L_{warp} \\
N_{wave} &= \frac{N_{warp}}{Occupancy} \\
L_{warp} &= N_{instruction} \times CPI
\end{aligned}
\tag{2}
$$
N_wave is the number of waves of warps that will be processed
by a GPU card, noting that the warps of a large GPU
kernel will be split into several waves to be executed on
a GPU card, where warps in the same wave execute concurrently.
L_warp is the latency (cycles) a single warp
spends in the fused kernel. The product of N_wave and L_warp
gives the total cycles to execute the fused kernel.
We estimateNwave with the total number of warps to issue
(Nwarp ) and the occupancy of the fused kernel (Occupancy).
Nwarp is decided by the launch dimension. We calculate
Occupancy with launch dimension, shared memory usage
(Sec. 5.5) and estimated register usage. We estimate the reg-
ister usage by analyzing the life time of every intermediate
value and get the maximum register usage for a thread. This
approach is accurate enough for us to calculate Occupancy.
As for Lwarp , we use the reported CPI numbers [21, 22]
and multiply it with the total instruction count (Ninstruction )
we estimated.
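Equation 2 can be evaluated with a few lines of code. The sketch below is a simplified illustration, not the FusionStitching implementation. It assumes that Occupancy counts the warps that can be resident on the whole GPU at once, and it ignores register and shared memory allocation granularity as well as the hardware cap on blocks per SM. The function names, the hardware limits and the instruction count and CPI in the example are illustrative assumptions.

```python
def concurrent_warps(threads_per_block, regs_per_thread, smem_per_block,
                     num_sms, max_warps_per_sm, regs_per_sm, smem_per_sm,
                     warp_size=32):
    """Warps resident on the whole GPU at once (assumed meaning of Occupancy)."""
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    by_warps = max_warps_per_sm // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * warps_per_block * warp_size)
    # A kernel without shared memory is not limited by it.
    by_smem = smem_per_sm // smem_per_block if smem_per_block else by_warps
    blocks_per_sm = min(by_warps, by_regs, by_smem)
    return blocks_per_sm * warps_per_block * num_sms

def estimate_kernel_cycles(n_warp, occupancy, n_instruction, cpi):
    """Equation 2: L = N_wave * L_warp."""
    n_wave = n_warp / occupancy
    l_warp = n_instruction * cpi
    return n_wave * l_warp

# Illustrative launch: 4096 blocks of 256 threads, 32 registers per thread,
# 16 KB of shared memory per block, on a device with 80 SMs.
resident = concurrent_warps(threads_per_block=256, regs_per_thread=32,
                            smem_per_block=16 * 1024, num_sms=80,
                            max_warps_per_sm=64, regs_per_sm=65536,
                            smem_per_sm=96 * 1024)
cycles = estimate_kernel_cycles(n_warp=4096 * 8, occupancy=resident,
                                n_instruction=200, cpi=4.0)
```

In this example shared memory is the limiting resource (6 blocks per SM instead of 8), which is the kind of effect that Sec. 5.5 aims to control.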
5.5 Shared Memory Optimization
It is essential to use shared memory moderately, as heavy
shared memory usage hurts kernel parallelism, especially
for large-granularity compositions. To use as much shared
memory as possible without hurting parallelism, we explore
a dataflow-based shared memory sharing technique.
The insight is that FusionStitching reuses previously allocated
shared memory as much as possible to avoid unnecessary
shared memory allocation.
We use a dominance tree algorithm [13] for shared memory
dataflow analysis. The approach takes a computation graph
and shared memory requests as input, and outputs an
allocation map. To optimize shared space sharing, we traverse
the ops of the computation graph in topological order. When
an op does not need shared space, the previous allocation
information is propagated forward. If an op needs shared
space, we merge the allocation information of all its operands,
test the dominance relation to check whether any previously
allocated space can be shared with the current op, and reuse
that space if possible.
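The traversal above can be sketched as follows. The text does not spell out the exact dominance test, so the reuse rule here is an assumption: an op may take over a buffer when it lies on every path from the buffer's current owner to the root (output) of the fusion, so that every reader of the old contents is computed before the new contents are written. The function name and the example graphs are hypothetical.

```python
def plan_shared_memory(ops, operands, requests):
    """ops: topologically ordered op names, the last one is the root.
    operands: op -> list of operand ops.
    requests: op -> shared memory bytes needed (missing means none)."""
    users = {op: [] for op in ops}
    for op in ops:
        for src in operands.get(op, []):
            users[src].append(op)

    # dom[n]: ops that lie on every path from n to the root.
    dom = {}
    for op in reversed(ops):  # users are visited before their operands
        common = None
        for user in users[op]:
            common = set(dom[user]) if common is None else common & dom[user]
        dom[op] = (common or set()) | {op}

    buffers = []     # each entry: {"size": bytes, "owner": op}
    visible = {}     # op -> buffer indices propagated from its operands
    allocation = {}  # op -> buffer index
    for op in ops:
        seen = set()
        for src in operands.get(op, []):
            seen |= visible[src]  # merge allocation info of all operands
        need = requests.get(op, 0)
        if need:
            # Assumed rule: reuse only if op dominates the current owner.
            reuse = next((i for i in sorted(seen)
                          if buffers[i]["size"] >= need
                          and op in dom[buffers[i]["owner"]]), None)
            if reuse is None:
                buffers.append({"size": need, "owner": op})
                reuse = len(buffers) - 1
            else:
                buffers[reuse]["owner"] = op
            allocation[op] = reuse
            seen.add(reuse)
        visible[op] = seen
    return allocation, [b["size"] for b in buffers]

ops = ["x", "r1", "e1", "r2", "e2"]
requests = {"r1": 512, "r2": 512}
# Chain: r2 lies on every path from r1 to the root, so r2 reuses r1's space.
chain = {"r1": ["x"], "e1": ["r1"], "r2": ["e1"], "e2": ["r2"]}
alloc_chain, sizes_chain = plan_shared_memory(ops, chain, requests)
# Here e2 also reads e1 directly, so r1's space may still be needed after r2.
bypass = {"r1": ["x"], "e1": ["x", "r1"], "r2": ["e1"], "e2": ["e1", "r2"]}
alloc_bypass, sizes_bypass = plan_shared_memory(ops, bypass, requests)
```

In the second graph the test is deliberately conservative: if the element-wise op e1 is recomputed inline by e2, it reads r1's buffer again after r2 has run, so the two reductions keep separate buffers.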
5.6 Computation Reuse Optimizations
Two important optimizations for code generation are
intermediate calculation reuse and memory access index reuse.
The former prevents redundant calculation of intermediate
values. The latter reduces redundant calculation of
memory access indices. The redundant computations mainly
arise because different parts of a fusion kernel may use
different schedules, and previous approaches generate
indices and some intermediate values independently within
each schedule. Before generating the code, FusionStitching
first analyzes the overall index and intermediate value
characteristics and then organizes the output code to reuse
computations and data as much as possible.
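The idea of index reuse can be shown with a toy code emitter. This is a sketch under the assumption that identical index expressions are detected by a simple lookup on their text, which stands in for whatever analysis FusionStitching actually performs; the `Emitter` class and the emitted variable names are hypothetical.

```python
class Emitter:
    """Toy emitter that computes each distinct expression only once."""

    def __init__(self):
        self.lines = []   # emitted statements, in order
        self.cache = {}   # expression text -> variable name

    def emit(self, expr):
        """Return the variable holding expr, emitting it on first use only."""
        if expr not in self.cache:
            name = f"v{len(self.cache)}"
            self.lines.append(f"int {name} = {expr};")
            self.cache[expr] = name
        return self.cache[expr]

emitter = Emitter()
# Two parts of the fused kernel ask for the same linear index.
idx_a = emitter.emit("blockIdx.x * blockDim.x + threadIdx.x")
idx_b = emitter.emit("blockIdx.x * blockDim.x + threadIdx.x")
row = emitter.emit(f"{idx_a} / 1024")
```

Both requests resolve to the same variable, so the index arithmetic appears once in the emitted code even though two differently scheduled parts of the kernel use it.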
6 Implementation
The fusion exploration and code generation techniques we
studied require heavy implementation effort and would be
a burden if left to users. We implement all the techniques we
discussed in the TensorFlow backend. Users can make use of
all the techniques without changing any model script.
Specifically, we implement FusionStitching as a just-in-time
compilation pass in the TensorFlow backend, as a substitute
for the XLA framework. First, we replace the fusion pass in
XLA with the fusion explorer of FusionStitching. Second, we
modify the IR Emitter logic in the original XLA service with
the code generator techniques of FusionStitching. A TensorFlow
program that enters the compilation backend goes
through the FusionStitching process in this way.
7 Evaluation
7.1 Experimental Setup
In this section, we evaluate FusionStitching using a variety of
machine learning applications with different characteristics
in different fields. Table 1 summarizes the various fields of the
evaluated applications and the characteristics of these
applications. These applications range from images (CRNN [37]),
speech (ASR [53]), NLP (Transformer [48], BERT [16]), to
internet scale E-commerce search and recommendation systems
(DIEN [58]). The building blocks of these workloads include
perceptron, attention, convolution, RNN and a broad range
of memory intensive operators.

Table 1. Workloads for evaluation.
Model        Field               Mode       Batch Size
BERT         NLP                 Both       32
DIEN         Recommendation      Both       256
Transformer  NLP                 Training   4096
ASR          Speech Recognition  Inference  8
CRNN         OCR                 Inference  8

[Figure 7. Performance speedup of FusionStitching. Bar chart of
speedup (y-axis, 0 to 3) for TF, XLA and FusionStitching on
BERT-train, BERT-infer, DIEN-train, DIEN-infer, Transformer,
ASR and CRNN.]
To demonstrate the benefits of FusionStitching over previous
work, we compare it with the default TensorFlow
implementation (TF) and with XLA (up-to-date with community
functions), the state of the art in kernel fusion. All evaluation
results are collected on an NVIDIA V100 GPU with 16 GB of
device memory. The server runs Red Hat Enterprise Linux 7.2
with CUDA toolkit 10.2 and cuDNN 7.6.
7.2 End-to-End Performance
We evaluate the speedup of FusionStitching by comparing
the inference cost or the training time of one iteration for TF,
XLA and FusionStitching with the same batch size. During
our tests, the accuracy in each training iteration and the
inference results are the same as with TF and XLA. We
repeat each measurement 10 times and use the average
performance to validate the speedup. For training workloads,
we collect the execution time from the 11th iteration to the
20th (guaranteed to be stable), to avoid the initialization
overhead of the early training iterations.
We show the speedup of FusionStitching in figure 7, where
the execution time of TensorFlow is normalized to 1.
Compared to TensorFlow, our approach achieves up to 2.78×
speedup, with 1.78× on average. Compared to XLA, our
approach achieves up to 4.11× speedup, with 1.86× on average.
Note XLA shows performance degradation for ASR and DIEN,
while FusionStitching does not show negative optimization
in any of these cases.

Table 2. Kernel execution breakdown. TF: naive TensorFlow.
FS: FusionStitching. CPU: the scheduling and pre-/post-
processing metrics on CPU. Math: compute-intensive kernels.
Mem: memory-intensive kernels. Cpy: CUDA memcpy and
memset calls. E2E: the end-to-end time of one iteration (in
milliseconds). T: execution time. #: kernel calls number.
Model        Tech T/#     CPU    Math     Mem     Cpy      E2E
BERT-train   TF   T      1.55   41.69   28.45    0.15    71.84
                  #         -      98     561     102      761
             XLA  T       2.3   41.89    9.56    0.15     53.9
                  #         -      98     200      97      395
             FS   T       2.8   42.11    7.02    0.03    51.96
                  #         -      98      98      20      216
BERT-infer   TF   T      3.24    1.65    0.83    0.14     5.86
                  #         -      70     365     106      541
             XLA  T      0.78    2.50    0.60    0.13     4.02
                  #         -      98     277      94      469
             FS   T      0.59    2.46    0.40    0.04     3.49
                  #         -      98      77      30      205
DIEN-train   TF   T     90.13    7.77   32.54    7.12   137.56
                  #         -    1218   10406    1391    13015
             XLA  T    124.04    9.06   37.50    6.56   177.16
                  #         -    1215    6842    1996    10053
             FS   T     48.42    7.91   35.84    5.55    97.72
                  #         -    1215    2109    1395     4719
DIEN-infer   TF   T     27.36    2.58    7.55    1.99    39.48
                  #         -     406    3680     225     4311
             XLA  T     44.21    2.24    6.12    0.94    53.51
                  #         -     405    2585     627     3617
             FS   T     17.54    2.45    3.51     0.7    24.20
                  #         -     405     815     422     1642
Transformer  TF   T      7.99  109.13   69.53    1.63   188.28
                  #         -     309    3860     724     4893
             XLA  T     23.63  107.48   40.20    4.24   175.55
                  #         -     309    1923    2065     4297
             FS   T      8.21  110.70   42.57    3.05   164.53
                  #         -     243    1384    1765     3392
ASR          TF   T     21.02    2.14    3.63    0.78    27.57
                  #         -     116    1292     534     1942
             XLA  T     17.51    1.66    1.81   19.76    40.74
                  #         -      84     496     376      956
             FS   T      6.00    1.92    1.63    0.36     9.92
                  #         -     108     212     199      519
CRNN         TF   T     23.31    6.05    6.14    1.60    37.10
                  #         -     256    3674     890     4820
             XLA  T     12.17    0.30   11.37    1.04    24.88
                  #         -       7     993     406     1406
             FS   T      6.35    0.31    7.69    1.01    15.36
                  #         -       8     311     388      707
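The reported speedups can be reproduced from the E2E column of Table 2. The short script below is only a consistency check on those published numbers; the variable names are our own.

```python
# End-to-end time of one iteration in milliseconds (Table 2, E2E column),
# given as (TF, XLA, FusionStitching).
e2e_ms = {
    "BERT-train": (71.84, 53.9, 51.96),
    "BERT-infer": (5.86, 4.02, 3.49),
    "DIEN-train": (137.56, 177.16, 97.72),
    "DIEN-infer": (39.48, 53.51, 24.20),
    "Transformer": (188.28, 175.55, 164.53),
    "ASR": (27.57, 40.74, 9.92),
    "CRNN": (37.10, 24.88, 15.36),
}

# Speedup of FusionStitching over each baseline, per workload.
over_tf = {m: tf / fs for m, (tf, xla, fs) in e2e_ms.items()}
over_xla = {m: xla / fs for m, (tf, xla, fs) in e2e_ms.items()}

max_over_tf = max(over_tf.values())
mean_over_tf = sum(over_tf.values()) / len(over_tf)
max_over_xla = max(over_xla.values())
mean_over_xla = sum(over_xla.values()) / len(over_xla)

# Workloads where XLA is slower end-to-end than plain TensorFlow.
xla_regressions = sorted(m for m, (tf, xla, fs) in e2e_ms.items() if xla > tf)
```

The maxima come from ASR in both comparisons (27.57 / 9.92 and 40.74 / 9.92), and the averages are arithmetic means over the seven workloads.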
We also test the inference workloads on an NVIDIA T4 GPU
and get similar speedups.
We apply FusionStitching in production and measure the
performance benefits. FusionStitching saves about 7,000 GPU
hours for about 30,000 tasks per month.
7.3 Performance Breakdown
Table 2 shows the kernel breakdown information, including
the execution time (T) of memory intensive ops (Mem), compute
intensive ops (Math), CPU time (CPU, kernel launch and
framework scheduling), CUDA memcpy/memset activities
(Cpy) and the number of kernel calls (#). Note that the
breakdown profiling process is different from the process used
to measure the end-to-end performance. This is because
profiling introduces some overhead and makes the end-to-end
time inaccurate.
Before we analyze the effect of our technique, we need to
point out that XLA affects the behavior of matrix operations
(GEMM and GEMV). It tends to fuse GEMVs into a GEMM,
and GEMMs into a larger GEMM, when the matrices share
a common input. Some other algebraic transformations and
loop-invariant code motion also reduce the GEMM count. These
account for the difference in GEMM count in Table 2.
Meanwhile, XLA affects the runtime behavior of TensorFlow
and changes the number of CUDA memcpy/memset activities,
sometimes increasing and sometimes decreasing it.
FusionStitching's implementation is based on XLA
and exhibits the same behavior.
According to Table 2, we have the following observations.
Reduced context switch overhead. FusionStitching
effectively reduces the memory intensive kernel calls of all
workloads, which results in reduced kernel launch and framework
scheduling overhead. As shown in Table 2, the number of
memory intensive kernel calls with FusionStitching is 40.7%
of that with XLA on average, ranging from 27.8% to 72.0%. The
CPU time difference in Table 2 indicates the time saved by
the decrease in kernel calls and CUDA memcpy/memset
activities. FusionStitching achieves up to 65.7% saving of the
CPU time compared with XLA, 54.1% on average.
FusionStitching incurs fewer CUDA memcpy/memset activities
than XLA due to larger kernel granularity, a 39.5% decrease on
average.
Take DIEN-train as an example: the number of kernel calls for
memory intensive ops is 2109 with FusionStitching and 6842
with XLA. Meanwhile, the number of CUDA memcpy/memset
activities is reduced to 1395, compared to 1996 with XLA. The
final CPU time with FusionStitching is significantly less than
with both TF and XLA, thanks to the reduced kernel calls. Note
that XLA increases CUDA memcpy/memset activities and suffers
a severe performance drop here. FusionStitching avoids the
increased memcpy/memset calls due to larger kernel
granularity and does not suffer from this drawback. DIEN-infer
behaves similarly. As for Transformer, both XLA and
FusionStitching suffer from increased memcpy/memset calls.
However, FusionStitching suffers less than XLA and enjoys
the benefit of reduced memory intensive kernel calls, and
thus the CPU time does not increase as much as with XLA.
Optimizing kernel fusions while considering runtime behaviors
(like memcpy activities) could be a future research topic.

[Figure 8. The fusion pattern difference of XLA and
FusionStitching for Layer Normalization.
XLA: xla-fusion.7, dim[128,1024] (ele-wise ops, reduce);
xla-fusion.3, dim[128,1024] (ele-wise ops, reduce);
xla-fusion.2, dim[128] (ele-wise ops);
xla-fusion.1, dim[128,1024] (ele-wise ops).
FusionStitching: stitching-fusion, dim[128,1024] (ele-wise ops,
reduce, ele-wise ops, reduce, ele-wise ops).]
Reduced memory intensive op execution time.
FusionStitching reduces the total execution time of
memory-intensive operations. The speedup of memory-intensive
ops over XLA is 1.3× on average across all workloads, and up to
1.74×. The speedup mainly comes from reduced
global memory access. By fusing memory-intensive operations
aggressively, the intermediate values can be cached in
registers and shared memory.
We measure the global memory traffic of memory intensive
ops for CRNN. XLA reads 667.6 MB from global memory,
while FusionStitching reduces this to 225.8 MB.
About 66% of the global memory access of memory intensive
ops is eliminated. The execution time of all memory intensive
computations thus achieves a 1.48× speedup over XLA.
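Both CRNN figures follow from the measured traffic and from the Mem column of Table 2 (11.37 ms with XLA, 7.69 ms with FusionStitching). The symbols for the two execution times are our own notation, introduced only for this check:

```latex
\[
1 - \frac{225.8\ \text{MB}}{667.6\ \text{MB}} \approx 0.66,
\qquad
\frac{T_{Mem}^{XLA}}{T_{Mem}^{FS}} = \frac{11.37}{7.69} \approx 1.48
\]
```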
Overall, FusionStitching supports more complex
fusion patterns than XLA with effective kernel generation,
which relaxes the fusion conditions and thus reduces the
final kernel count and intermediate global memory
transactions. With well controlled performance estimation and
fewer runtime memcpy activities than XLA, FusionStitching
is less likely to produce cases where optimization hurts
performance.
7.4 Fusion Pattern Analysis
We show the fusion pattern difference between XLA and
FusionStitching with Layer Normalization, which is a
common component in deep learning models. Figure 8 shows
the fusion results of XLA and FusionStitching.
XLA forms four different fusion kernels. Two factors
prevent XLA from fusing operators at a larger
granularity in this case. The first is cross-thread dependence
(the reduce op in this case). The second is varied parallelism
dimensions (like between xla-fusion.2 and xla-fusion.1).
In contrast, FusionStitching finds notable potential for larger
granularity fusion in this case and fuses all operators into
one kernel. In this way, the intermediate global memory
transactions and CPU-GPU context switches are avoided.
We collect the performance of all kernels of the two
versions. FusionStitching achieves a speedup of 1.23×. Note
that we do not count the context switch overhead of the 4
kernels in the XLA version, which would further hurt its
performance.
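For reference, the op sequence of Layer Normalization can be written out for one row at a time. This plain-Python sketch only illustrates why the pattern alternates between element-wise and reduce steps; it is not the generated GPU kernel. Figure 8 does not list the individual ops, so the mapping in the comments is an assumption, and the learned scale and shift parameters are omitted.

```python
import math

def layer_norm_rows(x, eps=1e-5):
    """Normalize each row of x (a list of rows) to zero mean, unit variance."""
    out = []
    for row in x:
        n = len(row)
        mean = sum(row) / n                          # reduce over the row
        centered = [v - mean for v in row]           # ele-wise
        var = sum(c * c for c in centered) / n       # ele-wise, then reduce
        inv_std = 1.0 / math.sqrt(var + eps)         # ele-wise on reduced value
        out.append([c * inv_std for c in centered])  # ele-wise
    return out

x = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 4.0, 8.0]]
y = layer_norm_rows(x)
```

For an input of shape [128, 1024], the two reductions produce values of shape [128] that every element of the same row consumes afterwards, which is the cross-thread dependence and the change of parallelism dimension discussed above.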
7.5 Overhead Analysis
FusionStitching is designed for tune-once-run-many-times
scenarios, which are characteristic of deep learning
workloads. For training that can take up to several
days, FusionStitching only needs to run in the first training
iteration. For an inference task, the executable kernels can
be prepared ahead-of-time. This is similar to XLA.
We measure the one-time overhead introduced by
FusionStitching compared to XLA for training. The value is the
JIT compilation time of FusionStitching minus that of XLA.
Results show that the extra overhead is less than 30 minutes
for the workloads we evaluated in this paper, which is far
less than the overall training time.
A limitation of both FusionStitching and XLA is that they
cannot handle dynamic shapes, which appear in some deep
learning workloads, with low tuning overhead. The reason is
that the design of the XLA service framework is not friendly
to dynamic shapes, and FusionStitching is implemented on top
of the XLA service framework. This implementation problem
does not affect the insight that FusionStitching shows.
7.6 Discussion about TVM
TVM [12, 56] is a popular framework that mainly targets the
optimization of compute intensive operators. People can provide
schedules and TVM tunes the parameters automatically.
Recent TVM development adds automatic schedule generation
and limited fusion support.
However, the optimization problem of memory intensive
op fusion is not the schedule of a single operator itself,
but how to group ops into fusion patterns and stitch them
together into a single kernel efficiently.
8 Related Work
GPU kernel fusion, inspired by classical loop optimizations [5,
18, 19], is known to boost performance in many application
domains. In the database domain, Wu et al. [51] propose
transformations to fuse the execution of multiple operators into
a single kernel. In the HPC domain, Wahib et al. [49] formulate
GPU kernel fusion as a combinatorial search problem,
and search the solution space for an optimized fused kernel.
In the image processing domain, recent works [32, 33] formulate
image pipeline fusion as a graph cut problem. For
machine learning workloads, Ashari et al. [8] propose a kernel
fusion technique to generate efficient kernels for a specific
computation pattern. Li et al. [29] explore horizontal fusion
for GPU kernels to increase thread-level parallelism.
Appleyard et al. [7] and Diamos et al. [17] study kernel fusion
techniques to speed up RNN workloads. Abdolrashidi et al. [3]
propose a priority-based fusion approach and learn fusion
strategies with the Proximal Policy Optimization [36] algorithm.
Astra [40] supports GEMM and basic element-wise op fusion.
The XLA compilation framework [28] can handle more
general computation patterns, but offers only basic capability
for fusion and kernel generation with empirical rules.
There are recent advances in code generation for compute
intensive DNN layers. TVM [12], Ansor [56] and Halide [34]
are capable of generating high performance kernels with well
designed schedules. Ansor [56] also explores kernel fusion
with a tuning approach, with limited patterns supported. TC [47],
Glow [35] and Tiramisu [9] can generate accelerator kernels
given a computation graph. Boda [31] is a code generator that
targets mobile platforms. Latte [46] is a DSL system allowing
users to specify, synthesize and optimize code. SLINGEN [41]
is another DSL system which takes mathematical
specifications and generates optimized C functions for linear
algebra operators. Anderson et al. [6] propose a solution for
selecting fast kernel implementations in the global context by
formulating a PBQP problem. Kim et al. [26] propose a GPU code
generator for arbitrary tensor contractions. Cowan et al. [14]
study the generation of quantized kernels. These works
are relevant but complementary to ours.
At the computation graph level, Jia et al. [23, 24] explore graph
substitutions to optimize the graph with equivalent
transformations. FasterTransformer [10] provides hand tuned
libraries for common components in Transformer.
Performance models for GPUs and other accelerators are
another research topic [15, 25, 30, 52, 54, 55]. Yet we design
a new cost model system with the consideration of fusion
and code generation requirements.
9 Conclusion
This work tackles the problem of optimizing memory
intensive operators in machine learning workloads. We show
that memory intensive operators are vital to the end-to-end
performance of various deep learning models. We propose
FusionStitching, which fuses operators with complex
dependence and non-homogeneous parallelism just-in-time to
reduce memory access and context switch overhead.
FusionStitching consists of a fusion explorer and a code generator.
The fusion explorer selects candidate fusion patterns from the
large search space with appropriate computing complexity,
and produces a fusion plan with promising expected
performance. The code generator stitches operators with on-chip
memory as much as possible and tunes the schedules to emit
high performance GPU code for a given fusion pattern. A
two-layer cost model guides the searching and tuning
process of FusionStitching. Results show that FusionStitching
outperforms TensorFlow and state-of-the-art deep learning
fusion techniques with up to 2.78× speedup. FusionStitching
has been deployed to a production cluster, where it has been
running for more than 4 months and saves about 7,000 GPU
hours for about 30,000 tasks per month on average.
References
[1] 2015. Torch NN. https://github.com/torch/nn
[2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng
Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu
Devin, et al. 2016. Tensorflow: Large-scale machine learning on het-
erogeneous distributed systems.
[3] Amirali Abdolrashidi, Qiumin Xu, Shibo Wang, Sudip Roy, and Yanqi
Zhou. 2019. Learning to Fuse. NeurIPS (2019).
[4] Asif M Adnan, Sridhar Radhakrishnan, and Suleyman Karabuk. 2015.
Efficient Kernel Fusion Techniques for Massive Video Data Analysis
on GPGPUs.
[5] Randy Allen and Ken Kennedy. 2002. Optimizing compilers for modern
architectures: a dependence-based approach. Taylor & Francis US.
[6] Andrew Anderson and David Gregg. 2018. Optimal DNN primitive
selection with partitioned boolean quadratic programming. In Pro-
ceedings of the 2018 International Symposium on Code Generation and
Optimization. 340–351.
[7] Jeremy Appleyard, Tomas Kocisky, and Phil Blunsom. 2016. Optimizing
performance of recurrent neural networks on gpus. arXiv preprint
arXiv:1604.01946 (2016).
[8] Arash Ashari, Shirish Tatikonda, Matthias Boehm, Berthold Reinwald,
Keith Campbell, John Keenleyside, and P Sadayappan. 2015. On opti-
mizing machine learning workloads via kernel fusion. ACM SIGPLAN
Notices 50, 8, 173–182.
[9] Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele
Del Sozzo, Patricia Suriana, Abdurrahman Akkas, Shoaib Kamil, Yun-
ming Zhang, and Saman Amarasinghe. [n.d.]. Tiramisu: A Polyhedral
Compiler with A Scheduling Language for Targeting High Perfor-
mance Systems. ([n. d.]).
[10] Ciprian Chelba, Mia Chen, Ankur Bapna, and Noam Shazeer. 2020.
Faster Transformer Decoding: N-gram Masked Self-Attention. arXiv
preprint arXiv:2001.04589 (2020).
[11] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang,
Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015.
MXNet: A Flexible and Efficient Machine Learning Library for Hetero-
geneous Distributed Systems. CoRR abs/1512.01274 (2015).
[12] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan,
Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze,
et al. 2018. TVM: An automated end-to-end optimizing compiler
for deep learning. In 13th USENIX Symposium on Operating Systems
Design and Implementation (OSDI 18). 578–594.
[13] Keith D Cooper, Timothy J Harvey, and Ken Kennedy. 2001. A simple,
fast dominance algorithm. 8 pages.
[14] Meghan Cowan, Thierry Moreau, Tianqi Chen, James Bornholt, and
Luis Ceze. 2020. Automatic generation of high-performance quantized
machine learning kernels. In Proceedings of the 18th ACM/IEEE Inter-
national Symposium on Code Generation and Optimization. 305–316.
[15] Zheng Cui, Yun Liang, Kyle Rupnow, and Deming Chen. 2012. An
accurate GPU performance model for effective control flow divergence
optimization. In 2012 IEEE 26th International Parallel and Distributed
Processing Symposium. IEEE, 83–94.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
2018. Bert: Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805 (2018).
[17] Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski,
Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev
Satheesh. 2016. Persistent rnns: Stashing recurrent weights on-chip.
In International Conference on Machine Learning. 2024–2033.
[18] Chen Ding and Ken Kennedy. 2004. Improving effective bandwidth
through compiler enhancement of global cache reuse. J. Parallel and
Distrib. Comput. 64, 1 (2004), 108–134.
[19] G Gao, Russ Olsen, Vivek Sarkar, and Radhika Thekkath. 1992. Collec-
tive loop fusion for array contraction. In International Workshop on
Languages and Compilers for Parallel Computing. Springer, 281–295.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
residual learning for image recognition. (2016), 770–778.
[21] Zhe Jia, Marco Maggioni, Jeffrey Smith, and Daniele Paolo Scarpazza.
2019. Dissecting the NVidia Turing T4 GPU via microbenchmarking.
arXiv preprint arXiv:1903.07486 (2019).
[22] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza.
2018. Dissecting the nvidia volta gpu architecture via microbench-
marking. arXiv preprint arXiv:1804.06826 (2018).
[23] Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Za-
haria, and Alex Aiken. 2019. TASO: optimizing deep learning computa-
tion with automatic generation of graph substitutions. In Proceedings
of the 27th ACM Symposium on Operating Systems Principles. 47–62.
[24] Zhihao Jia, James Thomas, Todd Warszawski, Mingyu Gao, Matei
Zaharia, and Alex Aiken. 2019. Optimizing dnn computation with
relaxed graph substitutions. In Proceedings of the 2nd Conference on
Systems and Machine Learning (SysML'19).
[25] Samuel J Kaufman, Phitchaya Mangpo Phothilimthana, Yanqi Zhou,
and Mike Burrows. 2020. A Learned Performance Model for the Tensor
Processing Unit. arXiv preprint arXiv:2008.01040 (2020).
[26] Jinsung Kim, Aravind Sukumaran-Rajam, Vineeth Thumma, Sriram
Krishnamoorthy, Ajay Panyala, Louis-Noël Pouchet, Atanas Roun-
tev, and Ponnuswamy Sadayappan. 2019. A code generator for high-
performance tensor contractions on gpus. In 2019 IEEE/ACM Interna-
tional Symposium on Code Generation and Optimization (CGO). IEEE,
85–95.
[27] Matthias Korch and Tim Werner. 2018. Accelerating explicit ODE
methods on GPUs by kernel fusion. Concurrency and Computation:
Practice and Experience 30, 18 (2018), e4470.
[28] Chris Leary and Todd Wang. 2017. XLA: TensorFlow, compiled.
[29] Ao Li, Bojian Zheng, Gennady Pekhimenko, and Fan Long. 2020.
Automatic Horizontal Fusion for GPU Kernels. arXiv preprint
arXiv:2007.01277 (2020).
[30] Sangkug Lym, Donghyuk Lee, Mike O’Connor, Niladrish Chatterjee,
and Mattan Erez. 2019. DeLTA: GPU Performance Model for Deep
Learning Applications with In-depth Memory System Traffic Analy-
sis. In 2019 IEEE International Symposium on Performance Analysis of
Systems and Software (ISPASS). IEEE, 293–303.
[31] Matthew W Moskewicz, Ali Jannesari, and Kurt Keutzer. 2017. Boda:
A holistic approach for implementing neural network computations.
In Proceedings of the Computing Frontiers Conference. 53–62.
[32] Bo Qiao, Oliver Reiche, Frank Hannig, and Jürgen Teich. 2018. Auto-
matic kernel fusion for image processing DSLs. In Proceedings of the
21st International Workshop on Software and Compilers for Embedded
Systems. 76–85.
[33] Bo Qiao, Oliver Reiche, Frank Hannig, and Jürgen Teich. 2019. From
loop fusion to kernel fusion: a domain-specific approach to locality
optimization. In 2019 IEEE/ACM International Symposium on Code
Generation and Optimization (CGO). IEEE, 242–253.
[34] Jonathan Ragan-Kelley, Andrew Adams, Dillon Sharlet, Connelly
Barnes, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo
Durand. 2017. Halide: decoupling algorithms from schedules for high-
performance image processing. Commun. ACM 61, 1 (2017), 106–115.
[35] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer
Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele,
Roman Levenstein, et al. 2018. Glow: Graph lowering compiler tech-
niques for neural networks. arXiv preprint arXiv:1805.00907 (2018).
[36] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and
Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv
preprint arXiv:1707.06347 (2017).
[37] Baoguang Shi, Xiang Bai, and Cong Yao. 2016. An end-to-end trainable
neural network for image-based sequence recognition and its applica-
tion to scene text recognition. IEEE transactions on pattern analysis
and machine intelligence 39, 11 (2016), 2298–2304.
[38] Karen Simonyan and Andrew Zisserman. 2014. Very deep convo-
lutional networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556 (2014).
[39] Muthian Sivathanu, Tapan Chugh, Sanjay S Singapuram, and Lidong
Zhou. 2019. Astra: Exploiting predictability to optimize deep learning.
In Proceedings of the Twenty-Fourth International Conference on Archi-
tectural Support for Programming Languages and Operating Systems.
909–923.
[40] Muthian Sivathanu, Tapan Chugh, Sanjay S Singapuram, and Lidong
Zhou. 2019. Astra: Exploiting predictability to optimize deep learning.
In Proceedings of the Twenty-Fourth International Conference on Archi-
tectural Support for Programming Languages and Operating Systems.
909–923.
[41] Daniele G Spampinato, Diego Fabregat-Traver, Paolo Bientinesi, and
Markus Püschel. 2018. Program generation for small-scale linear
algebra applications. In Proceedings of the 2018 International Symposium
on Code Generation and Optimization. 327–339.
[42] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
Zbigniew Wojna. 2016. Rethinking the inception architecture for
computer vision. (2016), 2818–2826.
[43] TensorFlow team. [n.d.]. tf.math.multiply. https://www.tensorflow.
org/api_docs/python/tf/math/multiply
[44] TensorFlow XLA team. [n.d.]. HLO Reduce Operation. https://www.
tensorflow.org/xla/operation_semantics#reduce
[45] TensorFlow XLA team. [n.d.]. HLO Transpose operation. https:
//www.tensorflow.org/xla/operation_semantics#transpose
[46] Leonard Truong, Rajkishore Barik, Ehsan Totoni, Hai Liu, Chick
Markley, Armando Fox, and Tatiana Shpeisman. 2016. Latte: a lan-
guage, compiler, and runtime for elegant and efficient deep neural
networks. In Proceedings of the 37th ACM SIGPLAN Conference on
Programming Language Design and Implementation. 209–223.
[47] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya
Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew
Adams, and Albert Cohen. 2018. Tensor comprehensions: Framework-
agnostic high-performance machine learning abstractions.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. At-
tention is all you need. In Advances in neural information processing
systems. 5998–6008.
[49] Mohamed Wahib and Naoya Maruyama. 2014. Scalable kernel fusion
for memory-bound GPU applications. In SC’14: Proceedings of the
International Conference for High Performance Computing, Networking,
Storage and Analysis. IEEE, 191–202.
[50] Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao,
and Dik Lun Lee. 2018. Billion-scale commodity embedding for e-
commerce recommendation in alibaba. In Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discovery & Data Min-
ing. 839–848.
[51] Haicheng Wu, Gregory Diamos, Srihari Cadambi, and Sudhakar Yala-
manchili. 2012. Kernel weaver: Automatically fusing database prim-
itives for efficient gpu computation. In 2012 45th Annual IEEE/ACM
International Symposium on Microarchitecture. IEEE, 107–118.
[52] Xuan Yang, Zhengrui Zhang, Guoliang Chen, Rui Mao, et al. 2019.
A performance model for GPU architectures that considers on-chip
resources: application to medical image registration. IEEE Transactions
on Parallel and Distributed Systems 30, 9 (2019), 1947–1961.
[53] Dong Yu and Li Deng. 2016. AUTOMATIC SPEECH RECOGNITION.
Springer.
[54] Xiuxia Zhang, Guangming Tan, Shuangbai Xue, Jiajia Li, Keren Zhou,
and Mingyu Chen. 2017. Understanding the gpu microarchitecture to
achieve bare-metal performance tuning. In Proceedings of the 22nd ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming.
31–43.
[55] Yao Zhang and John D Owens. 2011. A quantitative performance
analysis model for GPU architectures. In 2011 IEEE 17th international
symposium on high performance computer architecture. IEEE, 382–393.
[56] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu,
Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen,
et al. 2020. Ansor: Generating High-Performance Tensor Programs for
Deep Learning. arXiv preprint arXiv:2006.06762 (2020).
[57] Zhen Zheng, Chanyoung Oh, Jidong Zhai, Xipeng Shen, Youngmin Yi,
and Wenguang Chen. 2017. Versapipe: a versatile programming frame-
work for pipelined computing on GPU. In 2017 50th Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO). IEEE, 587–599.
[58] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou,
Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network
for click-through rate prediction. In Proceedings of the AAAI conference
on artificial intelligence, Vol. 33. 5941–5948.
