On-Device Neural Net Inference with Mobile GPUs by Lee, Juhyun et al.
arXiv:1907.01989v1 [cs.DC] 3 Jul 2019
On-Device Neural Net Inference with Mobile GPUs
Juhyun Lee, Nikolay Chirkov, Ekaterina Ignasheva, Yury Pisarchyk, Mogan Shieh,
Fabio Riccardi, Raman Sarokin, Andrei Kulik, and Matthias Grundmann
Google Research
1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA
{impjdi,chirkov,eignasheva,ypisarchyk,moganshieh,fricc,sorokin,akulik,grundman}@google.com
Abstract
On-device inference of machine learning models for mo-
bile phones is desirable due to its lower latency and in-
creased privacy. Running such a compute-intensive task
solely on the mobile CPU, however, can be difficult due to
limited computing power, thermal constraints, and energy
consumption. App developers and researchers have begun
exploiting hardware accelerators to overcome these chal-
lenges. Recently, device manufacturers are adding neural
processing units into high-end phones for on-device infer-
ence, but these account for only a small fraction of hand-
held devices. In this paper, we present how we leverage the
mobile GPU, a ubiquitous hardware accelerator on virtu-
ally every phone, to run inference of deep neural networks
in real-time for both Android and iOS devices. By describ-
ing our architecture, we also discuss how to design net-
works that are mobile GPU-friendly. Our state-of-the-art
mobile GPU inference engine is integrated into the open-
source project TensorFlow Lite and publicly available at
https://tensorflow.org/lite.
1. Introduction
On-device machine learning (ML) offers a variety of
benefits. The most apparent is the improved inference la-
tency: By skipping the data upload to the server and wait-
time for the inference result, the app can respond more
quickly to the user’s request. Removing the server depen-
dency has additional benefits, such as:
• Removing the need to maintain inference servers,
• Running with limited or no connectivity, and
• Reducing privacy concerns as the user data remains on
the device.
However, on-device ML is not trivial. Despite both re-
cent advances in mobile hardware technology and efforts
to efficiently run deep networks on mobile devices, mo-
bile CPUs continue to be less powerful than those found
in servers. Running deep net inference on a mobile de-
vice means adding a significant compute-intensive task to
the CPU which competes with existing logic. Fully utiliz-
ing the mobile CPU comes with additional unwanted costs,
e.g. increased energy consumption leads to shorter battery
life and an increase in the phone’s thermal profile causes
throttling resulting in slower computation.
Hardware accelerators such as digital signal processors
offer solutions to overcome these challenges. The demand
for on-device ML has led phone manufacturers to integrate
dedicated neural processing units (NPUs) into high-end
next-generation phones, but these account for only a small
fraction of the current distribution of mobile devices.
Our primary goal is a fast inference engine with wide
coverage for TensorFlow Lite (TFLite) [8]. By leveraging
the mobile GPU, a ubiquitous hardware accelerator on vir-
tually every phone, we can achieve real-time performance
for various deep network models. Table 1 demonstrates that
GPU has significantly more compute power than CPU.
Device               CPU (FP32)   GPU (FP16)
Samsung Galaxy S5    79           300
Samsung Galaxy S7    124          730
Samsung Galaxy S9    270          730
Table 1. Example of available compute power on mobile devices in
gigaflops (billions of floating point operations per second). FP16 and
FP32 refer to 16- and 32-bit floating point arithmetic, respectively.
This paper presents the techniques we adopt for TFLite
GPU and how we achieve an average acceleration of 2–9×
for various deep networks on GPU compared to CPU inference.
We first describe the general mobile GPU architecture
and GPU programming, followed by how we materialize
this with Compute Shaders for Android devices with
OpenGL ES 3.1+ [16] and Metal Shaders for iOS devices
with iOS 9+ [1].
2. Related Work
Various research efforts from both academia and industry
endeavor to bring deep neural network inference, previously
limited to servers, to mobile devices. Those
efforts can be roughly categorized into three strategies:
• Network architecture-driven,
• Hardware-driven, and
• ML framework-driven.
Neural network researchers have focused on optimiz-
ing their network architectures explicitly for processing
on-device in various domains such as image classifica-
tion [10, 21], object localization [11], and image enhance-
ments [13, 14]. Many of these techniques involve reduc-
ing the model size by re-designing the network architec-
ture and adding pre-/post-training quantization of weights.
With these, one can achieve faster computation and smaller
memory footprint, leading to reduced inference latency at
the cost of slightly degraded model accuracy. MorphNet [9]
takes a unique path of reducing the number of floating point
operations, which is optimized during training of
the model. Our work is complementary to these efforts and
instead focuses on optimizing the inference engine that runs
the neural network rather than the model or training.
Major hardware manufacturers have made architectural
changes responding to demands for faster mobile inference,
and are publishing software development kits (SDKs) to
expose those: Arm Compute Library [4], Huawei HiAI
SDK [12], MediaTek NeuroPilot SDK [17], and Qual-
comm SNPE SDK [20]. These libraries are vendor-specific
and either cannot be re-used on a different architecture
or do not guarantee the expected performance boost on
other platforms. Our work does not add new hardware or
SDKs. Instead, we use well-established hardware, the mobile
GPU, and well-supported graphics and compute standards
such as OpenGL [16] and Metal [1], to achieve
high-performance neural network inference.
Apple presented the Metal Performance Shaders with
support of convolutional neural networks [3] accelerated by
GPU. This is a solution built on top of the Metal API and al-
lows custom operations. Our approach is analogous to Ap-
ple’s on iOS devices. Apple also released CoreML [2], an
end-to-end solution for inference on mobile devices using
CPU, GPU, and NPU, if available.
Android introduced the Android Neural Networks API
[7] that serves as a layer between hardware and higher-level
ML frameworks that vendors must implement for Android
8.1 or later. Our work has wider coverage and does not
depend on a specific Android version, or require vendors to
implement individual APIs for deep network processing.
Some of the latest mobile-friendly ML frameworks are:
• Caffe2 [6] which focuses on CPU inference and uses
Arm Compute Library for Arm Mali GPUs.
• MACE [24] which employs OpenCL which is not a
part of standard Android OS.
TFLite GPU leverages the mobile GPU with OpenGL ES
for Android devices and Metal for iOS devices. The specific
version requirements are OpenGL ES 3.1+, which is
available on more than 52% of all Android devices [23],
and iOS 9+. One of our biggest strengths is that our framework
employs open standards, i.e. it is not limited to a specific
hardware vendor, and thus covers a wide range of devices.
3. General Architecture
This section explains the general architecture of TFLite
GPU, consisting of an initialization phase followed by a
model inference phase. The techniques in this section are
independent of the architecture of the underlying GPU.
3.1. Initialization
TFLite provides APIs for the delegation of the execution
of neural network sub-graphs to another library. We ex-
ploit this feature to integrate the GPU backend into TFLite.
Given a neural net model, TFLite first checks whether it can
execute all the operators in the model with our GPU dele-
gate. Our GPU backend identifies supported operators, and
TFLite then partitions the graph into several sub-graphs,
substituting the sub-graphs with virtual “delegate nodes”.
From that point, the GPU backend is responsible for exe-
cuting this sub-graph, as depicted in Figure 1. Unsupported
operators are by default computed by the CPU. Ideally, the
whole graph would be compatible with our mobile GPU
backend for maximum performance.
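The partitioning step can be illustrated with a minimal sketch. It assumes a model that is a simple chain of operators in execution order, whereas the actual delegate mechanism works on general graphs. The `GPU_SUPPORTED` set and the `partition` helper are hypothetical names chosen for illustration, not part of the TFLite API.

```python
from itertools import groupby

# Hypothetical set of op types the GPU backend can run (an assumption
# for illustration, not the real TFLite GPU support list).
GPU_SUPPORTED = {"CONV_2D", "DEPTHWISE_CONV", "ADD", "SOFTMAX"}


def partition(ops):
    """Split an execution-ordered op list into maximal runs per device.

    Returns a list of (device, [ops]) pairs. Each "GPU" run stands for
    one delegate node; every other run falls back to the CPU.
    """
    runs = []
    for on_gpu, group in groupby(ops, key=lambda op: op in GPU_SUPPORTED):
        runs.append(("GPU" if on_gpu else "CPU", list(group)))
    return runs


model = ["CONV_2D", "DEPTHWISE_CONV", "ADD", "SQUEEZE", "SOFTMAX"]
plan = partition(model)
```

In this sketch a single unsupported operator in the middle of the chain splits the model into two GPU runs with a CPU run in between, which is why a fully compatible graph is the preferred case.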
As our mobile GPU inference engine is primarily de-
signed for high-performance execution, we first inspect the
model and resolve obvious inefficiencies. For example:
• Merging PAD as an option of another op where it was
previously described separately.
• Removing superfluous identity operations, e.g. RESIZE
with scale one or single input ADD/CONCAT.
[Figure 1 diagram: a Neural Network Graph (1x1Conv, 3x3Conv, 1x1DWConv, Add, Squeeze, Softmax) is mapped to an Execution Graph in which TFLite GPU compatible operations run in the GPU Delegate and operations not supported by TFLite GPU take the CPU Fall-back.]
Figure 1. TFLite’s delegate mechanism: Operations supported by
the GPU delegate will run on the GPU, and the rest on the CPU.
While these inefficiencies might be caught by the architect,
artifacts such as these crop up inevitably, and we should still
optimize these whenever possible.
Note that, in contrast to CPU backends which work with-
out initialization, GPU backends require initialization in-
volving shader compilation and optimization by the driver
before inference. The cost of this process depends on network
size and may range from a few milliseconds to seconds,
but it is incurred once and not again for subsequent runs until
the cache memory is invalidated for any of several reasons: the
application is updated or re-installed, the device is rebooted, the
cache memory is exhausted, or for other OS-specific reasons.
3.2. Running Inference
The inference phase is fairly straightforward. The input
tensors are reshaped to the PHWC4 format detailed later in
Section 4, if their tensor shape has a channel size not equal to
4. For each operator, shader programs are linked by binding
resources such as the operator’s input/output tensors, weights,
etc. and dispatched, i.e. inserted into the command queue.
The GPU driver then takes care of scheduling and executing
all shader programs in the queue, and makes the result avail-
able to the CPU by the CPU/GPU synchronization. There
might be a final conversion from PHWC4 to HWC, if the
output tensor has a channel size not equal to 4.
For maximum performance, one should avoid CPU/GPU
synchronization at all costs, and preferably, never leave the GPU
context if real-time processing is needed. The ideal
scenario would be the following: a camera provides an
RGBA texture that goes directly to TFLite GPU and the
output of the network is then directly rendered to the screen.
Shader Program Optimization In the GPU inference
engine, operators exist in the form of shader programs. The
shader programs eventually get compiled and inserted into
the command queue and the GPU executes programs from
this queue without synchronization with the CPU.
To reduce the number of shader programs in the com-
mand queue, we consolidate them into meaningful aggre-
gates while maximizing parallelism and well-defined data
dependencies. The following techniques are employed
when generating the source code for the shader programs:
• Fusing element-wise operators with computationally
expensive operators, e.g. activations with convolution,
to reduce the number of shader programs.
• In-lining parameters and small objects directly into the
shader program to reduce memory I/O overhead.
• Baking uniforms into the source code, instead of passing
them at run time, allowing drivers to produce
more optimal code.
• Creating specialized versions of shaders, like “convolution
with 1×1 kernel size”, to manually optimize
shaders for particular cases.
• Implementing specializations of shader programs optimized
for a certain architecture to improve the op’s
performance in that environment.
Figure 2. Example of PHWC4 memory layout (best viewed in
color). A tensor of shape (H=8, W=6, C=12) is split into 4-element
slices of size (H, W, 4) which are stored sequentially as a
contiguous 2D array of size (HC/4=24, 4W=24).
After the source code for each program is generated,
each shader gets compiled. This compilation step can take a
while, from several milliseconds to seconds. Typically, app
developers can hide this latency while loading the model or
starting the app for the first time. Once all shader programs
are compiled, the GPU backend is ready for inference.
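Two of the code generation techniques, baking constants into the source and fusing an element-wise activation into the preceding operator, can be illustrated with a minimal sketch. The `generate_shader` helper, its parameters, and the scale-then-ReLU operator are assumptions made for illustration; they do not reproduce the shaders that TFLite GPU actually emits.

```python
def generate_shader(num_elements, scale, work_group=(8, 1, 1), fuse_relu=True):
    """Return GLSL ES 3.10 compute shader source with constants baked in.

    The element count, the scale factor, and the work group size are
    written into the source as literals, so no uniforms are passed at
    run time. The ReLU is appended to the same shader (fusion) rather
    than dispatched as a separate program.
    """
    x, y, z = work_group
    lines = [
        "#version 310 es",
        f"layout(local_size_x = {x}, local_size_y = {y}, local_size_z = {z}) in;",
        "layout(std430, binding = 0) readonly buffer Src { vec4 data[]; } src;",
        "layout(std430, binding = 1) writeonly buffer Dst { vec4 data[]; } dst;",
        "void main() {",
        "  uint i = gl_GlobalInvocationID.x;",
        f"  if (i >= {num_elements}u) return;  // stub thread outside the tensor",
        f"  vec4 v = src.data[i] * {float(scale)!r};",
    ]
    if fuse_relu:
        lines.append("  v = max(v, vec4(0.0));  // fused activation")
    lines += ["  dst.data[i] = v;", "}"]
    return "\n".join(lines)


source = generate_shader(num_elements=1024, scale=0.5)
```

Because every specialization is a different source string, each one is compiled separately, which is the compilation cost discussed above.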
4. Data Layout
Most modern GPUs use a homogeneous coordinate [18]
system which represents points in space with coordinates
(x, y, z, w). A homogeneous coordinate (x, y, z, w), where
w ≠ 0, represents a point (x/w, y/w, z/w, 1) in a 3D space.
This allows affine transformations and projective transfor-
mations to be represented in the form of 4D matrix multi-
plications. GPUs are essentially processors optimized for
4-element vector compute and load/store operations.
While TFLite does not restrict tensors to a certain shape,
many operators assume 4D input/output tensors shaped as
[B,H,W,C] where B, H, W, C respectively represent
batch size, height, width, and channel size. For conve-
nience, the rest of the paper will mostly describe tensors
assuming a batch size of 1, or [H,W,C] for short. This sim-
plified example can be generalized if we consider batches to
be a concatenation of multiple [H,W,C] tensors.
In TFLite GPU, a [H,W,C] tensor is split into 4-channel
slices which are stored sequentially in memory. If the number
of channels is not divisible by 4, it is padded with zeroes.
This memory layout, called PHWC4 (Figure 2), optimally
reduces cache misses in the graphics architecture. This is
tightly coupled with how compute threads are executed on
the GPU, which defines the order of computation, and more
importantly, the order of memory load instructions.
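The conversion from [H,W,C] to PHWC4 can be illustrated with a minimal sketch on flat Python lists. The exact index order is an assumption inferred from the description of Figure 2 (slices stored one after another, each as H rows of 4W values), and `hwc_to_phwc4` is a hypothetical helper name.

```python
def hwc_to_phwc4(data, height, width, channels):
    """Reorder a flat HWC tensor into PHWC4, zero-padding the channels.

    Element (h, w, c) moves to slice c // 4, lane c % 4. Slices are
    stored sequentially, each laid out as [height][width][4].
    """
    slices = (channels + 3) // 4  # number of 4-channel slices
    out = [0.0] * (slices * height * width * 4)
    for h in range(height):
        for w in range(width):
            for c in range(channels):
                s, lane = divmod(c, 4)
                src = (h * width + w) * channels + c
                dst = ((s * height + h) * width + w) * 4 + lane
                out[dst] = data[src]
    return out


# A 1x2 image with 5 channels; pixel w holds the values 10*w + c.
pixels = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
packed = hwc_to_phwc4(pixels, height=1, width=2, channels=5)
```

With 5 channels the second slice carries a single useful lane per pixel and three padded zeroes, so the packed tensor takes as much memory as an 8-channel one.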
Figure 3. Compute shader execution grid (X=12, Y=12, Z=8)
built upon the tensor shape (H=10,W=10, C=6) shown in blue
(best viewed in color). Work group size (x=4, y=4, z=4) high-
lighted as cubes with bold lines. Each cell represents a FP32 value.
4.1. Work Groups: GPU Threading Units
A GPU compute task consists of a shader program and
a grid. Every thread executes the same shader program,
but on a different region of a 3D mesh problem space. The
global grid is made up of repeated work groups of constant
shape (x, y, z) and has a total dimension (X, Y, Z) which is
a multiple of these work groups.
Every operation in the graph has at least one output 3D
tensor. If there is more than one output tensor, we use one
of them as a basis for the compute grid size calculation.
The grid may be larger than the actual output tensor, be-
cause we expand it to sizes in multiples of 4 due to GPUs
working efficiently for those sizes. This causes the creation
of threads which do nothing and return at the beginning of
the main function, but this is faster than working with misaligned
grid sizes, which prevent efficient optimization of
byte code. The described situation is visualized in Figure 3,
where blue color highlights useful threads which will actu-
ally calculate output values, and red color highlights stub
threads. Further tuning of the compute grid/work group
sizes is described in subsection 4.2.
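The grid expansion can be checked with a short sketch that reproduces the numbers of Figure 3. The helper names `compute_grid` and `thread_counts` are hypothetical, and the sketch assumes one thread per tensor cell, as in the figure.

```python
def compute_grid(tensor_shape, work_group):
    """Round each tensor dimension up to a multiple of the work group size."""
    return tuple(-(-dim // wg) * wg for dim, wg in zip(tensor_shape, work_group))


def thread_counts(tensor_shape, work_group):
    """Return (grid, useful threads, stub threads) for one dispatch."""
    grid = compute_grid(tensor_shape, work_group)
    total = grid[0] * grid[1] * grid[2]
    useful = tensor_shape[0] * tensor_shape[1] * tensor_shape[2]
    return grid, useful, total - useful


# Figure 3: tensor (H=10, W=10, C=6) with work group (4, 4, 4).
grid, useful, stubs = thread_counts((10, 10, 6), (4, 4, 4))
```

For this shape the grid is (12, 12, 8), so 600 threads compute output values and the remaining 552 return immediately.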
Optimizations are focused on neighboring threads within
a work group - those spawned in sequential order as de-
scribed. The PHWC4 layout provides the advantage of al-
lowing neighboring threads to hit the same cache line when
requesting data for input tensors.
Threads inside a work group are executed in a particu-
lar order. Our experiments show that for each work group
channel, each row is sequentially picked in order from the
first to last, starting across W , then H and finally C. Or-
dering of work group execution is likewise sequential and
follows the same schema, as shown in Figure 3.
For a 2D convolution, we compute the result at every
output element by iterating over the weights of a convolution
kernel and its corresponding input elements covered by
a window of size (kernel_height, kernel_width). For simplicity,
we consider the case of a 1×1 convolution window.
In this case, only one input cell is needed to calculate
one output element. As we work with 3D tensors, every cell
is implied to be a vector of channels. For this operation, every
thread at the very first iteration of its loop requests the first 4
channels of the appropriate cell. A compulsory cache miss
occurs on the initial thread request (for 16 bytes, or 4 float
values), which triggers the actual data load. When this occurs,
the hardware memory manager loads the whole cache
line and not just the requested 16 bytes. Since the cache
line size on most mobile GPUs is 64 bytes, this results in
the loading of the next 48 bytes as well. Since all threads
execute the same shader code, the neighboring threads
issue the same load as the first one, each for the 16 bytes
that follow the previous thread’s request. Organizing threads
in this way is an efficient strategy for memory loading, as the
next (neighboring) input values will already be available when
requested, having been loaded as part of the same cache line
by the initial thread (Figure 4).
Figure 4. Cache hit by 4 neighboring threads. When threads
T0–T3 each issue a 16-byte load of memory blocks i0–i3 that are
contiguous in memory, the first load can fill the 64-byte cache line,
benefiting the other threads with no additional cost in memory I/O.
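The effect of contiguous loads on cache line fills can be illustrated with a small counting sketch. It assumes an initially empty cache that never evicts and loads that are 16-byte aligned; `count_line_fills` is a hypothetical helper, not a model of any particular GPU.

```python
CACHE_LINE_BYTES = 64  # cache line size cited for most mobile GPUs
LOAD_BYTES = 16        # one vec4 of FP32 values


def count_line_fills(addresses):
    """Count cache line fills for a sequence of aligned 16-byte loads."""
    loaded_lines = set()
    fills = 0
    for address in addresses:
        line = address // CACHE_LINE_BYTES
        if line not in loaded_lines:
            loaded_lines.add(line)  # compulsory miss: the whole line is loaded
            fills += 1
    return fills


# Four neighboring threads reading contiguous blocks i0..i3.
contiguous = [thread * LOAD_BYTES for thread in range(4)]
# Four threads whose blocks are each one cache line apart.
strided = [thread * CACHE_LINE_BYTES for thread in range(4)]
```

The contiguous pattern needs one fill for four threads, while the strided pattern needs one fill per thread.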
4.2. Work Group Size Selection
The work group size for executing shader programs de-
fines the group of threads which share data inside the work
group. Depending on the GPU, picking the right work
group size can result in increased performance, whereas
picking the wrong one can result in unexpected slowdowns.
Arm Mali GPUs, for instance, show robust performance in-
dependent of configured work group sizes and tuning them
only results in a nominal performance gain typically less
than 5%. Qualcomm Adreno GPUs, on the other hand, are
extremely sensitive to well-configured work group sizes and
tuning these can give up to a 30% performance boost.
Tuning the work group size is unfortunately difficult as
GPU internals are not available to the user either directly
(via the API), or indirectly (via some assembly representa-
tion of internal state). Threads are executed in groups called
“waves” and knowing the wave size is crucial to optimizing
the work group size as they fine-tune the memory usage of
neighboring threads. Devising an algorithmic selection of
optimal work group size thus becomes an exhaustive search.
Note that selecting the wrong work group size may slow
down execution by 5–7 times on Adreno GPUs.
Despite these challenges, we conducted extensive investigations
into optimizing the work group size, focusing primarily
on CONV_2D and DEPTHWISE_CONV, as these make
up nearly 90% of the workload for convolutional networks.
While the algorithmic solution is not perfect, the alternative
brute-force approach is impractical for real time applica-
tions because the work group investigation for a model may
take several minutes. In addition, measurements may be in-
consistent due to device temperature, resource racing, etc.,
causing the true global optimal work group size to change
from one inference to another.
Because of these fluctuations, we approximate a reason-
able optimum within the neighborhood region of the global
optimum given an inference time function T(W, C), where
W is the work group size and C identifies the convolution
configuration. The domains of the function parameters are:
• Work group dimensions W: 2, 4, or 8
• Convolution configurations C search space:
◦ CONV_2D weights 1×1, 2×2, 3×3, or
◦ DEPTHWISE_CONV input and output shapes from
(8, 8, 8) to (128, 128, 128), and
◦ Strides 1×1, 2×2, 3×3
Given the search space defined by the convolution configuration,
a gradient descent approach allows us to converge on
stable optimum work groups where expected performance
varies by roughly 10% from one inference to the next. From this
region of stable work groups, an approximate optimal work group
can be selected for every device and convolution type combination.
Work groups from Table 2 are currently used in
TFLite GPU and their stability is statistically proven. While
they do not necessarily result in peak optimal time across all
parameters, they are reliable in giving top 10% performance
regardless of the convolution parameters.
Adreno GPU Model   CONV_2D     DEPTHWISE_CONV
630                (4, 8, 4)   (4, 4, 8)
540                (8, 2, 2)   (8, 8, 2)
510                (8, 4, 4)   (8, 4, 4)
509                (8, 4, 8)   (8, 4, 2)
50X/4XX            (8, 4, 8)   (8, 4, 8)
Table 2. Optimal work group sizes for Adreno GPUs.
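At run time, a table like this can be consulted with a simple lookup, sketched below. The values are those of Table 2. The `pick_work_group` helper and its rule for matching the "50X/4XX" row (any other three-digit model starting with "50" or "4") are assumptions made for illustration, as is returning `None` for models the table does not cover.

```python
# Work group sizes from Table 2, keyed by Adreno model and op type.
ADRENO_WORK_GROUPS = {
    "630": {"CONV_2D": (4, 8, 4), "DEPTHWISE_CONV": (4, 4, 8)},
    "540": {"CONV_2D": (8, 2, 2), "DEPTHWISE_CONV": (8, 8, 2)},
    "510": {"CONV_2D": (8, 4, 4), "DEPTHWISE_CONV": (8, 4, 4)},
    "509": {"CONV_2D": (8, 4, 8), "DEPTHWISE_CONV": (8, 4, 2)},
    "50X/4XX": {"CONV_2D": (8, 4, 8), "DEPTHWISE_CONV": (8, 4, 8)},
}


def pick_work_group(adreno_model, op_type):
    """Return the tabulated work group size, or None if not covered."""
    model = str(adreno_model)
    if model not in ADRENO_WORK_GROUPS:
        # Assumed reading of the "50X/4XX" row: other 50x and 4xx models.
        is_three_digits = len(model) == 3 and model.isdigit()
        if is_three_digits and (model.startswith("50") or model[0] == "4"):
            model = "50X/4XX"
        else:
            return None
    return ADRENO_WORK_GROUPS[model].get(op_type)
```

A caller would fall back to a default or to a tuning pass when the lookup returns `None`.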
5. Memory Manager for Intermediate Tensors
While we allocate GPU memory for all input/output ten-
sors and tensors holding the trained weights, we do not allo-
cate memory for all intermediate tensors between the oper-
ators separately, as they do not have to co-exist in memory
simultaneously. This is an important optimization to reduce
the memory footprint of the GPU run-time.
During initialization, we first topologically sort the net-
work to determine the execution order of each operator, and
the correspondingly required tensors. For each intermediate
tensor, we can determine the first and the last operator that
uses this tensor either as input or output. Once the last “con-
sumer” of an intermediate tensor has finished executing, the
memory for the said intermediate tensor can be re-used for
other intermediate tensors. To minimize the total required
memory allocation, we have devised a strategy to determine
when this final operator execution has occurred. This prob-
lem is NP-complete [22].
We compared three algorithms for managing the intermediate
tensors: (a) a naïve algorithm, (b) a greedy algorithm,
and (c) a minimum-cost flow algorithm. The first just
naïvely allocates all memory necessary and only serves as
a baseline for comparison. The latter two implement smart
memory management and use the concept of “shared objects”,
by which we refer to allocated memory that is used
for more than one tensor during inference, but for no more than
one at a time. The size of the shared object is the
maximum of the sizes of the tensors that it is used for. For example,
if a shared object S is used for tensor a, re-used for tensor b,
and later for tensor c, the size of the shared object S needs
to be size_S = max(size_a, size_b, size_c).
The Greedy Algorithm is summarized in Algorithm 1.
We iterate through all operators in topological execution order.
If an output tensor of the current operator is an intermediate
tensor, it is assigned to a newly created shared object
if the pool of shared objects is empty (L.7), or to an existing
shared object that has the closest size by absolute difference
to t.size (L.9), which gets removed from the available
pool (L.10). If t.size > S.size, then the shared object’s
buffer size is increased (L.11–12). This shared object S is
inserted into the set of currently used objects (L.14). After
the output tensors, the input tensors are inspected. If an input
tensor is an intermediate tensor and the current operator
is the last consumer, we remove the shared object that is assigned
to this tensor from the set of currently used objects,
and add it back to the pool of shared objects (L.17–19).
Algorithm 1 Greedy Memory Management
 1: available_objects ← ∅
 2: used_objects ← ∅
 3: for each op ∈ operators do
 4:   for each t ∈ op.outputs do
 5:     if t is intermediate then
 6:       if available_objects = ∅ then
 7:         S ← new shared object with size t.size
 8:       else
 9:         S ← available_objects.find(t.size)
10:         available_objects.remove(S)
11:         if t.size > S.size then
12:           S.size ← t.size
13:       t.shared_object ← S
14:       used_objects.insert(S)
15:   for each t ∈ op.inputs do
16:     if t is intermediate and op is its last consumer then
17:       S ← t.shared_object
18:       used_objects.remove(S)
19:       available_objects.insert(S)
This algorithm has a run-time complexity of O(n log n),
where n is the number of intermediate tensors. We use a
binary search tree for the pool of shared objects and a binary
heap priority queue for the set of currently used objects.
A straightforward implementation of the same algorithm
without these data structures has a run-time complexity
of O(n^2). For the neural network from Figure 5, this
approach re-uses the memory of the output tensor of vertex 0 for
the output tensor of vertex 2, and the memory of the output tensor
of vertex 1 for the output tensor of vertex 4. The total size of
allocated memory is 104.
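The greedy strategy can be sketched in a few lines of Python. This is the straightforward O(n^2) variant with a plain list as the pool, not the binary search tree version, and it omits the set of used objects because that set is only bookkeeping here. The tensor sizes are the ones shown in Figure 5. The edges of that graph are not given in the text, so the `ops` list below is an assumed topology, chosen to be consistent with the reported result.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare shared objects by identity, not by size
class SharedObject:
    size: int


def greedy_assign(ops, sizes):
    """Assign shared objects to intermediate tensors.

    ops:   list of (inputs, outputs) tensor-name tuples in execution order.
    sizes: maps each intermediate tensor name to its size. Tensors that
           are not in `sizes` (graph inputs and outputs) are ignored.
    """
    last_consumer = {}
    for index, (inputs, _) in enumerate(ops):
        for t in inputs:
            last_consumer[t] = index

    available, objects, assignment = [], [], {}
    for index, (inputs, outputs) in enumerate(ops):
        for t in outputs:
            if t not in sizes:
                continue
            if not available:
                shared = SharedObject(sizes[t])
                objects.append(shared)
            else:
                # Closest size by absolute difference (a linear scan here).
                shared = min(available, key=lambda o: abs(o.size - sizes[t]))
                available.remove(shared)
                shared.size = max(shared.size, sizes[t])
            assignment[t] = shared
        for t in dict.fromkeys(inputs):  # de-duplicate repeated inputs
            if t in sizes and last_consumer[t] == index:
                available.append(assignment[t])
    return assignment, objects


sizes = {"t0": 32, "t1": 8, "t2": 4, "t3": 8, "t4": 64}
ops = [  # assumed edges; only the sizes come from Figure 5
    ((), ("t0",)),
    (("t0",), ("t1",)),
    (("t1",), ("t2",)),
    (("t1",), ("t3",)),
    (("t2",), ("t4",)),
    (("t3", "t4"), ()),
]
assignment, objects = greedy_assign(ops, sizes)
total = sum(shared.size for shared in objects)
```

Under the assumed topology, the sketch re-uses the object of `t0` for `t2` and the object of `t1` for `t4`, for a total of 104.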
The Minimum-Cost Flow Algorithm involves creating
an auxiliary flow network and solving the minimum-cost
flow problem (MCFP) [5]. First, we insert two vertices for
each intermediate tensor x and denote them l_x and r_x, along with
two special vertices for the source s and the sink t. Then,
we add directed edges to the flow network:
1. For each x in 1..N, add an edge from s to r_x with
capacity 1 and cost size_x. For tensor x, we can allocate
a new shared object of size size_x.
2. If a shared object allocated for tensor x can be re-used
for tensor y, then add an edge from l_x to r_y with capacity
1 and cost max(0, size_y − size_x). If tensor y
is greater in size than tensor x, we can re-use the corresponding
shared object, but we might need to allocate
size_y − size_x of additional memory. This is not always
the case, as the shared object can already have a
size greater than size_x, but it is a good approximation.
3. For each x in 1..N, add an edge from s to l_x with capacity
1 and cost 0.
4. For each x in 1..N, add an edge from r_x to t with capacity
1 and cost 0.
[Figure 5 diagram: six vertices labeled with execution order and output tensor size: 0 (32), 1 (8), 2 (4), 3 (8), 4 (64), 5 (no size).]
Figure 5. An example neural net. Each vertex corresponds to an
op. The upper number denotes the execution order, and the lower
number the size of its output intermediate tensor. The last op does
not have the latter as its output is not an intermediate tensor.
[Figure 6 diagram: source s, vertices l_0–l_4 and r_0–r_4, and sink t; labeled edge costs 32, 8, 4, 32, 56, 8, 64.]
Figure 6. The flow network for the neural network in Figure 5.
Capacity of each edge is 1. Saturated edges, i.e. the final assignment
of shared objects to tensors, are shown as solid lines.
After building the flow network, we solve the MCFP
with the Shortest Path Faster Algorithm (SPFA) [19] or Johnson’s
algorithm [15]. With SPFA, the run-time complexity
is O(N^4), but it can be reduced to O(N^3) by decreasing the
number of edges of type 2. Figure 6 shows a flow network
and the result of executing this algorithm on the example graph
from Figure 5. The minimum-cost flow approach re-uses the memory
of the output tensor of vertex 0 for the output tensor of vertex
4. The total size of allocated memory is 84.
If an edge of type 1 (from s to r_x) is saturated by the
flow, i.e. its residual capacity is equal to 0, we create a new
shared object for the tensor x. If an edge of type 2 (from
l_x to r_y) is saturated by the flow, we assign the same shared
object to tensor y that was used by tensor x. After execution
of the algorithm, the amount of flow will be equal to
N. This means that the resulting flow network has information
about the assignment of shared objects for all N intermediate
tensors. The size of each shared object is determined by the
maximum size of all tensors assigned to it.
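The construction can be illustrated with a compact sketch that builds the four edge types and solves the problem by successive shortest paths found with SPFA. The `MinCostFlow` class and the `mcfp_assign` helper are hypothetical names. The re-use condition (tensor x is last used strictly before tensor y is produced) and the tensor lifetimes are assumptions, since the edges of Figure 5 are not given in the text; only the sizes come from the figure.

```python
from collections import deque


class MinCostFlow:
    def __init__(self, num_nodes):
        self.graph = [[] for _ in range(num_nodes)]

    def add_edge(self, u, v, capacity, cost):
        # Each edge is [target, residual capacity, cost, index of reverse edge].
        forward = [v, capacity, cost, len(self.graph[v])]
        backward = [u, 0, -cost, len(self.graph[u])]
        self.graph[u].append(forward)
        self.graph[v].append(backward)
        return forward

    def solve(self, source, sink):
        flow = cost = 0
        n = len(self.graph)
        while True:
            dist = [float("inf")] * n
            parent = [None] * n
            in_queue = [False] * n
            dist[source] = 0
            queue = deque([source])
            while queue:  # SPFA: queue-based Bellman-Ford
                u = queue.popleft()
                in_queue[u] = False
                for edge in self.graph[u]:
                    v, capacity, weight, _ = edge
                    if capacity > 0 and dist[u] + weight < dist[v]:
                        dist[v] = dist[u] + weight
                        parent[v] = (u, edge)
                        if not in_queue[v]:
                            queue.append(v)
                            in_queue[v] = True
            if parent[sink] is None:
                return flow, cost
            v = sink
            while v != source:  # push one unit; every capacity is 1
                u, edge = parent[v]
                edge[1] -= 1
                self.graph[v][edge[3]][1] += 1
                v = u
            flow += 1
            cost += dist[sink]


def mcfp_assign(sizes, lifetimes):
    """lifetimes maps a tensor to (producing op index, last consuming op index)."""
    names = list(sizes)
    n = len(names)
    source, sink = 0, 1
    left = {x: 2 + i for i, x in enumerate(names)}
    right = {x: 2 + n + i for i, x in enumerate(names)}
    net = MinCostFlow(2 + 2 * n)
    reuse_edges = []
    for x in names:
        net.add_edge(source, right[x], 1, sizes[x])  # type 1: new object
        net.add_edge(source, left[x], 1, 0)          # type 3
        net.add_edge(right[x], sink, 1, 0)           # type 4
        for y in names:
            if lifetimes[x][1] < lifetimes[y][0]:    # type 2: x can serve y
                extra = max(0, sizes[y] - sizes[x])
                reuse_edges.append((x, y, net.add_edge(left[x], right[y], 1, extra)))
    flow, cost = net.solve(source, sink)
    reuse = {y: x for x, y, edge in reuse_edges if edge[1] == 0}
    return flow, cost, reuse


sizes = {"t0": 32, "t1": 8, "t2": 4, "t3": 8, "t4": 64}
lifetimes = {"t0": (0, 1), "t1": (1, 3), "t2": (2, 4), "t3": (3, 5), "t4": (4, 5)}
flow, cost, reuse = mcfp_assign(sizes, lifetimes)
```

Under these assumed lifetimes, the flow equals N = 5, the only saturated type 2 edge assigns the object of `t0` to `t4`, and the flow cost is 84. In general the flow cost only approximates the allocated size, as noted for edges of type 2.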
There is no clear winner between these two memory
management algorithms in terms of the minimal memory
footprint; it depends on the network (Table 3). TFLite
GPU uses the greedy algorithm by default, with the developer
being able to choose the MCFP algorithm if desired.
Strategy   MobileNet   MobileNetV2   DeeplabV3
Naïve      9.6         13.2          24.3
Greedy     2.3*        4.0           3.6*
MCFP       2.7         3.8*          4.2
Table 3. Total memory allocated (in MB) for all intermediate tensors.
Naïve means no memory manager and serves as baseline.
An asterisk marks the smallest memory footprint for each model.
Figure 7. Average inference latency (in milliseconds) of TFLite
GPU (orange) compared to CPU (gray) on various neural net-
works, run on a variety of smartphones (best viewed in color).
6. Results
Figure 7 illustrates the performance of GPU inference
compared to CPU inference in TFLite for various neural
networks which generally demonstrates a 2–9× speedup.
The first 10 warm-up runs were skipped for benchmarking
and averages are based on the 100 subsequent inferences.
This profiling revealed that TFLite GPU is often bound by
memory bandwidth and we typically only see 20–40% ALU
utilization. On iOS devices, we benefit from larger cache
sizes that result in reduced memory I/O latency, and hence,
better performance than the OpenGL backend.
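The measurement protocol described above (10 skipped warm-up runs followed by an average over 100 inferences) can be sketched as follows. The `benchmark` helper and the `run_inference` callable are hypothetical; the callable is assumed to return only after the result is available on the CPU.

```python
import time


def benchmark(run_inference, warmup=10, runs=100):
    """Return the average latency in milliseconds over `runs` calls.

    The first `warmup` calls are executed but not timed, so that
    one-time costs such as shader compilation are excluded.
    """
    for _ in range(warmup):
        run_inference()
    start = time.perf_counter()
    for _ in range(runs):
        run_inference()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs


calls = []
average_ms = benchmark(lambda: calls.append(1))
```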
Table 4 and Table 5 show the average inference latency
of iOS- and Android-compatible ML frameworks on Mo-
bileNet v1, respectively. Note that TFLite GPU employs
OpenGL for the widest coverage with reasonable perfor-
mance. MACE and SNPE employ OpenCL and may out-
perform TFLite GPU on some mobile devices shipped with
OpenCL. As OpenCL is not a part of the standard Android
distribution, apps using those frameworks may not be able
to guarantee their inference performance e.g. on Google
Pixel devices. Also note that SNPE does not run on devices
with Arm Mali GPUs.
Figure 8 shows how inference performance degrades
over a sustained period of time due to thermal throttling of
the device. Mobile inference by applications typically occurs
in one of two modes: one-time detection or ongoing
run-time data processing. For one-time inference, e.g. object
detection, an application may achieve the peak performance
illustrated in the left half of the graph in Figure 8, where
device temperature is nominal. For ongoing run-time inference,
e.g. video segmentation, the right half illustrates the
potential impact of thermal throttling under sustained load.
iOS Device   TFLite GPU   MPSCNN   CoreML
iPhone Xs    2.3          4.1      7.1
iPhone 7     5.5          7.9      42
iPhone 6     31           92       116
Table 4. Average inference latency (in milliseconds) of iOS-
compatible ML frameworks on MobileNet v1.
Android Device                    TFLite GPU   MACE    SNPE
Samsung S9 (Adreno 630)           13           12      6.9
Xiaomi Mi8 SE (Adreno 616)        35.9         29.6    20
Huawei P20 Pro (Mali G72-MP12)    13.5         45      N/A¹
Google Pixel 2 (Adreno 540)       18           N/A²    N/A²
Google Pixel 3 (Adreno 630)       12.5         N/A²    N/A²
Table 5. Average inference latency (in milliseconds) of Android-
compatible ML frameworks on MobileNet v1. Note that TFLite
GPU employs OpenGL and thus has the widest coverage with reasonable
performance. MACE and SNPE employ OpenCL and may
run faster on devices shipped with OpenCL, but may not run on
all devices. ¹ Arm Mali GPUs are not compatible with SNPE.
² Google Pixel devices do not support OpenCL.
In order to avoid data transfer delays, real-time applica-
tions usually place neural network input/output tensors in a
GPU texture or buffer. TFLite GPU allows using CPU-side
tensors as input/output as well. Additionally, CPU-to-GPU
data-transfer efficiency can be controlled via time- or power-efficient
synchronization mechanisms. The most power-efficient
one suspends waiting threads until the GPU completes
its task. The fastest option, by comparison, employs
an active spin-lock approach, reducing data acquisition delays
by avoiding operating system process re-scheduling.
7. Conclusion
In this paper, we presented the architectural design of
TFLite GPU. We described the properties of mobile GPUs
and explained the optimization techniques we employed for fast
memory I/O, small run-time memory footprint, and fast
compute shader execution. With these, we aim to help
network architects become mobile GPU-aware when they design
their networks.
From our discussion of mobile GPU-friendly data layout
2
5
5
0
7
5
1
0
0
1
2
5
1
5
0
1
7
5
2
0
0
0
5
10
15
20
25
30
iPhone Xs TFLite CPU
Pixel2 TFLite GPU
Pixel3 TFLite GPU
iPhone Xs CoreML NPU
iPhone 7 TFLite GPU
iPhone Xs MPS GPU
iPhone Xs TFLite GPU
Figure 8. Inference latency (in milliseconds) for MobileNet v1
over extended period of time [0, 200]sec (best viewed in color).
PHWC4, neural network designers should know that any
kind of RESHAPE is significantly more expensive on the
GPU than on the CPU. The network itself will learn the
weights regardless of the RESHAPE op, so it is best to skip
the operator entirely if a RESHAPE operation was inserted
just for the convenience of the architect.
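To see why a RESHAPE stops being free, consider the following minimal sketch. It assumes, following the layout described earlier in the paper, that PHWC4 stores a logical [H, W, C] tensor as ceil(C/4) slices, each laid out as [H, W, 4] with zero padding; the helper `to_phwc4` is a hypothetical name introduced only for this illustration.

```python
def to_phwc4(flat, h, w, c):
    """Re-pack a row-major [H, W, C] buffer into 4-channel slices: [S][H][W][4]."""
    slices = (c + 3) // 4
    out = []
    for s in range(slices):
        for y in range(h):
            for x in range(w):
                for k in range(4):
                    ch = 4 * s + k
                    # Channels beyond C are zero padding.
                    out.append(flat[(y * w + x) * c + ch] if ch < c else 0)
    return out


# In a linear HWC layout, reshaping [1, 2, 8] to [1, 4, 4] is a no-op:
# both shapes are views of the same 16 values.
data = list(range(1, 17))

before = to_phwc4(data, 1, 2, 8)  # logical shape [1, 2, 8], two slices
after = to_phwc4(data, 1, 4, 4)   # logical shape [1, 4, 4], one slice

print(before)
print(after)
print("same memory order:", before == after)
```

Under this assumed layout the two buffers differ, so the reshape has to move data on the GPU even though it would only change metadata for a linear CPU buffer.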
For the same reason, if the mobile device can produce
RGBA rather than RGB, it is now apparent that using the
former can avoid a conversion, i.e. memory copy, from
RGBA to RGB. Similarly, if the mobile device can render
a 4-channel tensor, i.e. RGBA, directly, that can be a
better choice than the RGB counterpart. This choice benefits
not just the graph input/output, but also its intermediate
tensors. Likewise, since a tensor of shape [B, H, W, 5],
for instance, is twice as expensive as [B, H, W, 4] but
about the same as [B, H, W, 8], the architect can tune
around those 4-channel boundaries rather than trying to
optimize on other boundaries.
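The 4-channel boundary argument reduces to simple arithmetic, sketched below. The function names `phwc4_slices` and `padded_channels` are hypothetical, and the sketch assumes that cost scales with the number of 4-channel slices, which is a simplification of the behavior described above.

```python
import math


def phwc4_slices(channels):
    """Number of 4-channel slices needed to hold `channels` channels."""
    return math.ceil(channels / 4)


def padded_channels(channels):
    """Channel count actually stored and processed after padding to a multiple of 4."""
    return 4 * phwc4_slices(channels)


for c in (3, 4, 5, 8):
    print(f"C={c}: slices={phwc4_slices(c)}, padded channels={padded_channels(c)}")
```

With 5 channels the tensor occupies two slices, the same as with 8 channels and twice as many as with 4, so growing a layer from 5 to 8 channels is roughly free under this model while shrinking it from 5 to 4 halves the work.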
TFLite GPU is still in its early development stages. We
plan to investigate several areas, including employing
additional GPU-specific optimizations to improve inference
speed further and expanding support for more operations,
e.g. understanding more about recurrent networks or LSTMs
and how we can optimize those for GPUs. Finally, we are
extensively exploring other GPU backends such as OpenCL
and Vulkan to achieve better ALU utilization.
Acknowledgements
We would like to acknowledge our colleagues at
TensorFlow Lite: Lawrence Chan, Tim Davis, Jared Duke,
Yu-Cheng Ling, Andrew Selle, Sarah Sirajuddin, and Pete
Warden. We are also grateful to Aleksandr Ignashev for the
figures in this paper and Karthik Raveendran for his valuable
feedback.
References
[1] Metal Shading Language Specification. Apple Inc., 2014. 1,
2
[2] Apple Inc. Core ML.
https://developer.apple.com/documentation/coreml.
[Online; accessed Apr 8, 2019]. 2
[3] Apple Inc. Metal Performance Shaders.
https://developer.apple.com/documentation/metalperformanceshaders.
[Online; accessed Apr 8, 2019]. 2
[4] Arm Ltd. Compute Library.
https://developer.arm.com/ip-products/processors/machine-learning/compute-library.
[Online; accessed Apr 8, 2019]. 2
[5] Wikipedia contributors. Minimum-Cost Flow Problem.
https://en.wikipedia.org/w/index.php?title=Minimum-cost_flow_problem&oldid=883493365.
[Online; accessed Apr 8, 2019]. 6
[6] Facebook Inc. Caffe2. https://caffe2.ai. [Online;
accessed Apr 8, 2019]. 2
[7] Google LLC. Neural Networks API.
https://developer.android.com/ndk/guides/neuralnetworks.
[Online; accessed Apr 8, 2019]. 2
[8] Google LLC. TensorFlow Lite.
https://www.tensorflow.org/lite. [Online;
accessed Apr 8, 2019]. 1
[9] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu,
Tien-Ju Yang, and Edward Choi. MorphNet: Fast & Sim-
ple Resource-Constrained Structure Learning of Deep Net-
works. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 1586–1595, 2018. 2
[10] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry
Kalenichenko, Weijun Wang, Tobias Weyand, Marco An-
dreetto, and Hartwig Adam. MobileNets: Efficient Con-
volutional Neural Networks for Mobile Vision Applications.
arXiv preprint arXiv:1704.04861, 2017. 2
[11] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu,
Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna,
Yang Song, Sergio Guadarrama, and Kevin Murphy.
Speed/Accuracy Trade-offs for Modern Convolutional Ob-
ject Detectors. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 7310–7311, 2017. 2
[12] Huawei Technologies Co., Ltd. HiAI Engine.
https://developer.huawei.com/consumer/en/devservice/doc/2020315.
[Online; accessed Apr 8, 2019]. 2
[13] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth
Vanhoey, and Luc Van Gool. DSLR-Quality Photos on Mo-
bile Devices with Deep Convolutional Networks. In IEEE
International Conference on Computer Vision, pages 3277–
3285, 2017. 2
[14] Andrey Ignatov, Radu Timofte, et al. PIRM Challenge on
Perceptual Image Enhancement on Smartphones: Report. In
European Conference on Computer Vision, pages 315–333.
Springer, 2018. 2
[15] Donald B. Johnson. Efficient Algorithms for Shortest Paths
in Sparse Networks. Journal of the ACM, 24:1–13, 1977. 6
[16] Jon Leech, editor. OpenGL ES Version 3.1. The Khronos
Group Inc., 2016. 1, 2
[17] MediaTek Inc. What is MediaTek NeuroPilot?
https://www.mediatek.com/blog/what-is-mediatek-neuropilot.
[Online; accessed Apr 8, 2019]. 2
[18] August F. Möbius. Der baryzentrische Calcül. 1827. 3
[19] Edward F. Moore. The Shortest Path Through a Maze. In
International Symposium on the Theory of Switching, pages
285–292, 1959. 6
[20] Qualcomm Inc. Snapdragon Neural Processing Engine SDK.
https://developer.qualcomm.com/docs/snpe.
[Online; accessed Apr 8, 2019]. 2
[21] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh-
moginov, and Liang-Chieh Chen. MobileNetV2: Inverted
Residuals and Linear Bottlenecks. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 4510–
4520, 2018. 2
[22] Ravi Sethi. Complete Register Allocation Problems. SIAM
Journal on Computing, 4:226–248, 1975. 5
[23] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen,
Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad Isaac,
Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu,
Lin Qiao, Brandon Reagen, Joe Spisak, Fei Sun, Andrew
Tulloch, Peter Vajda, Xiaodong Wang, Yanghan Wang, Bram
Wasti, Yiming Wu, Ran Xian, Sungjoo Yoo, and Peizhao
Zhang. Machine Learning at Facebook: Understanding In-
ference at the Edge. In IEEE International Symposium on
High-Performance Computer Architecture, 2019. 2
[24] Xiaomi. MACE. https://github.com/XiaoMi/mace.
[Online; accessed Apr 8, 2019]. 2
