Performance evaluation of deep learning on smartphones by Srivastava, Abhishek
© 2019 Abhishek Srivastava




Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
in the Graduate College of the





Deep Learning powers a variety of applications, from self-driving cars and autonomous
robotics to web search and voice assistants. It is fair to say that it is omnipresent and here
to stay. It is deployed in all sorts of devices, ranging from consumer electronics to the Internet
of Things (IoT). Such a deployment is categorized as inference at the edge. This thesis
focuses on Deep Learning on one such edge device: the mobile phone. The thesis surveys the
space of Deep Learning deployment on mobile devices and identifies three key problems:
(a) the lack of a common programming interface, (b) a dearth of benchmarking systems and
(c) a shortage of in-depth performance evaluation. It then provides a solution to each of
them by (a) providing a common interface derived from MLModelScope [1], referred to
as the mobile Predictor (mPredictor), (b) providing a benchmarking application and (c) using
the aforementioned mPredictor and benchmarking application to perform a detailed evaluation.
This work has been developed to assist a generic mobile developer in integrating a Deep
Learning service into their application.
To my family, for their love and support.
To my friends, for their constant presence and guidance.
ACKNOWLEDGMENTS
My journey at University of Illinois at Urbana-Champaign has been a fun roller coaster
ride, filled with many positive and a few testing moments. This journey would not have been
possible without the guidance of my advisor, Professor Wen-Mei Hwu. I took up his course
Electrical and Computer Engineering (ECE) 408 in my first semester due to my interest in
heterogeneous computing. It was undoubtedly the best curriculum experience of my life,
which left me mesmerized by Professor Hwu’s teaching style. I am extremely thankful to
him for taking me under his wings and guiding me throughout my graduate school. I would
like to give a special mention to Dr. Jinjun Xiong from IBM Research Labs for his invaluable
discussions and guidance.
Being part of Illinois Microarchitecture Project using Algorithms and Compiler Technology
(IMPACT) research group has been a very fruitful experience. It is not often that one is
surrounded by so many technically gifted yet grounded colleagues. I would like to give a
special mention to Abdul Dakkak and Cheng Li for helping me out and guiding me during
our work on MLModelScope. I hope their efforts on this project are rewarded through its
usage by people. I would also like to thank Carl Pearson, Vikram Sharma Mailthody, Simon
Garcia De Gonzalo and the rest of the members of the group. Special mention to Marie-Pierre
Lassiva Moulin for making our lives much easier.
Graduate school is a very competitive place, driven by results and filled with many talented
and ambitious students. It truly can be a struggle. One needs a good set of people around
to keep him sane. Fortunately, I had a group of friends who did that for me. I would like to
thank Siddhartha Satapathy, Dhruv Agarwal, Rohit Agrawal and Dipali Ranjan for being
my backbone, and hopefully I was theirs. A special mention to Kartik Hedge with whom I
realised how fun research could be.
Finally, I owe all my success to my family. I would be absolutely nothing without them. I
would like to thank my mother for her constant emotional support, my father for his personal
sacrifices and my sister for keeping me grounded and motivated throughout my journey.
TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 MOTIVATION
2.1 Lack of Common Interface
2.2 Dearth of Benchmarking Systems
2.3 Lack of Evaluation
CHAPTER 3 IMPLEMENTATION
3.1 MLModelScope
3.2 Mobile Predictor (mPredictor)
3.3 Create a Framework mPredictor
3.4 Use a Framework mPredictor
CHAPTER 4 EVALUATION
4.1 MLModelScope Mobile Agent
4.2 Model Classification
4.3 Model Versioning
4.4 Model Optimization
4.5 Model Inference
4.6 Hardware Acceleration
CHAPTER 5 CONCLUSION
REFERENCES
CHAPTER 1: INTRODUCTION
Artificial Intelligence (AI) applications have exhibited a massive resurgence in the past
decade. This resurgence has been led by Deep Learning [2]. It is the foundation of AI
applications such as cancer detection [3], self-driving cars [4], and voice assistant systems like
Google Assistant and Apple Siri. In fact, it is able to exceed human accuracy in many
of these tasks. The reason behind the success of Deep Learning models is their ability to
use statistical learning techniques over a large amount of data to extract high-level features
from raw input data, and to provide an effective representation of an input space with minimal
human intervention.
Deep Learning models are made up of multiple horizontally and vertically stacked
computational blocks called neurons. Each neuron takes input activations and weights as inputs,
computes a weighted sum across the input activations, applies a nonlinear function and then
outputs the so-called output activations. The organization of these neurons, along with their
weighted connections, defines a Deep Learning model architecture. Given the Deep Learning
model architecture and a sufficient amount of training data, the network undergoes the process
of learning weights, which is called training. Once trained, the network can perform its task
by taking an input and computing the output using the learned weights. This process is
called inference [5]. Deep Learning models are trained once and then deployed in real-world
applications to perform repeated inference. Note that they may be re-trained from time to
time based on the availability of new training data.
Inference can be performed at different locations in the application pipeline. Given that
Deep Learning models are generally compute bound, most inference was performed
on cloud servers until recently. In the past couple of years, the space of edge computing in Deep
Learning has exploded. Inference at the edge, as it is commonly called, refers
to performing computation at the end, or "edge", of the network as opposed to sending the data to
the cloud. This saves the considerable expense and time involved in transmitting and receiving
data across the network, thus improving performance. This is especially relevant
for user-facing applications. Also, since the data is no longer transmitted over a publicly
accessible network, privacy issues are alleviated. Common edge devices include robots, Internet
of Things (IoT) devices, cars and so on.
The mobile device is one such edge device. As of November 2019, 5.15 billion people have
mobile devices, which corresponds to 66.6% of the population [6]. This implies that two out of every
three people possess a mobile device. These devices run applications like Facebook, Instagram,
WhatsApp, Google Assistant, Siri and so on, all of which employ Deep Learning techniques.
Wu et al. [7] claim that over 1 billion people use Facebook and related application services.
These numbers indicate that mobile devices are probably the most important and prominent edge
devices in the market today. Hence, this thesis explores the space of Deep Learning on mobile
devices.
Through an extensive literature survey, the thesis raises fundamental issues that constrain
mobile developers in integrating Deep Learning services into their applications. Firstly, the
plethora of software frameworks and potential hardware backends for Deep Learning imposes
a large amount of programming overhead on developers. Secondly, there is a dearth of
publicly available benchmarking systems aimed at Deep Learning. Lastly, there is a clear
absence of performance evaluation in this space that could guide developers in
the process. The thesis motivates and identifies these problems and provides
a solution to each one of them.
Contributions. To summarize, this work makes the following contributions:
1. It surveys the space of Deep Learning on mobile devices to identify three key issues:
the lack of a common programming interface, a dearth of benchmarking systems and a lack of
performance evaluation. This is covered in Chapter 2.
2. It provides a solution to the first issue by presenting a framework-agnostic programming
interface derived from MLModelScope [1] and adapted for mobile devices. Chapter 3
elaborates on the interface, referred to as mPredictor, and illustrates how to create
and use a framework mPredictor.
3. It provides a solution to the second issue by presenting a benchmarking application,
referred to as the MLModelScope mobile agent. Chapter 4 details the implementation
of the agent, and then targets the third issue by providing an in-depth performance
evaluation of Deep Learning on mobile devices. It describes popular models and
provides a model classification taxonomy. It presents the effects of model versioning
and optimization on performance. Finally, it provides a detailed study of hardware
acceleration for the aforementioned models.
CHAPTER 2: MOTIVATION
Machine Learning execution consists of two phases: training and inference. While all
training runs are performed in datacenters, inference runs are being pushed towards edge
platforms. There are several reasons behind this.
Firstly, mobile devices are among the most popular devices in use today. A
person may or may not have a desktop computer, but will very likely carry a mobile phone.
This implies that mobile devices are the primary target for all service-based technology
companies. For instance, Facebook makes over 90% of its advertising revenue from mobile [7].
And since many of the tasks of mobile applications now involve machine learning, the number
of inferences is increasing with each passing day.
Secondly, running inference on the edge rather than in the datacenter has latency and security
related benefits. For one, it improves application response time, since the application no longer has to contact
an external infrastructure. Also, it reduces users’ network bandwidth needs, which further
improves their experience.
Thirdly, inference on the edge makes certain deep learning services possible, like Instagram
features that involve real-time machine learning at image capture time [7]. To summarize,
machine learning at the edge, especially on mobile devices, is becoming increasingly
omnipresent. This calls for heightened interest in this problem space.
It has thus been established that inference on mobile devices is a problem worth looking at.
On a more in-depth view of the space, however, one can observe many issues that bottleneck the
implementation of efficient inference. The following sections shed some light on these issues.
2.1 LACK OF COMMON INTERFACE
Deep Learning models are compute hungry in nature. So, when one thinks about running
repeated inference of such models on mobile devices, the primary concern is choosing
the best possible hardware backend. This is slightly different from how mobile
developers think about other applications. While most mobile applications might benefit
from a faster hardware backend, they can get away with a less optimized one since the relative
speedup is not going to be substantial. But, for applications deploying Deep Learning
models, the application response time is primarily bottlenecked by model execution. This
makes it imperative to choose the best possible hardware backend.
Mobile device hardware is made up of different Intellectual Property (IP) blocks aggregated on
a single chip, called a System on Chip (SoC). The reason behind an SoC-based processor architecture
is twofold. Firstly, mobile devices run all sorts of applications ranging from audio-visual to
network processing, which implies that they handle both general and special purpose
computations. Secondly, the end of Dennard Scaling and Moore’s Law has driven hardware
developers to incorporate energy-efficient accelerators alongside general purpose Central
Processing Units (CPUs).
Thus, a typical mobile SoC has a high-speed modem for LTE and WiFi connectivity,
video encoders and decoders for video playback and capture, a camera image signal processor
for a high frame rate, low latency camera, and so on [8]. The presence of so many compute
backends on a single chip makes the SoC space very diverse. Also, as of 2019, over
22 billion SoCs had been shipped [9]. The massive diversity amongst the SoCs present in the market
makes programming them a major headache.
From a Deep Learning perspective, there are several candidate hardware backends.
Their general availability makes CPUs the primary candidates. Recent times have seen
Deep Learning inference being done on server/desktop grade Graphics Processing Units
(GPUs), so it is intuitive to do the same on mobile GPUs. Compute Digital Signal
Processors (DSPs) can also be used on mobile SoCs, since they can provide performance
benefits similar to those of mobile GPUs. Also, hardware vendors are now starting to move towards
specialized processing units on mobile SoCs for Deep Learning inference, commonly referred
to as Neural Processing Units (NPUs). Some examples are the Neural Engine in Apple’s A12
Bionic SoC [10] and Huawei’s NPU on the Kirin 970 SoC [11].
Programming any of these candidate hardware backends requires a software infrastructure.
Historically, domain specific software frameworks have been developed to program Machine
Learning architectures, and to train and deploy them. Tensorflow [12], MXNet [13] and Pytorch [14]
are some examples. The space of Deep Learning on mobile devices sees a similar plethora
of software frameworks. They can be broadly divided into two classes: Domain Specific
Compilers and Vendor Specific Application Programming Interfaces (APIs). The first
contains generic interpreters like Tensorflow Lite [15] and Caffe2 [14] that compile model code using
an optimized backend. It also contains heavy compilers like TVM [16] and Glow [17] that compile
applications containing models to platform specific object code. The second consists of SoC
vendor specific APIs like the Qualcomm Neural Processing SDK [18] and Apple’s CoreML [19].
The presence of a multitude of options for software frameworks makes choosing one of them
a non-trivial task. Firstly, each framework has its own coding methodology and programmer
APIs. While some require extensive programming on the developer’s side (like Tensorflow
Lite and the Qualcomm Neural Processing SDK), others essentially require the programmer
to provide only a trained model file and the choice of hardware backend, and the rest is covered
by the framework (like TVM and Glow). Secondly, to the best of our knowledge, there is no
public study demonstrating the difference in performance across these frameworks.
To summarize, the difficulty in choosing an appropriate software framework and the
consequent programmer overhead of incorporating it in one’s application are major roadblocks
in the usage of Deep Learning on mobile devices.
2.2 DEARTH OF BENCHMARKING SYSTEMS
Given that performance is a vital metric when deploying Deep Learning in real world
applications, it is imperative to have standardized benchmark systems to evaluate and compare
different software and hardware systems used.
Historically, there have been many benchmark applications developed for the space of
mobile devices. Geekbench [20] and Antutu [21] are two such popular benchmarks. They
consist of microbenchmarks, which are kernels targeting compute and memory workloads.
They implement fragments of algorithms for various popular mobile applications like web
browser, gaming and so on. As output, they present a weighted score accumulated across all
the tests. While these applications are the norm for general mobile performance estimation,
they are not aimed at Machine Learning, and hence do not present relevant performance
numbers.
Recent times have seen a few Machine Learning focused benchmark suites for mobile
devices. AIBenchmark [22] is one such suite. It has been designed along the lines of the
aforementioned suites like Geekbench, but for Artificial Intelligence tasks. While it is very
informative, it is a closed application, and hence does not give users the liberty to perform
fine-grained testing and analysis.
MLPerf Inference Benchmark [23] is an industry-wide standard ML benchmarking and
evaluation system being developed by a consortium of companies and universities, aimed
specifically at Deep Learning inference. It has a finite set of tasks, with each task
accompanied by one or two models, and participant companies make their best submissions for
the chosen task and model on their hardware. While it seems to be a good initiative, it is a
static benchmark suite driven by the involved parties. This implies that it does not cover
the complete spectrum of inference workloads, so its results might not be relevant for a
given user. Also, it is not aimed specifically at mobile devices; rather, it covers a wide range
of hardware backends.
So, while there are some options for benchmark suites for inference, there is scope for
a system for mobile devices, which enables easy benchmarking, better understanding and is
easily customizable by a developer.
2.3 LACK OF EVALUATION
As mentioned in the previous section, most of the benchmarks accessible to a
generic mobile application developer provide a high-level, crude numerical score. The score
provided by each benchmark is very subjective, since the tasks and the relative weights used
to aggregate the score vary across benchmarks. This makes interpreting performance a cumbersome task. Also,
there is a dearth of publicly available studies that mobile application developers can use as
a reference to understand performance at a fine-grained scope and make informed decisions.
This calls for fine-grained performance studies of Deep Learning in real-world
mobile applications, so as to provide developers with a good reference point.
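The dependence of an aggregate score on the chosen weights can be illustrated with a small Go sketch. All numbers below are hypothetical and do not come from any real benchmark or device; the function name and the two-task setup are assumptions made purely for illustration.

```go
package main

import "fmt"

// weightedScore aggregates per-task scores into one number using relative weights.
// It assumes both slices have the same length and the weights do not sum to zero.
func weightedScore(scores, weights []float64) float64 {
	total, weightSum := 0.0, 0.0
	for i, s := range scores {
		total += s * weights[i]
		weightSum += weights[i]
	}
	return total / weightSum
}

func main() {
	// Hypothetical per-task scores in the order [compute task, memory task].
	deviceA := []float64{90, 50}
	deviceB := []float64{60, 80}

	// Two hypothetical benchmarks that weigh the same tasks differently.
	computeHeavy := []float64{4, 1}
	memoryHeavy := []float64{1, 4}

	fmt.Println("compute-heavy:", weightedScore(deviceA, computeHeavy), weightedScore(deviceB, computeHeavy))
	fmt.Println("memory-heavy: ", weightedScore(deviceA, memoryHeavy), weightedScore(deviceB, memoryHeavy))
}
```

With these made-up inputs the ranking of the two devices flips between the two weightings, which is the kind of ambiguity a single aggregate score can hide.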
CHAPTER 3: IMPLEMENTATION
As described in Chapter 2, there is a plethora of software frameworks that can be used
to integrate Deep Learning models into a mobile application. The sheer number of choices
makes choosing one a cumbersome task, especially if one is concerned with the performance
implications. More importantly, using the chosen framework in a mobile application is
non-trivial. The Application Programming Interfaces (APIs) of these frameworks vary drastically.
This places additional responsibility on the application developer to understand the software
interfaces and use them appropriately.
Mobile devices primarily run two Operating Systems: Android [24] and iOS [25]. Both
of them are very diverse and encompass a wide ecosystem of tools and infrastructure. A
developer has to develop separate applications for these two platforms. And given that the
frameworks might have slightly different interfaces across these platforms, the programmer
overhead of integrating Deep Learning models might increase when trying to support
both platforms. Such a wide spectrum of frameworks can also make
transitioning from one framework to another difficult.
For all the aforementioned reasons, we see the need for a layer of abstraction above
all these frameworks, so as to reduce programmer overhead and hence make integration of
Deep Learning in mobile applications an easier task. This chapter proposes one such
interface, developed as part of the MLModelScope [1] ecosystem. The first section describes
MLModelScope in detail, while the later sections present the MLModelScope mobile
predictor (the proposed interface), and elaborate on how to add a framework to this interface
and on its different modes of usage.
3.1 MLMODELSCOPE
As described in Chapter 2, the Deep Learning landscape is composed of non-uniform model,
software and hardware stacks and evaluation methodologies. Deep Learning
model performance is influenced by the software framework, system library, compiler and
hardware platform used to run the inference (or even train the model), as illustrated in
Fig. 3.1. There is a dearth of consistent tools that make it simple and fair to compare
different Deep Learning innovations and present an evaluation of performance. Dakkak,
Li et al. [1] propose MLModelScope as a solution to this problem.
MLModelScope is a hardware/software agnostic framework to evaluate and profile Deep












































Figure 3.1: Shows the execution of an AI pipeline at different levels of hardware and software
abstraction. (1) An application pipeline (which usually spans multiple machines) employs
a set of models. (2) Each model defines its own pipeline for input and output processing.
(3) A framework executes a model through a network-layer execution pipeline. (4) Layers
executed by a framework are pipelines of library calls. (5) The library calls in turn invoke
a chain of system runtime functions. All the while, the (6) hardware has its own pipeline to
execute instructions. Because of the many levels of abstraction, the HW/SW stack must
work in unison to maintain accuracy, performance, and efficiency (figure from [1]).
The following are some of its salient features.
• It is a consistent evaluation and aggregation framework defining techniques to specify
and provision workflows with hardware/software stacks, abstractions for evaluation and
profiling using different Deep Learning frameworks and an elaborate data consumption
pipeline for experiment outputs.
• It enables profiling of experiments at different abstraction levels as illustrated in Fig. 3.1
throughout the complete AI application.
• It is framework and hardware agnostic, supporting frameworks like Tensorflow [12],
MXNet [13], Pytorch [14] and hardware like X86, IBM PowerPC and FPGAs.
• It provides a command line, web and API interface and can very well be used as a
standalone library.
• It is extensible and customizable, implying that users can add newer models,
frameworks, system libraries or profilers.
 1 name: AlexNet # model name
 2 version: 1.0 # semantic version of model
 3 description: ...
 4 element_type: float32 # datatype used for compute
 5 framework: # framework information
 6   name: MXNet
 7   version: ^1.x # framework version constraint
 8 container: # available containers
 9   amd64:
10     cpu: mlmodelscope/mxnet:1-4-0_amd64-cpu
11     gpu: mlmodelscope/mxnet:1-4-0_amd64-gpu
12   ppc64le:
13     cpu: mlmodelscope/mxnet:1-4-0_ppc64le-cpu
14     gpu: mlmodelscope/mxnet:1-4-0_ppc64le-gpu
15 inputs: # model inputs
16   - type: image # first input modality
17     parameters: # type parameters
18       dimensions: [3, 227, 227]
19       mean: [0, 0, 0]
20       color_mode: RGB
21       layout: NCHW
22 outputs: # model outputs
23   - type: classification # first output modality
24     parameters:
25       url: https://.../synset.txt
26 preprocess: [[code]]
27 postprocess: [[code]]
28 model: # model resources
29   graph_path: repo://mxnet/alexnet.json
30   weights_path: repo://mxnet/alexnet.params
31 attributes: # extra model attributes
32   training_dataset: # dataset used for training
33     - name: ImageNet
34     - version: 1.0.0
Figure 3.2: AlexNet’s model manifest contains all information needed to run the model using
MXNet on CPUs or GPUs (listing from [1]).
MLModelScope is an effective system solution for specifying and running model evaluations. A user
describes an evaluation through a model manifest file, which is carefully designed
to capture all evaluation conditions. Fig. 3.2 is an example manifest file. The model manifest
describes the model metadata (Lines 1–2), framework name and version (Lines 5–7), Docker
containers to use for evaluation (Lines 8–14), model inputs and common pre-processing
steps (Lines 15–21), model outputs and common post-processing steps (Lines 22–25), custom
pre/post-processing operations (Lines 26–27), model resources (Lines 28–30), and attributes
(Lines 31–34). Evidently, it encompasses all the environment parameters required to run an
evaluation.
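To make the grouping of manifest fields concrete, the following Go sketch mirrors a subset of Fig. 3.2 as plain data types. The type, field and function names here are hypothetical and are not MLModelScope's actual Go definitions; the values are copied from the figure, and treating the container section as a top-level map keyed by architecture and device is an assumption of this sketch.

```go
package main

import "fmt"

// Framework holds the framework name and version constraint (Lines 5-7).
type Framework struct {
	Name    string
	Version string
}

// Input describes one input modality and its pre-processing parameters (Lines 15-21).
type Input struct {
	Type       string
	Dimensions []int
	Mean       []float64
	ColorMode  string
	Layout     string
}

// Manifest is a trimmed-down, hypothetical mirror of the manifest in Fig. 3.2.
type Manifest struct {
	Name        string
	Version     string
	ElementType string
	Framework   Framework
	Containers  map[string]map[string]string // architecture -> device -> image (Lines 8-14)
	Inputs      []Input
	GraphPath   string
	WeightsPath string
}

// ContainerFor returns the container image listed for an architecture/device pair.
func (m Manifest) ContainerFor(arch, device string) (string, bool) {
	image, ok := m.Containers[arch][device]
	return image, ok
}

// alexNetManifest fills the structure with the values shown in Fig. 3.2.
func alexNetManifest() Manifest {
	return Manifest{
		Name: "AlexNet", Version: "1.0", ElementType: "float32",
		Framework: Framework{Name: "MXNet", Version: "^1.x"},
		Containers: map[string]map[string]string{
			"amd64":   {"cpu": "mlmodelscope/mxnet:1-4-0_amd64-cpu", "gpu": "mlmodelscope/mxnet:1-4-0_amd64-gpu"},
			"ppc64le": {"cpu": "mlmodelscope/mxnet:1-4-0_ppc64le-cpu", "gpu": "mlmodelscope/mxnet:1-4-0_ppc64le-gpu"},
		},
		Inputs: []Input{{Type: "image", Dimensions: []int{3, 227, 227},
			Mean: []float64{0, 0, 0}, ColorMode: "RGB", Layout: "NCHW"}},
		GraphPath:   "repo://mxnet/alexnet.json",
		WeightsPath: "repo://mxnet/alexnet.params",
	}
}

func main() {
	m := alexNetManifest()
	image, ok := m.ContainerFor("amd64", "gpu")
	fmt.Println(image, ok)
	_, ok = m.ContainerFor("arm64", "cpu") // not listed in the manifest
	fmt.Println(ok)
}
```

A real system would populate such a structure by parsing the manifest file rather than hard-coding it.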
Figure 3.3: Illustrates the internal design of MLModelScope. (1) Framework Agent
(interfacing with framework wrapper) registers its presence to MLModelScope’s distributed
registry. (2) The User inputs a model manifest file with the appropriate model file,
framework constraint, hardware constraint, and pre/post processing parameters, through the web
or command line interface. (3) The user interface contacts remote API handler. (4) and
(5) Remote API handler touches base with Distributed registry and runs the appropriate
framework agent with the specified model. (6) and (7) Evaluation data is collected and
stored in the Evaluation Database (figure from [1]).
Fig. 3.3 illustrates the internal design of MLModelScope as provided by Dakkak, Li
et al. [1]. Refer to the cited paper for further details. For the context of this work, the
most relevant part of the design is the framework Agent and Wrapper. A framework is integrated by writing
a Wrapper around it, which is then referenced in its Agent. In this way, any new
framework can be plugged into the MLModelScope system. The next section presents the
Wrapper interface for mobile devices (referred to as mPredictor in the subsequent sections).
3.2 MOBILE PREDICTOR (MPREDICTOR)
A Deep Learning model deployment can be split up into four parts. Firstly, at the start
of the application, an instance of the framework is created. This requires the model file as a
mandatory argument. Then, for every input, the framework instance is invoked to perform
an inference. The output of the inference is one or more tensors with raw prediction
values. To complete the inference for that input, the raw predictions are translated into
output predictions. Both steps, performing inference and translating raw predictions
into application-parsable output, are performed for every input instance. Lastly, before
the application is closed, the framework instance is deleted.
Figure 3.4: Presents the proposed interface. An inference deployment can be broken up into
four parts. (1) At the start of the mobile application, the framework being used is set
up with the appropriate model and other framework specific options. (2) For a given input,
inference is performed. (3) For the given input, output predictions are produced from the raw
network predictions. Note that (2) and (3) are repeated for every inference instance. (4) At the
end of the application, the framework object is deleted.
The proposed interface maps each step to a generic function call. The first step of creating
and setting up a framework instance is enveloped under the New() function call. The inference is
performed by calling Predict() followed by ReadPredictionOutput(). Finally,
the step of cleaning up the framework instance is performed by calling the Delete() function.
Fig. 3.4 illustrates the aforementioned mapping.
New() takes the model file, the mobile hardware backend to use and the corresponding batch size as
input arguments. It returns a predictor object, which is then retained for the later function calls.
Predict() requires the created predictor object and the data to run inference on as arguments.
ReadPredictionOutput() requires the predictor object and the label file(s) associated with the
model. Finally, Delete() just requires the predictor object to be deleted. The proposed
interface is referred to as a Mobile Predictor (mPredictor).
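The four calls can be sketched as a Go interface together with a stub implementation. This is a minimal sketch under stated assumptions and not the thesis's actual mPredictor code: the exact signatures, the `Backend` and `Prediction` types, the stub predictor (which simply echoes its input as raw scores) and the model file name are all hypothetical, and labels are passed as a slice rather than read from a label file to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// Backend identifies the mobile hardware backend requested by the application.
type Backend string

const (
	CPU Backend = "cpu"
	GPU Backend = "gpu"
)

// Prediction is one application-ready output: a label and its score.
type Prediction struct {
	Label string
	Score float32
}

// MPredictor captures the per-instance calls; New plays the role of the constructor.
type MPredictor interface {
	Predict(input []float32) error
	ReadPredictionOutput(labels []string) ([]Prediction, error)
	Delete() error
}

// stubPredictor stands in for a real framework and keeps the raw predictions privately.
type stubPredictor struct {
	modelPath string
	backend   Backend
	batchSize int
	raw       []float32
}

// New takes the model file, the hardware backend and the batch size.
func New(modelPath string, backend Backend, batchSize int) (MPredictor, error) {
	if modelPath == "" {
		return nil, errors.New("model file is mandatory")
	}
	if batchSize < 1 {
		return nil, errors.New("batch size must be positive")
	}
	return &stubPredictor{modelPath: modelPath, backend: backend, batchSize: batchSize}, nil
}

// Predict would run the framework; the stub copies the input as raw scores.
func (p *stubPredictor) Predict(input []float32) error {
	p.raw = append(p.raw[:0], input...)
	return nil
}

// ReadPredictionOutput pairs each raw score with its label.
func (p *stubPredictor) ReadPredictionOutput(labels []string) ([]Prediction, error) {
	if len(labels) != len(p.raw) {
		return nil, errors.New("label count does not match raw prediction count")
	}
	out := make([]Prediction, len(p.raw))
	for i, s := range p.raw {
		out[i] = Prediction{Label: labels[i], Score: s}
	}
	return out, nil
}

// Delete releases whatever the predictor holds.
func (p *stubPredictor) Delete() error {
	p.raw = nil
	return nil
}

func main() {
	p, err := New("model.tflite", CPU, 1) // hypothetical model file name
	if err != nil {
		panic(err)
	}
	defer p.Delete()

	labels := []string{"cat", "dog"}
	for _, input := range [][]float32{{0.9, 0.1}, {0.2, 0.8}} {
		if err := p.Predict(input); err != nil {
			panic(err)
		}
		preds, err := p.ReadPredictionOutput(labels)
		if err != nil {
			panic(err)
		}
		fmt.Println(preds)
	}
}
```

The `main` function follows the lifecycle described above: one New() at start-up, a Predict() and ReadPredictionOutput() pair per input, and one Delete() at shutdown.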
Note that the mPredictor interface may have other auxiliary function calls as well, but
we argue that these four encompass the minimal set of function calls needed to deploy a
model in a mobile application. The proposed interface provides a uniform interface to the
user unlike frameworks which require the user to dive into the software system to figure out
the usable APIs. This interface conveniently hides framework specific details from the user.
We argue this makes incorporating a predictor straightforward, and improves programmer
productivity.
This interface is derived from the MLModelScope ecosystem. A similar interface is used for
writing a Wrapper (as referenced in Fig. 3.3) for a given software framework. Notably, the
Wrapper interface has been slightly modified to suit the mobile ecosystem. The next section
covers how to add a new framework in accordance with the mPredictor interface.
3.3 CREATE A FRAMEWORK MPREDICTOR
The mPredictor interface specified in the prior section is ultimately a layer of abstraction
aimed at hiding framework specific details from the application developer. This
entails a one-time overhead of integrating framework API calls into this interface. This overhead
would be taken up by a library developer, whose work would be open sourced for a
generic mobile application developer to use.
The mPredictor interface is implemented in the Go language [26] (also referred to as
Golang). Go is an open source programming language that makes it easy to build simple, reliable
and efficient software. It is gradually becoming very popular as it combines the performance
capabilities of languages like C++ with the ease of use of scripting languages like Python. The
choice of this language was primarily driven by its extensive usage in MLModelScope.
Most of the software frameworks for Deep Learning are monolithic C++ libraries with
language bindings in Python, Java, Golang etc. The choice of languages for which bindings
are provided is completely dependent on the framework developers. The
frameworks which are aimed at mobile applications might have language bindings
in Java (Android) and Objective-C/Swift (iOS).
Figure 3.5: Illustrates how to create a framework mPredictor in accordance with the proposed
interface. The left hand side illustrates the API calls to be made to write a Tensorflow Lite [15]
predictor. The right hand side specifies the corresponding Qualcomm SNPE [18] API calls.
This workflow is intended to aid future mPredictor developers. It also emphasizes that the
proposed interface is sufficient to hide the required framework specific details.
Integration with the mPredictor interface requires one to wrap framework C++ API calls
inside mPredictor function calls. For frameworks like TVM [16], which provide language
bindings in Go, one can directly make framework Go API calls from within the mPredictor
interface.
Fig. 3.5 summarizes the process of creating an mPredictor for two popular mobile frameworks:
Tensorflow Lite [15] and Qualcomm SNPE [18]. The New() function for Tensorflow Lite
would require creating a FlatBuffer model (tflite::FlatBufferModel) and passing it to an
Interpreter (tflite::Interpreter). Also, based on the hardware backend chosen by the user, the
Interpreter graph will be modified with the appropriate Delegate (CPU, GPU, NNAPI). For
implementing the New() function for Qualcomm SNPE, one would have to create a DL Container
(zdl::DlContainer::IDlContainer) and the corresponding SNPE builder (zdl::SNPE::SNPE).
The builder would require the specification of the hardware backend through the SNPE DL
Runtime variable (CPU, GPU, DSP).
The Predict() function would require transferring raw input data into the appropriate
input handling data structure - a typed tensor in tflite::Interpreter for Tensorflow Lite, and
zdl::DlSystem::ITensor for Qualcomm SNPE. After that, to perform the actual inference,
one would invoke the Interpreter in the Tensorflow Lite implementation, and
execute the SNPE Builder in the Qualcomm SNPE implementation. The raw predictions
would then have to be stored in a private data structure.
Interestingly, the ReadPredictionOutput() implementation would be independent of the
framework; rather, it depends on the task (image classification, object detection etc) for
which the predictor is being built. The Delete() function would delete the Interpreter in the
case of Tensorflow Lite and the SNPE Builder in the case of Qualcomm SNPE.
For the two frameworks chosen, one can observe that different framework specific
structures have to be created and used, but at a higher level, they accomplish the same task.
Tensorflow Lite requires its user to create an Interpreter while SNPE requires a Builder. Both of
them are effectively software runtimes which run the model graph with the available backend
libraries. One may also note that converting raw predictions into application ready
output is independent of the framework choice once the raw predictions have been fetched
from the runtime, which is why the proposed interface separates the Predict() and
ReadPredictionOutput() functions. This separation allows a much more accurate performance
analysis, as will be evident in a later chapter.
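The four calls discussed above can be sketched as a small abstract class. The sketch below is written in Python purely for brevity (the actual interface is written in Go); the snake-case method names, ToyRuntime and ToyPredictor are hypothetical stand-ins and do not correspond to any real framework API. It only illustrates how the framework specific runtime stays hidden behind New(), Predict(), ReadPredictionOutput() and Delete().

```python
from abc import ABC, abstractmethod


class MPredictor(ABC):
    """Illustrative rendering of the mPredictor calls; the constructor plays the role of New()."""

    @abstractmethod
    def predict(self, raw_input):
        """Copy input into the runtime, run inference, keep raw predictions privately."""

    @abstractmethod
    def read_prediction_output(self, labels, top_k=1):
        """Turn raw predictions into task specific, application ready output."""

    @abstractmethod
    def delete(self):
        """Release the framework runtime."""


class ToyRuntime:
    """Hypothetical stand-in for a framework runtime (an Interpreter or a Builder)."""

    def __init__(self, weights):
        self.weights = weights

    def run(self, tensor):
        # One score per class: dot product of the input with that class's weights.
        return [sum(w * x for w, x in zip(row, tensor)) for row in self.weights]


class ToyPredictor(MPredictor):
    def __init__(self, weights):                      # role of New()
        self._runtime = ToyRuntime(weights)
        self._raw = None

    def predict(self, raw_input):
        tensor = [float(x) for x in raw_input]        # input handling data structure
        self._raw = self._runtime.run(tensor)         # raw predictions stay private

    def read_prediction_output(self, labels, top_k=1):
        ranked = sorted(zip(self._raw, labels), reverse=True)
        return [(label, score) for score, label in ranked[:top_k]]

    def delete(self):
        self._runtime = None
        self._raw = None


predictor = ToyPredictor([[1.0, 0.0], [0.0, 1.0]])
predictor.predict([0.2, 0.9])
print(predictor.read_prediction_output(["cat", "dog"], top_k=1))
predictor.delete()
```

Only read_prediction_output() knows about the task (here, ranking class scores), which mirrors the separation between inference and postprocessing argued for above.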
The aforementioned examples assert that the mPredictor interface is general enough so
that it can be used with any framework. The next section elaborates on how to use a
framework mPredictor.
3.4 USE A FRAMEWORK MPREDICTOR
The previous section elaborated on how to create a framework mPredictor. Now to use it
in an application, the user would have to dynamically link it to the framework. This would
require either building the framework from source or fetching already built shared library
for the appropriate hardware backend. For instance, Tensorflow Lite can be easily built for
Android and iOS backends using its configure file and Bazel [27]. Qualcomm SNPE provides
pre-built binaries for its target architectures. Once linked, one would have to compile the
mPredictor using Gomobile [28].
Gomobile is a project developed by Go developers to enable easy usage of Golang
applications on mobile devices. Its tooling compiles a Go package for the mobile target and
generates language bindings for it in Java or Objective-C. The Java bindings can be used in
Android applications and the Objective-C bindings in iOS applications (directly for
Objective-C based applications and indirectly for Swift applications). Usage of this tool
makes the proposed mPredictor interface compatible with both Android and iOS mobile
applications, which is another of its salient features.
Gomobile generates an Android Archive file (.aar) for usage in Android, and a Framework
file (.framework) for iOS applications. To incorporate the predictor, the user would have to
link the aforementioned file (.aar for Android and .framework for iOS) and framework shared
libraries to the base mobile applications. This would allow the user to use the framework
mPredictor in his/her application.
Note that one can choose to have multiple such mPredictors of different frameworks in the
same application and choose between them seamlessly as per requirement. One can even
choose to have different mPredictors of the same framework but with different models, and
effectively choose between the models based on inference performance. Interestingly, since
the hardware backend chosen might also be an optimization point, one can choose to deploy
different mPredictors of the same framework but with different hardware backends and
select the best one based on an initial sample run. Apart from reducing
programmer overhead in incorporating Deep Learning models in mobile applications, the
uniform interface of mPredictor lets users experiment with different
configurations by simply changing an argument or creating a separate predictor, rather than having
to understand framework specific details to do the same. The hope is to have as many such
framework mPredictors as possible in the public domain for any mobile developer to use.
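Selecting a predictor from an initial sample run, as suggested above, could look roughly like the following Python sketch. The function names, the candidate labels and the injectable clock argument are assumptions made for illustration; each candidate is simply any callable that runs one inference.

```python
import time


def mean_latency_ms(predict, sample, runs=5, clock=time.perf_counter):
    """Average wall-clock time of predict(sample) over `runs` calls, in milliseconds."""
    start = clock()
    for _ in range(runs):
        predict(sample)
    return (clock() - start) * 1000.0 / runs


def pick_fastest(candidates, sample, runs=5, clock=time.perf_counter):
    """Return (name, latency_ms) of the candidate with the lowest mean latency."""
    timings = {name: mean_latency_ms(fn, sample, runs, clock)
               for name, fn in candidates.items()}
    best = min(timings, key=timings.get)
    return best, timings[best]


# Hypothetical candidates standing in for predictors on different hardware backends.
candidates = {
    "cpu_1_thread": lambda x: sum(v * v for v in x * 50),
    "cpu_4_threads": lambda x: sum(v * v for v in x * 10),
}
name, latency = pick_fastest(candidates, [0.1] * 1000)
print(f"selected {name} at {latency:.3f} ms per inference")
```

The clock is passed in as a parameter so that the selection logic can be checked with a deterministic fake clock instead of real timings.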
CHAPTER 4: EVALUATION
Chapter 3 proposed a framework agnostic interface, MLModelScope mPredictor, in order
to alleviate programmer burden in incorporating Deep Learning in mobile applications.
Mobile developers would no longer have to climb the steep learning curve of Deep Learning
frameworks. As described in Chapter 2, Deep Learning models are compute heavy. This
implies that the runtime performance of such mobile applications would be largely influenced
by their Deep Learning part. As explained at length by Dakkak and Cheng et al. [1], Deep
Learning runtime performance is dictated by the choice of model, software framework and
hardware backend used.
(a) Google Pixel 2 (b) Samsung Galaxy S9
Figure 4.1: MLModelScope Mobile Agent running on two popular mobile devices
Now, it is unfair to expect a generic mobile application developer to have sufficient
expertise in Deep Learning to make the most performant choice across the stack. This calls for
the development of a benchmarking platform for such developers. The platform should be
accessible to everyone, and should provide enough knobs for the users to empirically determine
the Deep Learning related bottlenecks in their application. More importantly, it calls for
in-depth, across the stack evaluation and analysis of the different performance
influencing components in the stack. As described in Chapter 2, there is a dearth of such
publicly available studies which generic mobile developers can use as a performance
guide to make well informed decisions.
Figure 4.2: Choose Framework
This chapter first describes our benchmarking platform, MLModelScope Mobile Agent.
Then it explores different questions which can potentially affect performance, providing
evaluations to support the claims.
4.1 MLMODELSCOPE MOBILE AGENT
The agent deployed is in fact a mobile application. Fig. 4.1 illustrates the landing page
of the agent running on a Google Pixel 2 and a Samsung Galaxy S9. The agent
is currently aimed at vision based applications, with all the evaluations provided for the image
classification task. The application uses the rear camera of the mobile device to generate
static images which are then passed as inputs to the mPredictor(s) deployed in the
application. The purpose of this agent is to allow a user to evaluate different parameters across the
stack which influence performance. To enable such testing, the agent provides an interface
for the user to choose an option in each category and consequently evaluate performance.
Figure 4.3: Choose Model
The first option is to choose the framework, as illustrated in Fig. 4.2. The frameworks are
integrated by writing their respective mPredictors as mentioned in Chapter 3. At the time
of writing, the agent has the Tensorflow Lite mPredictor deployed, and the addition of other
frameworks is in progress.
The second option is to choose models deployed as part of the agent as illustrated in
Fig. 4.3. Most of the common models are in-built into the application to enable easy bench-
marking, but since the agent application is open sourced, the user can add custom models
as well.
The third option is to choose the hardware backend available on the underlying SoC, as
illustrated in Fig. 4.4. Based on the mPredictors deployed, the user can potentially run evaluations
on CPUs with different numbers of threads and on accelerators like the Graphics Processing Unit
(GPU), Digital Signal Processor (DSP) and so on. The caveat here is that not all hardware
backend modes may be supported on a given device. For
instance, Tensorflow Lite requires at least OpenGL ES 3.1 to support GPU execution, while it
does not support DSP execution directly.
Figure 4.4: Choose Hardware
The fourth option is to choose datatype of the model. As will be explored in a later
section, quantization is a very popular model optimization technique. The agent supports
Float and Int as datatypes in which the model weights/activations can be stored and run.
Figure 4.5: Choose Datatype
Once all the options are chosen, the user can press the Predict button to run a live
inference on the camera feed observed by the device. Fig. 4.6 illustrates the evaluation page.
The output contains the input parameters chosen, Top 5 predictions on the camera feed
and corresponding performance metrics. The agent provides two metrics for evaluation -
Inference Latency in milliseconds and Throughput in number of inferences per second. Both
metrics are generated by running a fixed number of inferences (default = 1000) and
averaging the measured timings. Notably, it also breaks up the inference workflow and
provides individual timings to give a detailed view of the execution. The quantities provided,
all in milliseconds, are Model Loading, Model Coldstart, Data Preprocessing,
Model Computation and Data Postprocessing.
We take a closer look at this breakdown in a later section.
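The two headline metrics can be derived from a single timed loop. The following Python sketch assumes a generic predict callable and an injectable clock (both hypothetical, not taken from the agent's source) and shows one way mean latency and throughput could be computed from a fixed number of inferences.

```python
import time


def benchmark(predict, sample, runs=1000, clock=time.perf_counter):
    """Run predict(sample) `runs` times; return (mean latency in ms, inferences per second)."""
    start = clock()
    for _ in range(runs):
        predict(sample)
    elapsed = clock() - start                     # total seconds for all runs
    latency_ms = elapsed * 1000.0 / runs
    throughput = runs / elapsed if elapsed > 0 else float("inf")
    return latency_ms, throughput


# Hypothetical workload standing in for one inference call.
latency_ms, throughput = benchmark(lambda x: sum(v * v for v in x), [0.5] * 2000, runs=100)
print(f"latency = {latency_ms:.3f} ms, throughput = {throughput:.1f} inferences/s")
```

When inferences are run back to back on a single predictor, throughput measured this way is simply 1000 divided by the mean latency in milliseconds.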
The agent will also be deployed through MLModelScope web interface going forward to
enable a single touch evaluation on the available devices. This would require writing a server
framework predictor as illustrated in Fig. 3.3 which would be able to communicate with the
deployed devices through a networking service.
This section described our evaluation setup. The remaining sections in this chapter
provide a detailed evaluation and analysis of different performance influencing aspects of
Deep Learning on mobile devices.
Figure 4.6: Run Evaluation
4.2 MODEL CLASSIFICATION
The very first step that an application developer has to take when deploying Deep
Learning is choosing the most appropriate model. While the accuracy provided by the model
is generally considered to be of primary importance, other factors like inference latency,
throughput and model size also become relevant since mobile devices are resource constrained,
low power devices. The model performance in terms of accuracy is usually publicly
available, published by the model developers. These numbers (Top-1 and Top-5 percent)
should provide the user an initial ballpark to work with, though actual performance may
still vary. The user can then use the MLModelScope mobile agent to generate runtime
performance numbers like inference latency, throughput and model size. In this section, we
will look at six popular models, understand the reasoning behind their construction
and evaluate their performance.
4.2.1 Densenet
Figure 4.7: A 5-layer dense block (figure from [29])
Dense Convolutional Network (Densenet) was proposed by Huang et al. [29] in 2016. It is
based on the notion that a model architecture can be made more effective by connecting each
layer to every other layer in the network. This implies that while a conventional network of
L layers would have L connections, a dense network would contain L(L + 1)/2 connections.
Fig. 4.7 illustrates an instance of such a dense block.
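The L(L + 1)/2 count follows from the fact that, inside a dense block, layer i receives the block input plus the outputs of all i - 1 earlier layers. A minimal Python check of this arithmetic (the function name is illustrative only):

```python
def dense_connections(num_layers):
    """Count connections in a dense block: layer i (1-indexed) has i incoming connections."""
    return sum(i for i in range(1, num_layers + 1))


layers = 5  # the 5-layer dense block of Fig. 4.7
print(dense_connections(layers), layers * (layers + 1) // 2)
```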
Fig. 4.8 presents a peek into the actual architecture of the Densenet model. The first half
of the figure is a sequence of mul, add and convolution operators. This represents one layer
of the model. So, in Fig. 4.8, one can observe two consecutive layers which are connected
to the rest of the layers through Concatenation operators. Notably, this network is very deep,
implying that it contains many such dense blocks.
Figure 4.8: A snapshot of Densenet network architecture as viewed on Netron [30]
4.2.2 Inception
Figure 4.9: An Inception module (figure from [31])
Inception is a family of networks originally proposed by Szegedy et al. [31]. An instance
of an Inception network is made up of multiple inception modules. The fundamental idea
behind an inception module is to factor a large convolution into convolutions of different
sizes (usually smaller) and types and then to stack all the outputs together, to be passed as
input to the next inception module. Fig. 4.9 illustrates an inception module.
Figure 4.10: A snapshot of Inception-v3 network architecture as viewed on Netron [30]
Fig. 4.10 presents a peek into the actual architecture of the Inception-v3 model (more on
model versioning in the next section). There are two inception modules linked together
through a Concatenation operator. One can observe multiple branches within each module
instance. Each branch may comprise one or more
1x1, 3x3 or 5x5 convolution operators.
4.2.3 Resnet
Resnet is a family of networks originally proposed by He et al. [32]. An instance of a Resnet
network is composed of many residual blocks. These residual blocks have identity shortcut
connections, where the idea is to feed the output of an earlier layer (not the one immediately
before the current one) as an additional input to the current layer. Fig. 4.11 illustrates a
residual block.
Fig. 4.12 presents a peek into the actual architecture of the Resnet v2 101 model (more on
model versioning in the next section). There are two residual blocks. One can observe two
branches within each block, where one consists of multiple successive convolution operators
while the other skips everything and goes as accumulated input to the next block.
Figure 4.11: A Residual block (figure from [32])
Figure 4.12: A snapshot of Resnet v2 101 network architecture as viewed on Netron [30]
4.2.4 Mobilenet
(a) Conventional convolution (b) Depthwise convolution (c) Pointwise convolution
Figure 4.13: Mobilenet building block (figure from [33])
Mobilenet is a family of networks proposed by Howard et al. [33] and custom built for
mobile and embedded vision applications. This family replaces conventional convolution
layers with a depthwise convolution layer followed by a pointwise convolution layer. Fig. 4.13
illustrates the transformation, where Fig. 4.13a is replaced by a combination of Fig. 4.13b and
Fig. 4.13c. Notably depthwise convolution works per channel and then pointwise convolution
aggregates across channels.
Figure 4.14: A snapshot of Mobilenet-v1 network architecture as viewed on Netron [30]
In Fig. 4.14, one can observe two successive layers in Mobilenet. This architecture requires
considerably less compute than networks with conventional convolution, which
is an important factor in resource constrained environments like mobile devices.
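The saving can be made concrete by counting multiply-accumulate operations (MACs) with the cost model from the Mobilenet paper [33]. In the Python sketch below, the layer shape (3x3 kernel, 32 input channels, 64 output channels, 112x112 feature map) is a hypothetical example chosen only for illustration, and biases, strides and padding effects are ignored.

```python
def conv_macs(kernel, in_ch, out_ch, feature):
    """MACs of a conventional kernel x kernel convolution over a feature x feature map."""
    return kernel * kernel * in_ch * out_ch * feature * feature


def separable_macs(kernel, in_ch, out_ch, feature):
    """MACs of a depthwise convolution followed by a pointwise (1x1) convolution."""
    depthwise = kernel * kernel * in_ch * feature * feature   # one filter per input channel
    pointwise = in_ch * out_ch * feature * feature            # aggregates across channels
    return depthwise + pointwise


k, m, n, f = 3, 32, 64, 112   # hypothetical layer shape
ratio = separable_macs(k, m, n, f) / conv_macs(k, m, n, f)
print(f"separable / conventional MACs = {ratio:.3f}")   # equals 1/n + 1/k**2
```

For 3x3 kernels the ratio is dominated by the 1/k**2 term, so the separable form needs roughly eight to nine times fewer MACs than the conventional layer.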
4.2.5 Squeezenet
Figure 4.15: A Fire module (figure from [34])
Squeezenet is a family of networks proposed by Iandola et al. [34]. It is aimed at resource
constrained environments, especially autonomous driving vehicles. It is made up of multiple
Fire modules, where one Fire module consists of two sublayers - (1) squeeze (1x1 filters) and
(2) expand (1x1 and 3x3 filters), as illustrated in Fig. 4.15.
Figure 4.16: A snapshot of Squeezenet network architecture as viewed on Netron [30]
As part of Fig. 4.16, one can see two consecutive fire modules, connected through a
concatenation operator. It is trivial to distinguish the squeeze and expand sublayers (former
contains only 1x1 filters while latter both 1x1 and 3x3).
Figure 4.17: Overview of Mobile NASNet search (figure from [35])
4.2.6 Mnasnet
It is interesting to note that all the above mentioned families of networks are composed of
instances of their respective building blocks (along with usual layers like Pooling, BatchNorm
etc). So, the effectiveness of a given network can be crudely determined by its building block.
Researchers at Google recognized this pattern and understood that finding the right building
block (or combination of multiple ones) is the key to model performance. Tan et al. [35]
proposed using automated Neural Architecture Search (NAS) to optimize model architecture
to achieve a desirable trade off between accuracy and performance.
Fig. 4.17 presents a high level overview of their methodology. The algorithm aims to
build custom blocks by searching for the most performant convolution operator, filter size,
squeeze and expand size, presence of skip operations and so on. It uses inference latency on
mobile devices as one of the inputs. Fig. 4.18 depicts one such automatically determined
architecture. One can observe that the network has skip connections like residual blocks and
depthwise followed by pointwise convolution layers like Mobilenet.
4.2.7 Classifying Models
The past subsections have presented a glimpse into six popular vision models. One can
observe that each one of them has a characteristic building block. Densenet is made up of
dense blocks, Inception network of inception module and Resnet of residual block. Mobilenet
is composed of depthwise and pointwise convolutions while Squeezenet of fire modules.
Now, to aid developers in making a choice amongst such models, we claim that such
models can be crudely classified into two groups - Heavy and Light. The basis of this
classification is threefold.
The first factor is model architecture complexity. This can be quantified by the building
block used, number of layers, number of parameters and number of multiply-accumulate
operations (MACs). The second factor is inference latency on a given device, while the third
one is storage size.
Figure 4.18: A snapshot of Mnasnet 0.5 224 network architecture as viewed on Netron [30]
So, Heavy models would have high architecture complexity, high inference latency and
could potentially consume more storage space on the device (which would also lead to higher
bandwidth demands when using hardware acceleration). Intuitively, higher architecture
complexity should, in most cases, imply higher inference latency and storage. This is the
reason why we report inference latency and storage sizes to classify the aforementioned
models.
Fig. 4.19 presents the inference latency of all the six models run as per the stated
evaluation configuration. One can observe that Densenet, Inception network and Resnet have
substantially higher latency than the other three models. Fig. 4.20a reports
the size of these models. Clearly, the aforementioned three models consume more space.
Assuming these two parameters as proxies for architecture complexity, we classify Densenet,
Inception and Resnet as Heavy models, and Mobilenet, Squeezenet and Mnasnet as Light
models. Notably, Light models are developed for resource constrained environments, like
mobile devices.
Figure 4.19: Inference latency (milliseconds) of one instance each of the family of networks
described above. Evaluation configuration: (1) Framework: Tensorflow Lite (2) Hardware:
4 CPU threads (3) Datatype: Float (4) Device: Google Pixel 2
Figure 4.20: Model storage size and Top 5 accuracy as reported by model publishers.
(a) Model storage size (b) Reported model accuracy
Now, intuitively, one would always look to choose a model with minimum latency and
storage. But there is a trade off that one might have to accept when making such a
choice. Fig. 4.20b presents the reported Top 5 accuracy of each of these models. Visibly,
there is a drop in accuracy of Light models as compared to Heavy ones. Note that these
accuracy numbers are usually reported on a validation set, which may or may not be
representative of actual performance.
There is a trade off when choosing between Heavy and Light models. While the latter would
most definitely provide latency and storage gains, it may lead to accuracy degradation. The
acceptable extent of degradation is the user’s prerogative.
4.3 MODEL VERSIONING
The previous section presented six popular models, but as one might have noticed in
Fig. 4.19, some of the model names had auxiliary information attached. This leads to the
second consideration a user has to make when choosing a model for a task - model
versioning. The usual cycle of development of a model begins with the proposal of an architecture
(defined by a new building block). As the model starts getting used, the model developers
or practitioners augment the proposed architecture in one way or another. These
augmentations could be minor tweaks or fundamental changes. This leads to the development of
different versions of a model architecture. This section dives into one Heavy and one Light
model to explore the implications of model versioning.
4.3.1 Inception
As mentioned in section 4.2.7, Inception network is a Heavy model. At the time of
writing, there are five versions of the model. For simplicity, we group the first three versions
into inception v3. The other two are inception v4 and inception resnet v2.
With inception v3, the authors started using smart factorization methods to reduce
computational complexity, which involved replacing nxn filters with a combination of 1xn and
nx1 filters. In this way, they factorized 7x7 and 5x5 filters into smaller filters to reduce
runtime overhead. inception v4 saw the introduction of reduction blocks. The authors also
modified the stem of the network, which refers to the operations done prior to introducing
inception modules. inception resnet v2 introduced residual connections
to the inception module, like Resnet. This led to the usage of a hybrid building block in the
network.
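The benefit of replacing an nxn filter with a 1xn filter followed by an nx1 filter can be seen from a simple weight count. The Python sketch below counts weights for a single input and output channel only, ignoring channel counts and biases, so it is a simplification rather than an exact model of any Inception layer.

```python
def full_filter_weights(n):
    """Weights in one n x n filter (single input and output channel)."""
    return n * n


def factorized_filter_weights(n):
    """Weights in a 1 x n filter followed by an n x 1 filter."""
    return n + n


for n in (5, 7):
    full, fact = full_filter_weights(n), factorized_filter_weights(n)
    print(f"{n}x{n}: {full} weights -> {fact} weights ({full / fact:.1f}x fewer)")
```

The saving grows linearly with n (a factor of n/2), which is why the factorization matters most for the larger filters.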
Fig. 4.21 illustrates the difference in runtime of the three versions. Evidently, changes
Figure 4.21: Inference latency (milliseconds) of three versions of Inception network.
Evaluation configuration: (1) Framework: Tensorflow Lite (2) Hardware: 4 CPU threads (3)
Datatype: Float (4) Device: Google Pixel 2
versions of Heavy models are usually developed to either improve accuracy (potentially at
the cost of latency) or reduce computation (keeping accuracy levels the same). This implies
that choosing a newer version of a Heavy model may not imply improvement in latency.
For Heavy models, choosing newer versions may or may not improve inference latency.
4.3.2 Mobilenet
Fig. 4.19 presented latency for mobilenet v1 0.25 128. For the class of mobilenet models,
there can be three parts to a given model version.
The first part signifies the architecture version number, similar to the number seen in inception
networks. At the time of writing, there are two model versions - mobilenet v1 and
mobilenet v2, where the latter transforms the depthwise separable convolution block into a
bottleneck residual block. Fig. 4.22a illustrates the runtime difference between the two versions.
The newer version improves latency.
The second part signifies the width multiplier (referred to as depth multiplier in some literature).
This hyper parameter controls the number of channels in each layer of the network. The
default value of the width multiplier is 1. A value of 0.5 would imply halving the number of
channels in each layer, which should reduce the number of parameters and computation
Figure 4.22: Inference latency (milliseconds) of different versions of Mobilenet network.
Evaluation configuration: (1) Framework: Tensorflow Lite (2) Hardware: 4 CPU threads (3)
Datatype: Float (4) Device: Google Pixel 2
when varying width multiplier. As expected, the higher the width multiplier, the larger the
number of channels per layer and the larger the number of computations, leading to higher
inference latency.
Some versions may have a third part, the input size. A difference between the input size and the
size of the input feature map might add a marginal overhead of resizing the input feature map
during inference. Fig. 4.22c depicts mobilenet models with different input sizes. Evidently,
having a smaller input size reduces the number of computations in the network. Note that
since our input image is 224x224, for all cases except the last one, there might be a minor
overhead to resize the input.
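The effect of both hyper parameters on compute can be estimated with the cost model from the Mobilenet paper [33], where a width multiplier alpha scales the channel counts and a resolution multiplier rho scales the feature map side. The Python sketch below applies it to a single depthwise separable layer; the layer shape is a hypothetical example, and mapping an input size of 128 to rho = 128/224 is an assumption made for illustration.

```python
def scaled_separable_macs(kernel, in_ch, out_ch, feature, alpha=1.0, rho=1.0):
    """MACs of one depthwise separable layer under width (alpha) and resolution (rho) scaling."""
    m = round(alpha * in_ch)      # thinned input channels
    n = round(alpha * out_ch)     # thinned output channels
    f = round(rho * feature)      # reduced feature map side
    depthwise = kernel * kernel * m * f * f
    pointwise = m * n * f * f
    return depthwise + pointwise


base = scaled_separable_macs(3, 32, 64, 112)
half_width = scaled_separable_macs(3, 32, 64, 112, alpha=0.5)
small_input = scaled_separable_macs(3, 32, 64, 112, rho=128 / 224)
print(f"alpha = 0.5     -> {half_width / base:.2f} of baseline MACs")
print(f"rho   = 128/224 -> {small_input / base:.2f} of baseline MACs")
```

Because the pointwise term contains both channel counts, compute falls roughly with alpha squared, while input size reduces compute by exactly rho squared in this model.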
To summarize, for Light models, especially the ones developed for resource constrained
devices, there might be a set of hyper parameters to tune, which may allow the user to play
with the trade off of latency versus accuracy.
For Light models, tuning available hyper parameters may lead to significant gains in la-
tency, possibly at the cost of accuracy.
4.4 MODEL OPTIMIZATION
Once the user has narrowed down the model type (Heavy/Light), architecture (network)
and version, the model can be further optimized for inference. There are many optimization
techniques that developers use. One can reduce parameter count with Pruning and Structure
Pruning [36] [37], reduce precision of weights and/or activations using Quantization [38] [39]
or update the architecture with techniques like Distillation [40]. Some of these techniques are
used when training the model, while some are used after training. This section will focus only
on Post-Training Quantization and evaluate the performance difference to motivate the need
for such optimizations for application developers. Note that this section in no way covers the
complete spectrum of techniques, but just illustrates the benefit of the most common one.
Figure 4.23: Performance comparison of floating point versus integer quantized models.
Evaluation configuration: (1) Framework: Tensorflow Lite (2) Hardware: 4 CPU threads (3)
Datatype: Float and Int (4) Device: Google Pixel 2
(b) Top-5 accuracy of float and integer versions of four models
Quantization is a technique to reduce model size and improve inference latency by
converting the weights and/or activations from a higher precision to a lower precision
representation. Models, by default, are trained in the float32 data type. So, they can be
converted to float16, int8, int4 or int1 (as in some accelerators like discrete GPUs). Apart
from the data type chosen, one can decide to quantize one or both of the weights and
activations. In the case of weight only quantization, the weights are converted back from int8
to floating point numbers at inference time to employ float kernels. The more optimized way
is full integer quantization, where both weights and activations are converted to int8 (may
be lower for some accelerators but not for mobile devices). Logically, this should provide
large gains in storage and inference latency at the cost of a variable drop in accuracy. We
consider one Heavy and one Light model in Fig. 4.23 to explore the quantitative difference
in performance.
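To make the float32 to int8 conversion concrete, the following Python sketch implements a generic affine (scale and zero point) quantization scheme. It is a simplified illustration under that assumption and is not claimed to match the exact procedure used by Tensorflow Lite; the weight values are made up.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using a per-tensor scale and zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)   # range must contain 0.0
    scale = (hi - lo) / (qmax - qmin) or 1.0                 # avoid zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values, as done for weight only quantization."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]          # hypothetical float32 weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale = {scale:.5f}", f"max error = {max_error:.5f}")
```

Each int8 value occupies one byte instead of the four bytes of a float32, which is where the storage gain comes from; the rounding error visible in max_error is the source of the accuracy drop.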
Fig. 4.23a illustrates the difference in inference latency of two versions each of Inception
and Mobilenet networks. The length of the orange colored bar depicts the gain in latency.
Notably, Heavy models (first two bars) show considerably more gain than Light models (last
two bars). Also, through Fig. 4.23b, one can observe that the drop in accuracy is almost
unnoticeable (of the order of 2%).
Due to the large amount of compute within them, Heavy models might present a more
substantial gain in latency through full integer quantization when compared to Light models.
But, they may still be much slower than the latter.
4.5 MODEL INFERENCE
Chapter 3 established that there are four major steps when deploying a model
in a mobile application (or any application for that matter).
Table 4.1: Break up of inference latency of three Heavy and three Light models. All numbers
are in milliseconds (ms) and have been rounded off to the nearest integer.
Model                    Latency   Loading   Preprocessing   Compute   Postprocessing
densenet                     404        29               6       396                2
inception v3                 453         8               7       443                3
resnet v2 101               1125         2               7      1115                3
mobilenet v1 0.25 128         20         1               8         9                3
squeezenet                    90         1               7        80                3
mnasnet 0.5 224               37         1               7        28                2
As illustrated in Fig. 3.4, the first step is setting up a framework instance (equivalent to
calling New() in the mPredictor interface). The time spent on it can be labelled as Model
Loading time. The next step is performing inference (equivalent to calling Predict() in the
mPredictor interface). Before making the inference call, the application might have to
preprocess the input data. The time spent on preprocessing can be labelled as Data
Preprocessing time, and the time spent on the inference call itself as Model Compute time.
The final step, getting application parse-able output predictions, is equivalent to
calling ReadPredictionOutput() in the mPredictor interface. This can be categorized as Data
Postprocessing time. In this way, every model deployment can be evaluated by breaking up
model inference into four parts - Model Loading, Data Preprocessing, Model Compute and
Data Postprocessing, where Model Loading is a one time cost while the other three
occur once for every inference run.
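One way to collect such a four-part breakdown is to accumulate wall-clock time per labelled phase. The Python sketch below uses a hypothetical PhaseTimer helper with an injectable clock; the work inside each phase is a trivial stand-in for the corresponding mPredictor call, not real inference.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates elapsed milliseconds per named phase."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.totals_ms = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            self.totals_ms[name] += (self._clock() - start) * 1000.0


timer = PhaseTimer()
with timer.phase("Model Loading"):                 # one time cost, stand-in for New()
    model = {"bias": 1.0}
for sample in ([10, 20], [30, 40]):                # per-inference phases
    with timer.phase("Data Preprocessing"):
        tensor = [x / 255.0 for x in sample]
    with timer.phase("Model Compute"):             # stand-in for Predict()
        raw = [x + model["bias"] for x in tensor]
    with timer.phase("Data Postprocessing"):       # stand-in for ReadPredictionOutput()
        best = max(raw)
for name, total in timer.totals_ms.items():
    print(f"{name}: {total:.4f} ms")
```

Dividing the three per-inference totals by the number of runs and adding them up gives the inference latency, with Model Loading reported separately.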
Table 4.1 presents the inference latency of three Heavy and three Light models (same as
Figure 4.24: Break up of inference latency (milliseconds) of three Heavy models. Legend:
Model Loading, Data Preprocessing, Model Compute, Data Postprocessing. Evaluation
configuration: (1) Framework: Tensorflow Lite (2) Hardware: 4 CPU threads (3) Datatype:
Float (4) Device: Google Pixel 2
loading the same model. The other three have been averaged over 100 consecutive evaluation
runs and then added up to provide inference latency. Fig. 4.24 illustrates the break up
graphically for Heavy models while Fig. 4.25 does the same for Light models. The actual
numbers have been intentionally omitted from the plots for cleaner viewing.
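The rounded numbers in Table 4.1 can be used to check how the latency column is composed and how much of it is Model Compute. The Python sketch below copies the table rows verbatim; the helper name is illustrative only.

```python
# (latency, loading, preprocessing, compute, postprocessing) in ms, copied from Table 4.1
TABLE_4_1 = {
    "densenet": (404, 29, 6, 396, 2),
    "inception v3": (453, 8, 7, 443, 3),
    "resnet v2 101": (1125, 2, 7, 1115, 3),
    "mobilenet v1 0.25 128": (20, 1, 8, 9, 3),
    "squeezenet": (90, 1, 7, 80, 3),
    "mnasnet 0.5 224": (37, 1, 7, 28, 2),
}


def compute_share(row):
    """Fraction of inference latency spent in Model Compute."""
    latency, _loading, _pre, compute, _post = row
    return compute / latency


for model, row in TABLE_4_1.items():
    latency, _loading, pre, compute, post = row
    # Latency is the sum of the three per-inference parts; Model Loading is excluded.
    assert latency == pre + compute + post
    print(f"{model:22s} compute share = {compute_share(row):.0%}")
```

For every row the latency equals preprocessing plus compute plus postprocessing. For mobilenet v1 0.25 128, only 9 of the 20 ms are spent in Model Compute, whereas for densenet it is 396 of 404 ms.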
For Heavy models, Model Loading, Data Preprocessing and Data Postprocessing contribute
negligibly to overall inference latency. This can be inferred from the graph since red, skin
and dark green colors are hardly visible. Notably, Model Loading for densenet is relatively
Figure 4.25: Break up of inference latency (milliseconds) of three Light models. Legend:
Model Loading, Data Preprocessing, Model Compute, Data Postprocessing. Evaluation
configuration: (1) Framework: Tensorflow Lite (2) Hardware: 4 CPU threads (3) Datatype:
Float (4) Device: Google Pixel 2
For Light models, Data Preprocessing and Data Postprocessing become reasonably
important, even of the same order as Model Compute for very light models like mobilenet and
mnasnet. Interestingly, Model Loading is very similar for all Light models. Note that
numerical values have been rounded off to the nearest integer for better viewing, which implies
that there might be a small difference in reality.
In general, Model Loading is relatively unimportant in terms of performance implications,
especially because it is a one time cost endured every time the application is started. It is
framework dependent, since internally it involves parsing the model file which is specific to
the implementation of the framework.
Data Preprocessing and Data Postprocessing are dependent on the Deep Learning task
being accomplished through the deployment. They involve transferring application data into
an mPredictor compatible data structure, normalizing data and so on. The values in this case
are specific to the image classification task. Lastly, Model Compute is representative of the
model used.
Breaking up model inference into aforementioned components might be more important for
Light models as compared to Heavy models. As one optimizes Model Compute, after some
point, other components like Pre/post processing may become important.
4.6 HARDWARE ACCELERATION
Figure 4.26: Plots represent massive fragmentation in mobile hardware and software space
(figures from [7]). (a) Cumulative Distribution Function (CDF) of SoC market share
(b) OpenCL support (c) OpenGL ES support
Chapter 2 discussed that there are many possible hardware backends on mobile chipsets.
Apart from CPUs, modern mobile Systems on Chip (SoCs) have mobile GPUs, compute DSPs
and even Deep Learning specific accelerators. While it may seem very enticing
to target one or more of these hardware backends, there are many issues plaguing the mobile
space.
Firstly, there is massive diversity of mobile chipsets found in the world. Fig. 4.26a
illustrates that only 30 SoCs have market share greater than 1%, and all those SoCs together
account for only 51% of the market. This diversity arises from the fact that there are
usually many Intellectual Properties (IPs) on a mobile SoC provided by multiple vendors.
This heterogeneity leads to the creation of many different types of SoCs. This implies that
mobile hardware space is highly fragmented and hence there is no "typical" mobile device to
optimize for, as pointed out by Wu et al. [7].
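The kind of statistic quoted above (how many SoCs exceed a share threshold, and how much of
the market they cover together) can be computed as follows. This is only a sketch: the
function names and the share values are hypothetical and do not reproduce the data of [7].

```python
def cumulative_shares(shares_percent):
    """Cumulative market share with SoCs ordered from most to least popular."""
    ordered = sorted(shares_percent, reverse=True)
    running, cdf = 0.0, []
    for share in ordered:
        running += share
        cdf.append(running)
    return cdf


def coverage_above(shares_percent, threshold_percent=1.0):
    """Count SoCs above the threshold and sum the market share they cover."""
    popular = [s for s in shares_percent if s > threshold_percent]
    return len(popular), sum(popular)


# Hypothetical per-SoC market shares in percent (illustration only).
hypothetical_shares = [5.0, 3.0, 2.0, 0.5, 0.5]
```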
Secondly, programmability of these diverse hardware backends is a major bottleneck
for mobile developers. The software infrastructure available is very fragmented and ill-
maintained. As Fig. 4.26b and Fig. 4.26c point out, there are many languages used to
program heterogeneous hardware, with each one of them having multiple versions. In fact,
many mobile devices have such libraries either broken or completely absent.
Given how fragmented mobile hardware and software space is, mobile CPUs are the most used
hardware target for Deep Learning inference [7], and hence the most relevant from a
performance evaluation perspective for a generic mobile application developer. This is why
the next subsection provides an in-depth look into inference acceleration on mobile CPUs,
after which we look into using Neural Network API (NNAPI) for heterogeneous acceleration.
4.6.1 CPU Multithreading
The Central Processing Unit (CPU) is the central computation hardware on a mobile SoC. It is
the default backend for any sort of computation unless a co-processor (any IP on the mobile
SoC) is thought to be a better target. For a long time, a single core CPU was enough to run
programs, but with increase in compute requirements, multi-core CPUs became omnipresent.
One can find different core counts in production CPUs - dual (2), quad (4), octa (8) and so
on. Today most of the mobile CPUs in the market have at least four cores, if not more.
Mobile CPUs are used by end users more frequently than desktop or server CPUs. Due
to their end user facing nature, they face a constant demand to improve on performance.
This is why they have a faster iteration rate of development than desktop CPUs [41].
Generally, they inherit features such as generational microarchitectural enhancements,
memory hierarchy growth and increased clock frequency from the desktop CPU space. This has
resulted in more than 10X performance improvement in the past ten years [41]. But, an
interesting trend observed in mobile CPUs is that they tend to pass up on many desktop CPU
performance optimizations for better energy efficiency. This is a very important
distinction to note, which will be looked at in more detail in the next section.
As with desktop CPUs, multiple cores are preferred to a single core for multiple reasons.
Many modern mobile applications rely on availability of multiple threads to extract
sufficient parallelism. Also, multiple cores can mitigate resource contention between
application and background threads by placing background work on a separate core, not
affecting user experience. Lastly, as elucidated in the previous subsection, relying on
mobile CPUs is the safest bet for a mobile developer for inference, because their end user
interaction will force them to keep on improving, as opposed to other IPs which might
develop at a slower rate [41].
Almost all recent mobile CPUs implement ARM's big.LITTLE architecture [42]. Such a
CPU comprises some number of big cores and some number of LITTLE cores, both of
which have the same architecture (meaning they implement the same ISA) but have different
microarchitectures. This leads to performance-energy trade-offs between them, with big
cores being more performant and LITTLE cores being more energy efficient. Historically, big
cores are complex, out-of-order, multi-issue pipelines akin to desktop/server CPUs, while
LITTLE cores are simple, in-order, multi-stage pipelines. This heterogeneity in mobile CPUs
makes them very different from desktop CPUs [43]. Such a design choice helps manage the
constant user demand for performance, while still keeping energy consumption on track. So,
now there are two types of compute cores to choose from for a given task (mapped to a
software thread). This choice is made by the Operating System (OS) through one of two
software models.
CPU Migration is the first type of model, where each big core is paired with a LITTLE core.
The OS sees each big-LITTLE pair as a single virtual core and places the task on one
such virtual core based on current load demands. The thread then runs on one of the
two cores of the pair, while the idle core is turned off (usually through clock gating).
The second type of model is called Global Task Scheduling (GTS). Here, each big or
LITTLE core is a separate entity, and the scheduler is aware of each of their compute
and energy performance differences. The scheduler keeps track of the load of each thread
through a load metric. This metric is computed as a historical weighted average across
the thread's running time [43]. Note that higher weights are given to more recent runs.
Now, if a thread is running on a LITTLE core, the scheduler checks the load metric
periodically and compares it with an "up migration threshold". If the load metric exceeds
the up migration threshold, the thread is migrated to a big core. Similarly, if a thread is
running on a big core, the load metric is compared with the "down migration threshold". If
it becomes smaller than the threshold, the thread is migrated from the big to a LITTLE core.
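The threshold logic described above can be sketched as follows. This is a simplified
illustration and not the scheduler's actual implementation: the class name, the decay
factor and the two threshold values are assumptions chosen only to make the example
concrete, and the load metric is modeled as an exponentially weighted average so that
recent samples carry more weight.

```python
class TrackedThread:
    """A thread with a recency-weighted load metric (illustrative model only)."""

    def __init__(self, core_type="LITTLE", decay=0.5):
        self.core_type = core_type  # "LITTLE" or "big"
        self.decay = decay          # assumed weight given to the older history
        self.load = 0.0             # load metric in the range [0, 1]

    def record(self, busy_fraction):
        # Newer samples dominate because older history is scaled down by `decay`.
        self.load = self.decay * self.load + (1.0 - self.decay) * busy_fraction


def schedule_tick(thread, up_threshold=0.8, down_threshold=0.3):
    """Periodic check: migrate up or down when the load crosses a threshold."""
    if thread.core_type == "LITTLE" and thread.load > up_threshold:
        thread.core_type = "big"
    elif thread.core_type == "big" and thread.load < down_threshold:
        thread.core_type = "LITTLE"
    return thread.core_type
```

Keeping the up threshold above the down threshold gives hysteresis, so a thread whose load
hovers around a single value is not migrated back and forth on every check.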
The design of the GTS software model allows any number of cores in the CPU, unlike
CPU Migration which requires an equal number of big and LITTLE cores. GTS allows any
number of cores to be active, while CPU Migration allows only half of them to be active.
Most importantly, GTS allows managing both types of cores as different clusters [43]. These
advantages make employing GTS more attractive, which is why almost all CPUs in recent
years have GTS based thread scheduling.
GTS employs many software thread affinity management techniques. Fork migration
operates when a fork system call is made. This happens when a new thread is created.
Such a thread defaults to a big core. Wake migration deals with an idle thread being woken
up. GTS uses the previously tracked load history of the thread to map it to a core.
Forced migration takes care of long running threads, continuously tracking if they need to
be migrated based on their load metric. Idle pull migration tracks the state of every big
core, and tries to "pull" threads from LITTLE cores in case a big core is sitting idle.
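Three of the placement rules above can be written down as small functions. This is a
hedged sketch rather than the scheduler's code: the function names, the wake threshold and
the policy of pulling the most loaded LITTLE threads first are all assumptions made for
illustration.

```python
def place_on_fork():
    """Fork migration: a newly created thread starts on a big core."""
    return "big"


def place_on_wake(previous_load, up_threshold=0.8):
    """Wake migration: use the thread's tracked load history to pick a core.

    Comparing against a single threshold is an assumption of this sketch.
    """
    return "big" if previous_load > up_threshold else "LITTLE"


def idle_pull(idle_big_cores, little_thread_loads):
    """Idle pull migration: choose threads on LITTLE cores to move to idle big cores.

    `little_thread_loads` maps a thread name to its load metric. Pulling the most
    loaded threads first is an assumed policy, not one stated in the text.
    """
    ranked = sorted(little_thread_loads, key=little_thread_loads.get, reverse=True)
    return ranked[:idle_big_cores]
```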
These strategies become especially active when the amount of work becomes large, which is
often the case when running standardized benchmarks like GeekBench [20] and Antutu [21].
But, notably, having such a large amount of work might not be the average case when a
general user uses the device. This implies that the performance reported by mobile
benchmark suites might not be the most accurate, because they might not be covering the
average case. Since thread migration is very common, such mobile SoCs have hardware
coherency protocols enabled to alleviate the overhead of constant data transfer through the
main memory. Finally, as one might have observed, GTS scheduling is done by the OS,
which makes the OS an important factor when considering performance.
We will consider two popular mobile devices for evaluation - Google Pixel 2 and Samsung
Galaxy S9.
Google Pixel 2 has a Qualcomm Snapdragon 835 chipset which contains Kryo 280 CPU
cores. These are semi-custom ARM cores based on ARM's Built on ARM Cortex technology.
This means Qualcomm made certain changes to the stock ARM core, such as the branch
predictor, instruction fetch and other frontend components. Usually, there are some
features which are off limits for manufacturers, such as decoder width and execution
pipeline. It is an octa-core CPU with four big (ARM Cortex A73) and four LITTLE cores (ARM
Cortex A53). The clock frequency peaks at 2.45 GHz for the big cores and at 1.9 GHz for the
LITTLE cores.
Fig. 4.27 illustrates performance at different number of threads for Heavy and Light mod-
els. In Fig. 4.27a, one can observe that densenet and inception v3 perform best with 4
threads while resnet v2 101 with 6 threads. Fig. 4.27b depicts that mobilenet is best with 1
thread, mnasnet with 2 threads while squeezenet with 6 threads.
Figure 4.27: Performance implications of varying number of CPU threads for three Heavy
and Light models on Google Pixel 2 (x-axis ticks in both panels: 1, 2, 4, 6, 8)
Samsung Galaxy S9 has a Qualcomm Snapdragon 845 SoC comprising Kryo 385 CPU
cores. It also has four big and four efficient cores, but one generation ahead of those in
Kryo 280 (A75 and A55 respectively). It employs ARM's DynamIQ cluster technology,
which means that it places all cores into one large cluster as against having multiple
clusters as in Kryo 280. This implies that each core now gets a private L2 cache, a shared
L3 cache and a system wide cache which sits at memory controller/interconnect level. So,
one can expect it to gain more from data locality and memory reuse. Also, it has a much
improved floating point execution pipeline, which should translate to gains in floating
point computations.
Fig. 4.28 illustrates that all models follow a similar speedup/slowdown pattern as viewed
in Fig. 4.27.
Figure 4.28: Performance implications of varying number of CPU threads for three Heavy
and Light models on Samsung Galaxy S9
Notably, Heavy models perform best with a high number of threads, which makes
sense due to the amount of compute. But interestingly, most of them perform best at 4 or
6 threads (except resnet v2 101 on Galaxy S9). This is less than the total number of cores
(8). This could be because it gives GTS some idle cores to do effective thread migration,
as opposed to when all cores are busy.
Light models seem to run best with either 1 or 2 threads. Even though they have less
compute than Heavy models, intuitively one would think that these kinds of computer
vision models should be able to use a higher number of threads. This implies that, on mobile
CPUs, if the amount of compute work is below a certain threshold, placing all work on one or
at most two cores might be a better strategy than using up multiple cores. This should aid
in avoiding the overhead of constant thread migration, or more formally the GTS scheduling
overhead. This would make more sense if those cores are big cores. Squeezenet, despite being
a Light model, seems to perform best at 6 threads. This makes it an anomaly.
The case of resnet v2 101 on Galaxy S9 illustrates that using 4 or 6 threads as a rule of
thumb for Heavy models might not always be best. The presence of GTS
scheduling at OS level makes predicting performance on mobile CPUs a bit tricky. The case
of squeezenet shows that a Light model does not necessarily perform best with a small
number of threads.
All in all, one can begin with the notion that Heavy models need a high (not maximum)
number of threads while Light models need a very low number of threads, but one should be
open to seeing performance anomalies going forward. This uncertainty is attributed to
heterogeneity in mobile CPUs and its associated scheduling logic.
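Since the best thread count can only be found empirically, one practical approach is to
sweep the candidate counts and keep the fastest. The sketch below assumes a caller-supplied
callable that runs one inference with a given number of threads; the helper names, the
candidate list and the number of repeats are assumptions for illustration and are not part
of mPredictor.

```python
import time


def timed_ms(run_inference):
    """Wrap run_inference(num_threads) so that it returns elapsed milliseconds."""
    def measure(num_threads):
        start = time.perf_counter()
        run_inference(num_threads)
        return (time.perf_counter() - start) * 1000.0
    return measure


def best_thread_count(measure_ms, candidates=(1, 2, 4, 6, 8), repeats=10):
    """Return the candidate with the lowest average latency, plus all averages."""
    averages = {}
    for num_threads in candidates:
        samples = [measure_ms(num_threads) for _ in range(repeats)]
        averages[num_threads] = sum(samples) / len(samples)
    best = min(averages, key=averages.get)
    return best, averages
```

A caller would pass `timed_ms(run_inference)` as `measure_ms`. Because of the scheduling
effects discussed above, the result is specific to the model, device and OS version it was
measured on.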
Fig. 4.29 compares performance of Heavy models across both devices. One can note that
for densenet and inception v3, a lower number of threads provides similar performance on
both devices, while a higher number of threads starts to reveal a difference in performance,
with Galaxy S9 doing better as expected. resnet v2 101 performs better on Galaxy S9
irrespective of the number of threads.
Fig. 4.30 shows the same comparison for Light models. Interestingly, for mobilenet and
mnasnet, it is very difficult to differentiate between the devices, while for squeezenet,
Galaxy S9 is much better. This alludes to a general consensus that Heavy models should
perform better on a better mobile CPU as they have a considerable amount of compute. As for
Light models, it is difficult to observe gains in device performance due to lack of compute.
Of course, given that the two devices are only one generation apart, one can expect to see
modest gains in general.
One can expect to observe difference in performance with improvement of mobile CPU
hardware for Heavy models, while this may or may not be true for Light models.
4.6.2 NNAPI
Figure 4.29: Performance comparison of CPUs of Google Pixel 2 and Galaxy S9 across three
Heavy models. Recoverable panel: (c) resnet v2 101, titled "Inference latency of
resnet_v2_101 on Pixel 2 and Galaxy S9" (legend: Pixel 2, Galaxy S9; x-axis ticks: 1, 2, 4,
6, 8)
Neural Network Application Programming Interface (NNAPI) is an Android C API designed
for running computationally intensive operations for Deep Learning on Android mobile
devices [44]. It provides a layer of abstraction between software frameworks and hardware
backends. Fig. 4.31 presents its system architecture. As shown, the framework interacts
with the NNAPI runtime to schedule computational operations on all available hardware
backends. The runtime checks for availability through the presence of a vendor driver for a
given accelerator. It is the responsibility of the SoC developers to provide this vendor
driver. Based on all available accelerators, the runtime distributes workload amongst them.
Notably, the CPU is also one of the potential hardware targets, and the fallback option if
no other backend is found.
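The fallback behaviour described above can be illustrated with a toy assignment routine.
This is not the NNAPI runtime or its API: the function name, the operation names and the
"first driver that supports the operation wins" policy are assumptions of this sketch.

```python
def assign_backends(operations, drivers):
    """Map each operation to a backend, falling back to the CPU.

    `drivers` maps a backend name to the set of operation types its (hypothetical)
    vendor driver supports. An empty dict models a device without vendor drivers.
    """
    plan = []
    for op in operations:
        target = "cpu"  # fallback when no accelerator driver supports the op
        for backend, supported in drivers.items():
            if op in supported:
                target = backend
                break
        plan.append((op, target))
    return plan
```

With no drivers present, every operation is assigned to the CPU, which matches the fallback
case in the text.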
Figure 4.30: Performance comparison of CPUs of Google Pixel 2 and Galaxy S9 across three
Light models. Recoverable panels: (a) mobilenet v1 0.25 128 and (c) mnasnet 0.5 224, each
plotting inference latency on Pixel 2 and Galaxy S9 (x-axis ticks: 1, 2, 4, 6, 8)
Fig. 4.32 compares the performance between the best runtime provided by the CPU (optimal
number of threads for the given model and hardware) and NNAPI for both Google Pixel
2 and Samsung Galaxy S9. In Fig. 4.32a, one can observe that there is a considerable drop
in performance for all except resnet v2 101 on Galaxy S9. In Fig. 4.32b, for mobilenet and
mnasnet, it is difficult to differentiate between best CPU and NNAPI, while for squeezenet,
neither is consistently better. So, for Heavy models, where there is a massive amount
of compute, NNAPI is considerably worse than optimal performance on the CPU, while for
Light models, where there is much less compute, both provide similar performance. In a
nutshell, the more the compute, the poorer NNAPI performs.
There could be two reasons for this. Firstly, as mentioned earlier, NNAPI defaults to the
CPU in the absence of vendor drivers for all accelerators on the device. By the looks of
it, the devices under consideration face this issue. So, effectively both best CPU and
NNAPI computations run on the CPU. Intuitively, one would think that if they are running
on the same hardware backend, such a drop in performance should not be observed. The drop
could be attributed to NNAPI's execution model. The runtime figures out the hardware backend
for every computational node in the model graph. This, when coupled with communication
between the framework (Tensorflow Lite) and the runtime, adds a massive amount of scheduling
overhead to the execution. In simpler terms, if both of them are going to run on the CPU,
NNAPI is going to perform worse due to scheduling and communication overhead.
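The argument above can be expressed as a simple cost model. This is an assumed model for
illustration, not a measurement: it supposes that delegated execution pays a fixed
scheduling and communication overhead for every node in the graph, and the function names
and numbers are hypothetical.

```python
def direct_cpu_latency_ms(node_compute_ms):
    """Latency when the framework runs every node on the CPU itself."""
    return sum(node_compute_ms)


def delegated_latency_ms(node_compute_ms, per_node_overhead_ms):
    """Latency when each node additionally pays an assumed fixed overhead."""
    return sum(node_compute_ms) + per_node_overhead_ms * len(node_compute_ms)
```

Under this model the extra cost grows with the number of nodes, so a graph with many nodes
would see a larger absolute penalty than a small one even when both paths end up on the
same CPU.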
Secondly, NNAPI was first released with Android 8.1 (API level 27), which is why one can
say that it is relatively immature in its development. Here, Pixel 2 was running Android 10
while Galaxy S9 was running Android 9. Interestingly, since this is an OS feature, its
dependence on OS version makes performance estimation non-trivial.
Figure 4.31: System architecture for Android NNAPI (figure from [44])
Lastly, we should raise a caveat regarding this comparison. The plots have compared best
CPU performance against NNAPI, which has shown NNAPI in a rather poor light. The catch
is that it is very difficult for one, especially a mobile application developer, to
determine the optimal number of threads for a given model. So, in the average case, NNAPI
might not perform as badly in comparison.
The advantage of NNAPI is that it aims to provide programmability for such diverse
hardware systems. It does the heavy lifting for the user and the software framework by
mapping computations to the relevant available backends. So, in due course of time, as its
development matures, it could become the go-to infrastructure for Deep Learning
acceleration on mobile devices.
NNAPI might show considerable drop in performance for Heavy models, while comparable
performance for Light models. This implies that a user should prefer to stay with multi
Figure 4.32: Comparison of best CPU performance with NNAPI for three Heavy and Light
models
4.6.3 Quantization
Fig. 4.33 illustrates the effect of multithreading on the CPU with respect to float and
quantized models. Each plot scans the performance of both float and quantized versions of
the model with changing number of threads. The first two models are Heavy models while the
last two are Light.
Figure 4.33: Comparison of float and quantized models with varying number of threads
on the CPU. Evaluation configuration: (1) Framework: Tensorflow Lite, (2) Device: Google
Pixel 2. Recoverable panels: (c) mobilenet v1 0.25 128 and (d) mobilenet v1 0.5 160, each
plotting inference latency of the float and quantized (_quant) versions with varying CPU
threads (x-axis ticks: 1, 2, 4, 6, 8)
One can observe that Heavy quantized models gain with increase in number of threads,
peaking at 4 or 6 threads just like Heavy float models. Light quantized models, like their
float counterparts, perform best with very few threads. So, the scaling effect of
multithreading is the same for both float and corresponding quantized models.
The more interesting inference is that the relative difference between float and quantized
models remains similar with change in number of threads, especially for Heavy models. This
makes sense because quantization modifies the data type used for numerical representation,
which improves runtime, but the number of compute operations remains the same.
This implies that from a user's perspective, the effect of multithreading is agnostic to
the data type used.
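To make the point about data type versus operation count concrete, the sketch below maps
floating point values to 8-bit integers with an affine scheme. The thesis does not spell out
the quantization scheme, so the affine mapping, the function names and the scale and zero
point values are assumptions for illustration only.

```python
def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers: q = round(v / scale) + zero_point, clamped to range."""
    quantized = []
    for v in values:
        q = round(v / scale) + zero_point
        quantized.append(max(qmin, min(qmax, q)))
    return quantized


def dequantize(qvalues, scale, zero_point):
    """Approximate inverse of quantize: v = scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in qvalues]
```

The quantized list has exactly as many elements as the float list, so a layer operating on
it performs the same number of operations, only on a narrower data type.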
The effect of scaling due to increase in number of threads on the CPU is independent of
the data type used in the model. This means that one can extrapolate the relative change in
performance from one data type to another.
4.6.4 Performance Variation
Figure 4.34: Performance variation of two Heavy and Light models over 100 consecutive
inferences. X-axis represents inference number, from 1 to 100 in multiples of 10. Red line
depicts the average latency across the 100 runs. Each point shows relative difference from
the average. Evaluation configuration: (1) Framework: Tensorflow Lite, (2) Device: Google
Pixel 2, (3) Hardware: 4 CPU threads, (4) Datatype: Float. Panels: (a) densenet,
(b) inception v3, (c) mobilenet v1 0.25 128, (d) mnasnet 0.5 224
A previous section discussed the presence of heterogeneity in multi core mobile CPUs
(ARM’s big.LITTLE) and the consequent usage of Global Task Scheduling (GTS) software
model for thread management. The GTS, implemented as part of the Operating System,
maintains a load metric per thread which is primarily dictated by the thread’s running time
and core affinity.
All modern mobile CPUs use a process called Thermal Throttling, where in order to
conserve battery, the amount of voltage supplied to the CPU is reduced based on
the workload being run. This is usually controlled through frequency scaling techniques.
So, if the SoC is experiencing a high amount of heat dissipation, the clock frequency is
throttled, leading to less heat emission but, at the same time, slower CPU performance.
This process is controlled by the OEM driver. Evidently, there is a fine line to tread when
using such a methodology. If the driver throttles too much, the user might observe very bad
performance, which would not reflect well on the hardware underneath. But, if it throttles
too little, the phone battery might die, hampering user experience.
Figure 4.35: Performance variation of two Heavy and Light models over 100 consecutive
inferences with 1 CPU thread. X-axis represents inference number, from 1 to 100 in multiples
of 10. Red line depicts the average latency across the 100 runs. Each point shows relative
difference from the average. Evaluation configuration: (1) Framework: Tensorflow Lite,
(2) Device: Google Pixel 2, (3) Hardware: 1 CPU thread, (4) Datatype: Float. Panels:
(a) densenet, (b) inception v3, (c) mobilenet v1 0.25 128, (d) mnasnet 0.5 224
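The frequency-based throttling described in this section can be pictured as a simple
control rule. The sketch below is a toy model and not an OEM driver: the function name, the
temperature limit, the step size and the minimum frequency are assumptions, and only the
2450 MHz ceiling echoes the big-core clock quoted earlier for the Pixel 2 chipset.

```python
def next_frequency_mhz(current_mhz, temperature_c, limit_c=70.0,
                       step_mhz=200, min_mhz=300, max_mhz=2450):
    """Step the clock down when the SoC is too hot, otherwise step it back up."""
    if temperature_c > limit_c:
        return max(min_mhz, current_mhz - step_mhz)
    return min(max_mhz, current_mhz + step_mhz)
```

A larger step or a lower limit corresponds to throttling more aggressively, which is the
trade-off between observed performance and heat or battery drain discussed above.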
The aforementioned processes within a mobile SoC might bring about a high degree of
unpredictability when analyzing Deep Learning performance. The thread migration decisions
made by the GTS and thermal throttling done by the OEM driver might lead to large
variations in reported performance. Note that although one may be able to control the
throttling frequency (based on OS, SoC and device used), doing so would not approximate
real-world performance, since it is an uncontrollable parameter during application
deployment. This is why we make multiple benchmarking runs, where each run consists of 100
consecutive inferences. Making multiple benchmarking runs separately should mimic separate
real-world scenarios, while using consecutive inferences should provide a good proxy for
performance. The hope is that 100 inferences is small enough that one does not observe the
effects of thermal throttling, while being large enough to provide a stable performance
metric. The number of consecutive inferences is a runtime parameter controllable by the
user.
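The benchmarking procedure described above can be sketched as a small harness. This is an
illustration rather than the benchmarking application's code: the function names, the
default of three runs and the injectable clock are assumptions, and `infer` stands for any
caller-supplied callable that performs one inference.

```python
import time


def benchmark_run(infer, num_inferences=100, clock=time.perf_counter):
    """Time `num_inferences` consecutive calls and return latencies in milliseconds."""
    latencies_ms = []
    for _ in range(num_inferences):
        start = clock()
        infer()
        latencies_ms.append((clock() - start) * 1000.0)
    return latencies_ms


def benchmark(infer, num_runs=3, num_inferences=100, clock=time.perf_counter):
    """Repeat the whole run several times; each run keeps its own latency list."""
    return [benchmark_run(infer, num_inferences, clock) for _ in range(num_runs)]
```

Keeping the runs separate, instead of pooling all samples, preserves the run-to-run
differences that the separate runs are meant to expose.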
Fig. 4.34 presents the variation in performance over 100 consecutive inference runs for two
Heavy and Light models. In each plot, the red line depicts the average over all the 100
runs, and hence each data point represents the difference with respect to the reported
average. Note that the Y axis of each subplot has a different scale, which one should take
into consideration when reading the plots.
Figure 4.36: Performance variation of two Heavy and Light models over 100 consecutive
inferences with 6 CPU threads. X-axis represents inference number, from 1 to 100 in multiples
of 10. Red line depicts the average latency across the 100 runs. Each point shows relative
difference from the average. Evaluation configuration: (1) Framework: Tensorflow Lite,
(2) Device: Google Pixel 2, (3) Hardware: 6 CPU threads, (4) Datatype: Float. Panels:
(a) densenet, (b) inception v3, (c) mobilenet v1 0.25 128, (d) mnasnet 0.5 224
In each subplot, one can observe that the runtime is fairly low to begin with. This is
because, as explained in an earlier section, a thread when forked is placed on a big core.
Then, in due course of time, inference latency settles down to its average behaviour.
Notably, Light models may see a much higher percentage variance in performance, as the
effects of GTS and thermal throttling are magnified due to their smaller runtimes.
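The quantity plotted in these figures, the difference of each inference from the average of
the run, can be computed as below. Expressing it as a percentage of the average is an
assumption of this sketch, and the function name is hypothetical.

```python
def relative_deviation_percent(latencies_ms):
    """Deviation of each latency from the run's mean, as a percentage of the mean."""
    if not latencies_ms:
        raise ValueError("at least one latency is required")
    mean = sum(latencies_ms) / len(latencies_ms)
    return [100.0 * (x - mean) / mean for x in latencies_ms]
```

Because the deviation is normalized by the mean, the same absolute jitter in milliseconds
shows up as a larger percentage for a model with a small runtime, which is consistent with
the observation about Light models.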
We explore the effects of GTS through Fig. 4.35, Fig. 4.36 and Fig. 4.37, where the number
of CPU threads has been varied. Evidently, Fig. 4.35 illustrates much less variation in
performance as compared to other scenarios. This is because it uses one CPU thread, which
minimizes the scope of thread migration and management. Interestingly, Fig. 4.37 illustrates
a higher percentage variation due to potentially larger thread management overhead.
Figure 4.37: Performance variation of two Heavy and Light models over 100 consecutive
inferences with 8 CPU threads. X-axis represents inference number, from 1 to 100 in multiples
of 10. Red line depicts the average latency across the 100 runs. Each point shows relative
difference from the average. Evaluation configuration: (1) Framework: Tensorflow Lite,
(2) Device: Google Pixel 2, (3) Hardware: 8 CPU threads, (4) Datatype: Float. Panels:
(a) densenet, (b) inception v3, (c) mobilenet v1 0.25 128, (d) mnasnet 0.5 224
For the aforementioned reasons, one can expect to see reasonable variance in runtime
performance. The reason behind the variance could be the mobile device, chipset, Operating
System version, benchmarking conditions and so on. This makes reporting reproducible
benchmark results difficult. This could be one reason why people report performance with 1
CPU thread. But, we feel that, while doing so might eliminate thread scheduling effects, it
might not present the true performance comparison, as we showed that Light models perform
best with a small number of threads while Heavy ones require a large number of threads (not
the maximum). So, as long as one is comparing Light models only, using 1 CPU thread might
be representative. But, for Heavy models (with/without Light models), using fewer
threads might show them in a worse light than they deserve. For this reason,
we decided to depict comparative performance using 4 threads. In the end, we feel that
given the fragmented nature of mobile software and hardware space, one should not take
such benchmarking results for their true numbers, but should rather focus on the relative
learnings.
The presence of techniques like Thermal Throttling and Global Task Scheduling makes
benchmarking mobile devices difficult. Moreover, usage of a large number of CPU threads may
lead to larger variance in observed performance due to increased scheduling opportunities
and overhead. To summarize, given the fragmented nature of mobile space, one should infer
trends and relative comparisons rather than true performance values from benchmarking
studies on mobile devices.
CHAPTER 5: CONCLUSION
The thesis focused on Deep Learning on mobile devices. Chapter 2 surveyed the space
and identified three major problems hampering integration of Deep Learning into mobile
applications.
Chapter 2.1 presented the large number of options a developer has in the space of software
frameworks and hardware backends for Deep Learning. It motivated the need for an interface
which lowers programmability overhead. Chapter 3.2 presented such an interface, called a
mobile Predictor (mPredictor). This interface has been derived from MLModelScope [1]. It
consists of four major API calls - New, Predict, ReadPredictionOutput and Delete. Of course,
it can have additional API calls, but these four comprise the minimal set required.
Chapter 3.3 described how one can implement a framework mPredictor. Once implemented, a
framework mPredictor can be used by any application developer through methods illustrated
in Chapter 3.4.
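As a reminder of the shape of this interface, the four calls can be sketched as an abstract
class. Only the four call names come from the thesis; the Python rendering, the parameter
lists and the toy EchoPredictor are assumptions made for illustration and do not reproduce
the actual mPredictor signatures.

```python
from abc import ABC, abstractmethod


class MPredictor(ABC):
    """Minimal four-call predictor interface (signatures are assumed)."""

    @classmethod
    @abstractmethod
    def new(cls, model_path, num_threads=1):
        """New: create a predictor for the given model."""

    @abstractmethod
    def predict(self, input_data):
        """Predict: run inference on preprocessed input."""

    @abstractmethod
    def read_prediction_output(self):
        """ReadPredictionOutput: return the result of the last Predict call."""

    @abstractmethod
    def delete(self):
        """Delete: release the resources held by the predictor."""


class EchoPredictor(MPredictor):
    """Toy stand-in that returns its input; it wraps no real framework."""

    def __init__(self):
        self._output = None

    @classmethod
    def new(cls, model_path, num_threads=1):
        return cls()

    def predict(self, input_data):
        self._output = list(input_data)

    def read_prediction_output(self):
        return self._output

    def delete(self):
        self._output = None
```

An application written against such an interface stays unchanged when the framework
specific implementation behind it is swapped.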
The thesis motivated the need for an open, user-customizable benchmarking system for
Deep Learning on mobile devices in Chapter 2.2. Chapter 4.1 presented MLModelScope
mobile agent as one such system. The user can choose a framework, a built-in model, sup-
ported hardware backend and datatype to perform an experimental run. The agent currently
outputs two metrics - inference latency and throughput. Moreover, it provides the model
loading time, model coldstart, data preprocessing, model compute and data postprocessing
time, many of which make up the reported inference latency. The presented data should
provide an in-depth picture of the experiment performed to the user.
Lastly, Chapter 2.3 motivated the need for a publicly available fine grained performance
study of Deep Learning on mobile devices capable of guiding a generic application developer
in integrating such a service in his/her application. Chapter 4 presented such an evaluation.
Chapter 4.2 described six popular image classification models and then classified them as
Heavy and Light models based on their model complexity, inference latency and model size.
It showed that there is a latency-accuracy trade-off when choosing between Heavy and Light
models.
Chapter 4.3 illustrated that a given model architecture can have multiple versions, where
performance can vary from one version to another. Chapter 4.4 presented the potential gain
from full integer quantization as a model optimization methodology. It showed that Heavy
models might gain more from such optimizations. Chapter 4.5 broke up a model deployment
into multiple parts. It illustrated that such a division can help gain more understanding for
execution of Light models.
Chapter 4.6 discussed the effects of accelerating such workloads on mobile CPUs. It
presented an inference that Heavy models generally perform better with a high number of
threads (though fewer than the number of cores), while Light models perform better with few
threads. It also illustrated that an improvement in CPU hardware may or may not show
improvement in Light model performance. Moreover, it illustrated that NNAPI might not be
mature enough to overtake best CPU performance. Also, an increase in the number of threads
would provide similar performance gains for models of different datatypes. Finally, it
illustrated the presence of variation in performance on mobile devices due to
uncontrollable factors like Global Task Scheduling (GTS), Thermal Throttling and so on.
Future Work. While the thesis presents a good overview of Deep Learning on mobile
space, there are aspects that can be looked at further. Future work involves performing
a comparative study of three software frameworks - Tensorflow Lite, Qualcomm SNPE
and TVM. It also involves comparing runtime and energy benefits of CPU, GPU and DSP.
All the evaluations in the thesis focused on the scenario of one image inference request.
An extension could involve using a batch size of more than one, which might be the case in
some applications. Moreover, one might want to pursue other tasks like object detection,
image enhancement and so on. The thesis should provide a good base for all such future
analysis.
REFERENCES
[1] A. Dakkak, C. Li, A. Srivastava, J. Xiong, and W. W. Hwu, “Mlmodelscope: Evaluate
and measure ML models within AI pipelines,” CoRR, vol. abs/1811.09737, 2018.
[Online]. Available: http://arxiv.org/abs/1811.09737
[2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444,
May 2015.
[3] A. Esteva et al., “Dermatologist-level classification of skin cancer with deep neural
networks,” Nature, vol. 542, pp. 115–118, Feb. 2017.
[4] S.-C. Lin, Y. Zhang, C.-H. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars,
“The architectural implications of autonomous driving: Constraints and acceleration,”
SIGPLAN Not., vol. 53, no. 2, pp. 751–766, Mar. 2018. [Online]. Available:
http://doi.acm.org/10.1145/3296957.3173191
[5] V. Sze, Y. Chen, T. Yang, and J. S. Emer, “Efficient processing of deep neural networks:
A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, Dec
2017.
[6] Bankmycell, “How many people have phones worldwide,” website, 2019. [Online].
Available: https://www.bankmycell.com/blog/how-many-phones-are-in-the-world
[7] C. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood,
E. Isaac, Y. Jia, B. Jia, T. Leyvand, H. Lu, Y. Lu, L. Qiao, B. Reagen, J. Spisak,
F. Sun, A. Tulloch, P. Vajda, X. Wang, Y. Wang, B. Wasti, Y. Wu, R. Xian, S. Yoo,
and P. Zhang, “Machine learning at facebook: Understanding inference at the edge,”
in 2019 IEEE International Symposium on High Performance Computer Architecture
(HPCA), Feb 2019, pp. 331–344.
[8] M. D. Hill and V. J. Reddi, “Gables: A roofline model for mobile socs,” in
25th IEEE International Symposium on High Performance Computer Architecture,
HPCA 2019, Washington, DC, USA, February 16-20, 2019, pp. 317–330. [Online]. Available:
https://doi.org/10.1109/HPCA.2019.00047
[9] ARM, “Arm financial results,” ARM Website, 2018. [Online]. Available:
https://www.arm.com/company/investors
[10] Apple, “Apple a12 bionic,” Apple Website, 2019. [Online]. Available:
https://www.apple.com/shop/buy-iphone/iphone-xr
[11] AnandTech, “Cambricon, makers of huawei’s kirin npu ip, build a




[12] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga,
S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden,
M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale
machine learning,” in 12th USENIX Symposium on Operating Systems Design
and Implementation (OSDI 16). Savannah, GA: USENIX Association,
Nov. 2016, pp. 265–283. [Online]. Available:
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
[13] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang,
and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for
heterogeneous distributed systems,” CoRR, vol. abs/1512.01274, 2015. [Online].
Available: http://arxiv.org/abs/1512.01274
[14] Facebook, “Pytorch,” Pytorch Website, 2019. [Online]. Available: https://pytorch.org
[15] Google, “Tensorflow lite,” TensorflowLite Website, 2019. [Online]. Available:
https://www.tensorflow.org/lite
[16] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen,
L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Tvm: An
automated end-to-end optimizing compiler for deep learning,” in Proceedings of
the 12th USENIX Conference on Operating Systems Design and Implementation,
ser. OSDI’18. Berkeley, CA, USA: USENIX Association, 2018, pp. 579–594. [Online].
Available: http://dl.acm.org/citation.cfm?id=3291168.3291211
[17] Facebook, “Glow,” Glow Website, 2019. [Online]. Available:
https://ai.facebook.com/tools/glow/
[18] Qualcomm, “Qualcomm neural processing sdk,” Qualcomm Website, 2019. [Online].
Available: https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk
[19] Apple, “Apple coreml,” Apple Website, 2019. [Online]. Available:
https://developer.apple.com/documentation/coreml
[20] Primate, “Geekbench,” GeekBench Website, 2019. [Online]. Available:
https://www.geekbench.com
[21] AnTuTu, “Antutu benchmark,” AnTuTu Website, 2019. [Online]. Available:
http://www.antutu.com/en/
[22] A. Ignatov, R. Timofte, A. Kulik, S. Yang, K. Wang, F. Baum, M. Wu, L. Xu, and
L. V. Gool, “Ai benchmark: All about deep learning on smartphones in 2019,” 2019.
[23] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. Anderson,
M. Breughe, M. Charlebois, W. Chou, R. Chukka, C. Coleman, S. Davis, P. Deng,
G. Diamos, J. Duke, D. Fick, J. S. Gardner, I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao,
T. S. John, P. Kanwar, D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius,
C. Osborne, G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao, F. Sun,
H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong,
P. Zhang, and Y. Zhou, “Mlperf inference benchmark,” 2019.
[24] Google, “Android os,” Android Website, 2019. [Online]. Available:
https://www.android.com
[25] Apple, “ios,” Apple Website, 2019. [Online]. Available:
https://www.apple.com/ios/ios-13/
[26] Google, “The go programming language,” Golang Website, 2019. [Online]. Available:
https://golang.org
[27] Google, “Bazel - a fast, scalable, multi-language and extensible build system,” Bazel
Website, 2019. [Online]. Available: https://bazel.build
[28] Google, “Mobile - golang/go wiki,” Gomobile Website, 2019. [Online]. Available:
https://github.com/golang/go/wiki/Mobile
[29] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,”
CoRR, vol. abs/1608.06993, 2016. [Online]. Available: http://arxiv.org/abs/1608.06993
[30] L. Roeder, “Netron,” Netron website, 2019. [Online]. Available:
https://github.com/lutzroeder/netron
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, vol.
abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
[32] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,”
CoRR, vol. abs/1603.05027, 2016. [Online]. Available: http://arxiv.org/abs/1603.05027
[33] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks
for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017. [Online]. Available:
http://arxiv.org/abs/1704.04861
[34] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer,
“Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size,”
CoRR, vol. abs/1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360
[35] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, “Mnasnet: Platform-aware
neural architecture search for mobile,” CoRR, vol. abs/1807.11626, 2018. [Online].
Available: http://arxiv.org/abs/1807.11626
[36] E. J. Crowley, J. Turner, A. Storkey, and M. O’Boyle, “A closer look at structured
pruning for neural network compression,” 2018.
[37] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network
pruning,” in International Conference on Learning Representations, 2019. [Online].
Available: https://openreview.net/forum?id=rJlnB3C5Ym
[38] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. G. Howard, H. Adam, and
D. Kalenichenko, “Quantization and training of neural networks for efficient integer-
arithmetic-only inference,” CoRR, vol. abs/1712.05877, 2017. [Online]. Available:
http://arxiv.org/abs/1712.05877
[39] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference:
A whitepaper,” CoRR, vol. abs/1806.08342, 2018. [Online]. Available:
http://arxiv.org/abs/1806.08342
[40] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,”
2015.
[41] M. Halpern, Y. Zhu, and V. J. Reddi, “Mobile cpu’s rise to power: Quantifying the
impact of generational mobile cpu design trends on performance, energy, and user
satisfaction,” in 2016 IEEE International Symposium on High Performance Computer
Architecture (HPCA), March 2016, pp. 64–76.
[42] A. Butko, F. Bruguier, D. Novo, A. Gamatié, and G. Sassatelli, “Exploration of
performance and energy trade-offs for heterogeneous multicore architectures,” 2019.
[43] ARM, “big.little technology: The future of mobile,” ARM website, 2013. [Online].
Available: https://www.arm.com/files/pdf/bigLITTLETechnologytheFutueofMobile.pdf
[44] Google, “Neural networks api,” Android website, 2019. [Online]. Available:
https://developer.android.com/ndk/guides/neuralnetworks/
