QuTiBench: Benchmarking Neural Networks on Heterogeneous Hardware by Blott, Michaela et al.
QuTiBench: Benchmarking Neural Networks on Heterogeneous Hardware
MICHAELA BLOTT, Xilinx Research
LISA HALDER, Xilinx Research, Ulm University
MIRIAM LEESER, Northeastern University
LINDA DOYLE, Trinity College Dublin
Neural Networks have become one of the most successful universal machine learning algorithms. They play a
key role in enabling machine vision and speech recognition, and are increasingly adopted in other application
domains. Their computational complexity is enormous and comes along with equally challenging memory
requirements both in regards to capacity and access bandwidth, which limits deployment in particular within
energy constrained, embedded environments. In order to address these implementation challenges, a broad
spectrum of new customized and heterogeneous hardware architectures has emerged, often accompanied
with co-designed algorithms to extract maximum benefit out of the hardware. Furthermore, numerous
optimization techniques are being explored for neural networks to reduce compute and memory requirements
while maintaining accuracy. This results in an abundance of algorithmic and architectural choices, some of
which fit specific use cases better than others.
For system level designers, there is currently no good way to compare the variety of hardware, algorithm and
optimization options. While there are many benchmarking efforts in this field, they cover only subsections of
the embedded design space. None of the existing benchmarks support essential algorithmic optimizations such
as quantization, an important technique to stay on chip, or specialized heterogeneous hardware architectures.
We propose a novel benchmark suite, QuTiBench, that addresses this need. QuTiBench is a novel multi-tiered
benchmarking methodology (Ti) that supports algorithmic optimizations such as quantization (Qu) and helps
system developers understand the benefits and limitations of these novel compute architectures in regard to
specific neural networks and will help drive future innovation. We invite the community to contribute to
QuTiBench in order to support the full spectrum of choices in implementing machine learning systems.
CCS Concepts: • Computing methodologies → Neural networks; Model development and analysis; • Hardware
→ Analysis and design of emerging devices and systems;
Additional Key Words and Phrases: Neural networks, accelerators, benchmarks, heterogeneous hardware
ACM Reference format:
Michaela Blott, Lisa Halder, Miriam Leeser, and Linda Doyle. 2016. QuTiBench: Benchmarking Neural
Networks on Heterogeneous Hardware. ACM J. Emerg. Technol. Comput. Syst. 1, 1, Article 1 (September 2016),
34 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2019 ACM. 1550-4832/2016/9-ART1 $15.00
DOI: 10.1145/nnnnnnn.nnnnnnn
ACM Journal on Emerging Technologies in Computing Systems, Vol. 1, No. 1, Article 1. Publication date: September 2016.
arXiv:1909.05009v2 [cs.AR] 17 Nov 2019
1 INTRODUCTION
Over the last several years, neural networks (NNs)¹ have become incredibly successful. A huge
variety of neural networks are increasingly deployed in conjunction with robotics, advanced driver
assistance systems (ADAS), security monitors and many other applications. Furthermore, as they
have the theoretical property of being universal approximators, which requires zero domain
expertise, they are increasingly applied to previously unsolved problems, and sometimes replace
existing algorithms, unless of course the original algorithm is of much lower complexity. Note that
the applications listed above are all embedded applications, and there is an increasing interest in
training as well as inference in such environments.
The challenge of deploying these networks lies in their compute and memory intensity, which
poses the largest barrier to adoption, particularly within the embedded space where compute
resources, power and memory are at a premium. Inference often requires billions of operations, and
training for modern algorithms involves tens of single-precision exaflops to converge and
tens of millions of parameters [4]. The interest in applying these techniques in energy constrained
environments has spawned a rise in algorithmic and architectural innovation. Algorithmic
optimizations include topological transformations with pruning and compression schemes. In addition,
the general trend towards transprecision computing [51, 78] can be nicely exploited within this
particular application context. Extreme reduced precision neural networks, for example, which
take datatypes down to ternary or even binary representations, can bring significant hardware cost
savings with minimal accuracy impact, as visualized in Fig. 1 [8].
Fig. 1. Accuracy-Hardware Cost Tradeoffs
Architectural innovation is showcased by Google’s TPU [41], numerous start-up companies such as
Nervana, Graphcore, Groq, and Cerebras, as well as a spectrum of reconfigurable accelerators leveraging
FPGAs. Each of these architectures brings its own inherent benefits. Overall, it is becoming increasingly
difficult to predict which architecture will deliver what performance for which particular neural network.
This poses the key challenge that we address with our benchmark suite.
Benchmarks at their core encompass a suite of tests for evaluating performance or level of quality.
When done well, benchmarking creates clarity by establishing fair baselines and providing representative
comparisons between different platforms and compute fabrics. They act as the antidote to product
marketing and provide system designers a toolbox to avoid making poor choices where end systems fail
to meet requirements such as throughput, power or cost, and delay product launch. The benefits of a good
benchmarking suite go beyond this and provide insights from all perspectives. Benchmarks can be
of high benefit to hardware designers as well as end users. Benchmarks drive optimizations for
semiconductor companies who are customizing compute fabrics for deep learning applications,
and for end users standardized tests help drive optimal purchasing choices. Finally, for newcomers
to the domain, benchmarking suites can offer objective summaries that introduce key figures of
merit and basic choices as well as setting expectations of the state of the art.
¹ We use the terms neural network and model synonymously throughout this article.
This is an extremely complex design space to visualize, as shown in Fig. 2. There are numerous
machine learning applications, and each of these can be trained with different datasets and different
neural network models and variations, and depending on these factors (as well as numerical
representations, learning techniques and hyperparameter selection) can produce different results,
the key figure of merit being test error rate or conversely, accuracy. There are numerous choices
with different hardware platforms within the cloud and IoT spaces and everywhere in between. All
of the implementation alternatives will deliver different performance in tera or giga operations per
second (TOP/s or GOP/s), response time, power consumption, cost and required development effort.
Within this space there are two main types of benchmarks: Machine Learning (ML) benchmarks
and performance benchmarks. ML benchmarks are typically aimed at achieving low test error,
independent of the hardware implications, therefore being of limited efficiency. Examples are the
ILSVRC ImageNet competition, as well as more sophisticated efforts such as MLBench [48].
Performance benchmarks are agnostic of the target application, measuring performance characteristics
such as throughput and power for characteristic compute patterns. Even when tailored towards
characteristic ML workloads, they do not capture the fact that for different hardware architectures,
different compute patterns should be used. Most importantly, they do not correlate their results
regarding algorithmic optimization back to the application level target, which is accuracy, and
therefore do not provide the necessary freedom and scope for algorithmic modifications, an essential
ingredient to extracting performance out of heterogeneous computing systems.
In this paper we present QuTiBench, a benchmarking suite that lies at the intersection of the
machine learning and hardware communities and spans the full design space. QuTiBench couples
neural network performance with hardware performance and as such can provide insights as to
what is the best possible combination within this design space for specific use cases. Although
there are a number of efforts emerging in this space, such as DeepBench and MLPerf, there is
currently no comprehensive benchmarking suite in existence that addresses the scope of what is
needed, and in particular targets embedded systems. QuTiBench is unique in the way we support
quantization (Qu), which is an important optimization technique for neural networks and leveraged
by many specialized hardware architectures. Furthermore, QuTiBench provides multiple tiers of
tests (Ti) which can provide deep insights for the composition of complex systems and provide
tradeoffs between speed and accuracy across a broad range of systems.
The main contribution of this paper is the definition of QuTiBench, which has the following
unique features:
• It is a multi-tiered approach that supports a range of compromises for benchmarking in regards
to quality of prediction and effort. In particular, QuTiBench supports theoretical results as a
measuring stick, different computational patterns for different neural networks, and combinations
of microbenchmarks and full applications for addressing the end user design space.
• It supports algorithmic optimizations and levels of development effort including naive and
optimized implementations, by correlating everything at the application level’s figures of merit.
• In particular, QuTiBench supports different approaches to quantization at all levels, which is
essential for efficient, low power architectures.
• It supports a broad range of applications, both inference and training, and available systems from
cloud to IoT.
QuTiBench is still in its early stages. We hope the community will help make this a valuable
contribution to the Machine Learning field. In this paper we provide the first analysis of theoretical
compute and memory requirements for both applications and candidate hardware platforms, which
forms level 0 of our benchmark suite. We present initial experimental results to validate the
benchmarking methodology, as well as outline plans for the remaining levels.
Fig. 2. Multidimensional Design Space
Fig. 3. Applications, Datasets, Neural Networks
The remainder of this article is structured as follows: We start with background on neural
networks. Sec. 3 analyses the compute and memory requirements of a broad selection of networks.
Sec. 4 provides details on different hardware architectures and how inference and training workloads
can be mapped to them. This provides insights into the spectrum of implementation choices and
how they are represented within the benchmark suite. In Sec. 5 we take a closer look at the key
components, characteristics and challenges of a benchmarking suite in Machine Learning. Sec. 6
describes existing efforts in this space and Sec. 7 introduces the key concepts of QuTiBench. We
evaluate our approach with experimental results in Sec. 8. Sec. 9 concludes the article and presents
future directions. Full experimental results can be found in the appendix.
2 BACKGROUND ON NEURAL NETWORKS
is eort focuses on neural networks (NNs), a class of machine learning algorithms that forms a
subclass of articial intelligence. With its property of being a universal approximator [18], NNs
increasingly outperform and replace existing algorithms. NNs can also provide automation for
previously unsolved applications, where no algorithms exist. No domain expertise is required, just
suciently large datasets together with a suciently large topology for the network to train for a
given accuracy target. ese factors contribute to NN’s popularity.
The design space (see Fig. 3) is complex. For every application there are many different types
of NNs, and new algorithms continue to evolve. Furthermore, different types of datasets can be
used. The resulting combinations can achieve different accuracy targets, and are accompanied by
different compute requirements. Also, a neural network model is always paired with the particular
framework in which it was trained, which can have an impact on the accuracy.
There is a large application space for neural networks (see Table 1) with domains ranging from
vision to natural language processing (NLP) to gaming and recommendation systems. In each
domain, there are numerous tasks which are amenable to neural networks; for example, within
the vision processing context: image classification, object detection, and semantic segmentation.
Furthermore, these models can be trained using different training techniques. Note that it is not easy
to define clear categories as terms overlap. For example, deep reinforcement learning techniques
can be applied to any network. Seq2Seq networks are a full family of networks, while ResNet50,
VGG, and InceptionV3 refer to specific topologies.
Table 1. Breadth of popular ML Tasks and NN Types
Application                                                                     NN Types                                 Compute Type
Learning Technique               Domain          Task                           Models
Supervised                       Vision          Image Classification           MLPs, ResNet, VGG, AlexNet, InceptionV3  FC, CNV
                                                 Object Detection               Faster R-CNN, Yolo9000, Yolov2           FC, CNV
                                                 Semantic Segmentation          Mask R-CNN, SSD                          FC, CNV
                                 NLP             Machine Translation            Transformer, Seq2Seq                     FC, CNV, recurrent
                                                 Speech Recognition             DeepSpeech2                              FC, CNV, recurrent
                                                 Sentiment Analysis             Seq-CNN                                  FC, CNV, recurrent
                                                 Language Modeling              Memory Networks                          memory network
                                 Recommendation  Movies                         NCF                                      …
Unsupervised                     Vision          Feature Extraction             Autoencoder                              FC
Generative Adversarial Learning  Vision          Image Generation/Modification  WGAN                                     NV,DCNV
Deep Reinforcement Learning      Game            Go                             MiniGo                                   …
                                                 Atari ALE                      DeepQ, A3C                               …
Table 1 shows the pool of candidate neural networks that we plan to use as part of our benchmark,
including both inference and training. While there is a large breadth of neural networks, there
are many common layer types being used, which are ideal to form levels 1 and 2 of QuTiBench.
These layer types equate to the basic computational patterns and are based on previous analysis [1].
The most popular compute layers are fully connected, convolutional, pooling, normalization and
recurrent layers. These come with very different compute and memory requirements and are briefly
discussed here. A more detailed description can be found in [75]. Fully connected layers compute
the full cross product between input tensors (for example) and a vector of weights; the latter are
determined during training. Summed with a bias, this is then fed into an activation function. Popular
activation functions include the hyperbolic tangent function and the rectified linear unit (ReLU). In
convolutional layers, the output receives inputs from a small receptive field of the previous layer.
This approach greatly reduces the number of parameters (or weights) involved and allows local
features (e.g., edges, corners) to be found [47]. A basic 2D convolutional layer is similar to a fully
connected layer except that: a) each neuron receives an image as input and produces an image as
its output (instead of a scalar); b) each synapse learns a small array of weights which is the size
of the convolutional window; and c) each pixel in the output image is created by the sum of the
convolutions between all synapse weights and the corresponding images. Recurrent layers are
characterized by the fact that they maintain state over a sequence of input data. There are many
different options for the implementation of the recurrence within the layer, starting from simple
recurrent layers, to GRUs or LSTM layers, which can be uni- or bidirectional, feature different
numbers of feedback gates, and may include numerous specializations such as peepholes and CRCs.
Beyond these basic layer types, there are many layer combinations emerging, such as inception
layers in GoogLeNet [76, 77], residual layers in ResNet models [36], and so-called fire modules [38].
During training using backpropagation with stochastic gradient descent, we need to compute the
derivative with respect to all inputs for these layers. This works out to be similar in compute patterns to
inference with transposed versions of the inputs, whereby a significantly larger amount of compute
and memory is required [75]. However, additional compute such as batch normalization needs to
be addressed.
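To make the fully connected computation pattern described above concrete, the following is a minimal Python sketch, not taken from the article. The function names (`relu`, `fully_connected`, `fc_op_count`) are hypothetical, and counting each multiply-accumulate as two operations is an assumption of this sketch rather than a convention stated in the text.

```python
def relu(x):
    """Rectified linear unit: passes positive values, clamps the rest to 0."""
    return x if x > 0.0 else 0.0


def fully_connected(inputs, weights, biases, activation=relu):
    """Forward pass of one fully connected layer.

    inputs:  list of n_in floats
    weights: list of n_out rows, each holding n_in floats
    biases:  list of n_out floats
    """
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias
        for x, w in zip(inputs, row):
            acc += x * w  # one multiply and one add per weight
        outputs.append(activation(acc))
    return outputs


def fc_op_count(n_in, n_out):
    """Arithmetic operations of the layer, assuming 2 ops per weight (assumption)."""
    return 2 * n_in * n_out
```

For instance, `fc_op_count(4096, 1000)` gives the operation count of a layer with 4096 inputs and 1000 outputs under this counting assumption.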
2.1 Optimization Techniques
As mentioned in the introduction, the challenge lies within the compute and memory requirements
which can often preclude inference deployment within the IoT context. To alleviate the
computational burden and maximize performance, many optimization techniques have been introduced.
Particularly successful techniques include pruning, compression, low rank approximations and
quantization [33]. We discuss quantization, a specic focus of this work, and pruning in more detail
below. All of these techniques fall under the category of algorithmic optimizations. A representative
benchmark supports and measures these, as they are essential for viable deployment solutions.
Quantization & Numerical Representations. Transprecision computing is making strides
in many application domains [51, 78], and is highly effective for neural network inference. In
particular, this includes quantization to reduced precision datatypes, including 8 bit fixed point integer
and below, as well as custom floating point formats. For example, quantized neural networks (QNNs) have
been shown to work extremely well. On smaller image classification benchmarks such as MNIST,
SVHN and CIFAR-10, QNNs achieve state of the art accuracy despite reduction in precision [17, 93],
even for partial or full binarization of fully connected and convolutional layers. XNOR-Net [67]
applies convolutional binarized neural networks (BNNs) to the ImageNet dataset with topologies inspired
by AlexNet, ResNet and GoogLeNet, and reports top-1 accuracies of up to 51.2% for full binarization and
65.5% for partial binarization; for this more challenging benchmark, there is a small but noticeable
accuracy drop. The resulting solution can run significantly faster in hardware and might still pose
an attractive design trade-off. Furthermore, there is significant evidence that increasing network
layer size can recuperate this drop in accuracy [27, 44, 56, 74, 91].
Table 2. Latest Accuracy of QNNs
Network            float top-1 (top-5)  QNN top-1 (top-5)
GoogLeNet          71.4% (90.5%)        63.0% (84.9%)
VGG-like           69.8% (89.3%)        64.1% (85.6%)
ResNet-50 [3, 94]  79.26% (94.75%)      64.6% (85.9%)
ResNet-50 [96]                          64.6% (87.8%)
New quantization schemes show promising results using, for example, Half-wave Gaussian
Quantization (HWGQ) [10] to take advantage of the Gaussian-like distribution of batch normalized
activations. Furthermore, new training and optimization techniques [55, 96] work effectively.
The current lowest error rates for ImageNet classification have been achieved using ternarization
[3, 94] as shown in Table 2. Quantization has been successfully applied to other tasks including
3D object recognition, facial expression recognition [50, 73], optical character recognition as well
as speech [31, 49, 70]. Even in training, research shows that 32 bits are not really needed given the
typical value ranges for weight and activation gradients and weight updates involved. Fixed point
integers, half precision floating point (FP16), bfloat16, Flexpoint or block floating point
representations show state-of-the-art performance [30, 45, 54, 89]. All of these need to be
accurately reflected within the tests.
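The reduced precision representations discussed above can be illustrated with a short Python sketch. It is not the quantization scheme of any cited work: the symmetric uniform scaling, the sign-based binarization and the fixed-threshold ternarization are simple textbook variants chosen as assumptions for this example, and all function names are hypothetical.

```python
def quantize_uniform(values, bits=8):
    """Symmetric uniform quantization to signed integers.

    Returns the integer codes and the scale needed to map them back.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def binarize(values):
    """Binary representation: keep only the sign (+1 or -1)."""
    return [1 if v >= 0 else -1 for v in values]


def ternarize(values, threshold):
    """Ternary representation: -1, 0 or +1, with small magnitudes set to 0."""
    return [0 if abs(v) < threshold else (1 if v > 0 else -1) for v in values]
```

With 8 bits, each value is stored in a quarter of the space of a 32-bit float, and the round-trip error of `dequantize(quantize_uniform(...))` stays within half a quantization step for values inside the covered range.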
Pruning. This is another popular optimization which has been shown to dramatically reduce
memory requirements, through either synaptic pruning or filter pruning. When synaptic pruning is
leveraged, irregular compute patterns result which impact memory access efficiency; thus hardware
architectures require support for sparse matrix representations to benefit from this [31]. Filter
pruning yields regular compute patterns and thereby benefits a broader selection of platforms [33].
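The difference between the two pruning styles can be sketched in a few lines of Python. The selection criteria used here (a magnitude threshold for synapses, the L1 norm for filters) are common choices but are assumptions of this sketch, not criteria named in the article, and the function names are hypothetical.

```python
def prune_synapses(weights, threshold):
    """Synaptic pruning: zero individual weights with small magnitude.

    The zeros land at irregular positions, so the matrix keeps its shape
    and only a sparse representation can exploit the savings.
    """
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def prune_filters(filters, keep):
    """Filter pruning: keep the `keep` filters with the largest L1 norm.

    Whole filters are removed, so the result is a smaller dense layer.
    """
    order = sorted(
        range(len(filters)),
        key=lambda i: sum(abs(w) for w in filters[i]),
        reverse=True,
    )
    kept = sorted(order[:keep])  # preserve the original filter order
    return [filters[i] for i in kept]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total
```

Synaptic pruning leaves the tensor dimensions unchanged and raises `sparsity`, whereas filter pruning shrinks the layer itself, which is why it maps onto dense hardware without special support.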
3 NEURAL NETWORKS AND THEIR COMPUTE AND MEMORY REQUIREMENTS
We analyze neural networks with regards to their arithmetic compute, intermediate storage
requirement and memory footprint. While actual hardware requirements depend on numerous attributes,
at this point we are characterizing the theoretical requirements in an architecturally independent
way. For example, actual on-chip requirements and external memory requirements depend on
implementation choices, but can be derived directly, so this analysis is useful to categorize the
different requirements. The scope of the analysis is currently constrained to the models shown in
Figure 5; the planned scope is listed in the appendix.
Fig. 4. Compute, Buffer and Storage Elements
Inference. Each NN layer (L0, L1, etc.) requires a specific number of arithmetic operations
O_L0, O_L1, O_L2 in the form of multiplies, additions etc. We measure these in giga or tera operations
respectively (GOPs, TOPs). The overall compute of a network with n layers, O_total, is the sum of the
compute in each individual layer (see eq. 1). We define the total model size W_total as the sum of the
weight requirements per layer measured in millions of elements (ME); this is independent of any
choice in numerical representation. The real memory footprint can be derived by multiplying with
the size of the given datatype (for example 32b for single precision floating point). We quantify the
intermediate buffer requirement T_total in an implementation neutral fashion. For this we calculate
the sum of the required amount of tensors T_i that precede each layer. These are derived as the
product of feature map dimensions (w_i, h_i) and number of channels (ch_i). Note that all of this
applies to non-linear topologies such as DenseNet [37]; however, our models currently do not
reflect graph connectivity. We plan to address this in the future.
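\begin{equation}
O_{total} = \sum_{i=0}^{n-1} O_i, \qquad
W_{total} = \sum_{i=0}^{n-1} W_i, \qquad
T_{total} = \sum_{i=0}^{n-1} T_i, \qquad
T_i = w_i \times h_i \times ch_i
\tag{1}
\end{equation}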
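As a minimal sketch of how Eq. 1 could be evaluated for a layer list, the Python below accumulates the three totals and converts an element count into a memory footprint for a chosen datatype width. The `Layer` class, its field names and the helper functions are hypothetical names introduced for illustration; they are not part of the benchmark suite.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    ops: int       # O_i: arithmetic operations of the layer
    weights: int   # W_i: number of weight elements
    width: int     # w_i: feature map width of the tensor preceding the layer
    height: int    # h_i: feature map height
    channels: int  # ch_i: number of channels

    @property
    def tensor_elements(self):
        """T_i = w_i * h_i * ch_i."""
        return self.width * self.height * self.channels


def totals(layers):
    """Return (O_total, W_total, T_total) as plain sums over all layers."""
    o_total = sum(layer.ops for layer in layers)
    w_total = sum(layer.weights for layer in layers)
    t_total = sum(layer.tensor_elements for layer in layers)
    return o_total, w_total, t_total


def footprint_bytes(elements, bits_per_element):
    """Memory footprint for a given datatype width, rounded up to whole bytes."""
    return (elements * bits_per_element + 7) // 8
```

Because the totals are expressed in elements, the same `W_total` can be reused for any numerical representation, for example `footprint_bytes(w_total, 32)` for single precision or `footprint_bytes(w_total, 8)` for 8-bit integers.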
Training. While training is currently the focus in the cloud, we expect that it will become essential
in embedded systems as well, as on-line learning takes off. In regards to requirements, we need to consider
backpropagation in addition to inference. As depicted in Figure 4, training requires additional
data structures. First of all, symmetrically to the tensors T_i, we need to buffer their gradients
TG_i. Furthermore, so-called weight gradients WG_i need to be stored, which are the derivative (in
relation to the input weights) of the gradient TG_{i+1}. Depending on given optimization strategies,
weight updates need to be buffered as well. This results in roughly 3 times the buffer requirements
for weights, and double the amount for tensors. Regarding compute, backpropagation requires
roughly 3 times the inference compute for a single image of the training data set (plus 1 update
operation per weight parameter). Overall compute needs to be multiplied with the number of iterations
and the number of inputs in the training data set. Note that data dependencies are significantly more
intricate and challenging for training. This is currently not reflected within the theoretical analysis.
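The rules of thumb in the paragraph above can be turned into a rough estimator. The sketch below is only an illustration under stated assumptions: it applies the factors 3 (compute and weight buffers) and 2 (tensor buffers) exactly as quoted, adds one update operation per weight, and treats `iterations` as the number of passes over the data set, which is an interpretation made for this example. The function name and dictionary keys are hypothetical.

```python
def training_estimate(o_total, w_total, t_total, num_inputs, iterations):
    """Rough training requirements derived from the inference totals.

    o_total, w_total, t_total: inference compute, weights and tensor elements
    num_inputs: number of inputs in the training data set
    iterations: number of passes over the data set (assumed interpretation)
    """
    # Roughly 3x inference compute plus one update operation per weight.
    ops_per_input = 3 * o_total + w_total
    return {
        "ops_per_input": ops_per_input,
        "ops_total": ops_per_input * num_inputs * iterations,
        # Weights, weight gradients and weight updates.
        "weight_buffer_elements": 3 * w_total,
        # Tensors and their gradients.
        "tensor_buffer_elements": 2 * t_total,
    }
```

The estimate ignores the data dependencies mentioned in the text, so it is a lower-bound style figure rather than a prediction of achievable training time.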
Summary of Requirements. Figure 5 visualizes initial results, where for Seq2Seq models, we
assume a sequence length of 3000 (based on the LSTM test case in DeepBench [20]). The key
observations are as follows: First, the compute and memory requirements are on average very
high. Mean model size is too big to fit into most on-chip low latency memory (with 71.14 MBytes),
and compute is in the GOPs range for every single input datum. Second, there is a significant
variation in all requirements for both training and inference as summarized in Table 3. No simple
generalizations can be made, even within subcategories such as image recognition, as models
vary greatly depending on size and complexity of images, number of objects to be recognized,
etc. The defined parameters O_total, W_total, T_total, OT_total, WU_total, and TG_total help describe the
compute requirement for inference and training of each individual network and can be used for
baseline computations, taking architectural constraints into consideration, and cross-correlated
with roofline models to provide rough performance guidance.
Fig. 5. Compute Requirements Training and Inference for a spectrum of NNs (Visualization of QuTiBench
Level 0 - CNN Statistics)
Table 3. Ranges and Mean Requirements
      Inference                                            Training
      OI_total [GOPs]  W_total [MBytes]  T_total [MBytes]  OT_total [GOPs]  WU_total [MBytes]  TG_total [MBytes]
Min   0.00             0.00              0.13              0.00             0.27               0.00
Max   412.17           71.14             138.34            1236.64          276.69             71.14
Mean  62.59            11.9              38.02             187.79           76.05              11.9
Assuming 8b datatypes for inference and 32b for training.
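The cross-correlation with roofline models mentioned above can be sketched as follows. This is the generic roofline bound (attainable performance is the smaller of peak compute and arithmetic intensity times memory bandwidth), not a procedure taken from the article. The function names are hypothetical, and the assumption that all weights and tensors are moved through external memory once per input is a simplification made for this example.

```python
def arithmetic_intensity(ops, bytes_moved):
    """Operations performed per byte transferred from memory."""
    return ops / bytes_moved


def attainable_ops_per_s(peak_ops_per_s, mem_bw_bytes_per_s, intensity):
    """Roofline bound: limited either by peak compute or by memory bandwidth."""
    return min(peak_ops_per_s, intensity * mem_bw_bytes_per_s)


def is_memory_bound(peak_ops_per_s, mem_bw_bytes_per_s, intensity):
    """True when the bandwidth roof lies below the compute roof."""
    return intensity * mem_bw_bytes_per_s < peak_ops_per_s


# Hypothetical network: 4 GOPs per input and 100 MB of weights plus tensors,
# assumed to be read from external memory once per input.
intensity = arithmetic_intensity(4e9, 1e8)
# Hypothetical platform: 10 TOP/s peak compute, 100 GB/s memory bandwidth.
bound = attainable_ops_per_s(10e12, 100e9, intensity)
```

In this made-up case the platform is memory bound, so reducing the datatype width (and thereby the bytes moved) would raise the attainable performance, which is one way the level 0 numbers can guide implementation choices.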
4 HARDWARE ARCHITECTURES FOR DEEP LEARNING
We discuss target hardware systems, their architectures and implementation alternatives. While we
present details on cloud platforms, the focus of this article is on embedded systems. There is a huge
range in the types of hardware architectures used for machine learning applications, including
CPUs, GPUs, FPGAs and specialized architectures. The field has spawned significant new research
in computer architecture and created so-called deep learning processing units (DPUs), which are
specialized for this application domain and can be implemented either with ASICs or in FPGAs.
Architectures can broadly be classified by the basic type of compute operation, memory bandwidth,
level of parallelism, degree of specialization and inherent precision support. CPUs are widely
used for ML applications, and are viewed as serial compute engines, optimized for single thread
performance, with implicitly managed memory hierarchies (including three levels of caches), and
support floating point operations. GPUs are vector processors that support smaller floating point
formats (FP16) natively, most recently fixed point 8bit integer formats, and have a mix of implicitly
and explicitly managed memory. DPUs, such as Google’s Tensor Processing Unit (TPU), work with
tensors, have explicitly managed and specialized memory hierarchies and support integer operations.
With newer generations, the boundaries between different hardware architectures are blurring.
CPUs are usually multicore to support parallel processing, and incorporate vector processing units,
GPUs are adding tensor processing units, and the TPU now supports floating point operations.
Table 4. Hardware Architectures for Cloud Systems with Theoretical Performance (QuTiBench Level 0 -
Hardware Platform Statistics)
Platform Num. Choice Throughput [TOPs] Mem BW [GBps] Power [Watt] Performance/Power [TOPs/Watt]
GPUs
NVIDIA V100 [22] FP32 14 250 300 0.06
NVIDIA V100 FP16 112 250 300 0.45
NVIDIA P100 [61] FP32 8 732
NVIDIA P100 FP16 16 732
NVIDIA P40 [64] INT8 47 200 346 0.24
NVIDIA P4 INT8 22 60 192 0.37
AMD Vega10 [87] FP32 13.7 484 345 0.04
TPUs
Google TPUv1 [71] INT8 92 75 34 1.23
Google TPUv2 [41] FP16 45 600
Google TPUv3 [80] FP16 90
ASIC DPU
Graphcore Custom 224 300 0.75
Groq unknown 400 8
Nervana custom16 55
Wavecomputing 1DPU INT8 181 271 0.7
FPGA DPU
Xilinx VU9P 2b/8b 93.00 88 100 1.06
Xilinx VU9P 2b/4b 139.88 88 100 1.59
Xilinx VU9P 2b/2b 192.52 88 100 2.19
Microso Brainwave Stratix X [14] FP8 90 125 0.72
Table 5. Low Power Hardware Architectures and Theoretical Performance (QuTiBench Level 0 - Hardware
Platform Statistics)
Platform Num. Choice Throughput [TOPs] Mem BW [GBps] Power [Watt] Performance/Power [TOPs/Watt]
CPUs
Bitserial Cortex-A57 on Jetson TX1 [85] BIN 0.09 0.019
GPUs
NVIDIA TX2 (MaxP) [26] FP32 .575 59.7 15.0 0.038
NVIDIA TX2 (MaxP) [26] FP16 1.15 59.7 15.0 0.077
ASIC DPU
Movidius Myriad 2 [7] INT8 .15 1.2 0.125
Movidius Myriad X [60] INT8 1 1 1
Kalray MPPA Turbocard3 [43] FP32 1.6 110 0.014
BinarEye [58] BIN 0.09 - 2.8* 230†
BNN Custom Fabric [5] BIN 1.4 0.6 2.3
Stripes Bitserial ASIC [42] BIN 128.5 4.3
IBM AI Accelerator [39]2 BIN 12
Eyeriss [13] INT16 0.084 1.17 †
ARM ML Processor [6] unknown 4.6 3
DianNao [12] INT16 0.452 120 0.485 0.93
EIE(28nm) [32] INT4 3 (0.102 sparse) 2.36 1.27 2.4 (0.08 sparse)
Cambricon-X [92] INT16 0.544
FPGA DPU
Lattice SenseAI [46] BIN 1.4 0.6 2.3
Bismo bitserial on PYNQ [86] BIN 6.5 4.64 1.4
FINN on ZC706 [84] BIN 11.6 0.408
ZCU104 (Deephi-666MHz) INT8 4.60 19.2
ZCU104 (Theoretical-775MHz) INT8 5.36 19.2
GX1150 on HARPv2 [59] BIN 0.041 0.85
* Measured
† Chip level power consumption only
FPGAs can support any of the above configurations with explicitly managed memory. FPGAs are the
most flexible of all target hardware, and can be configured to support any numeric representation,
even bit-serial hardware architectures which provide run-time configurable precision. Custom
ASIC implementations, which minimize hardware cost and maximize performance, have emerged
to exploit specific precision arithmetic and customized memory systems. Tables 4 and 5 list many
of these hardware targets along with published performance numbers.³ One of the goals of
QuTiBench is to provide a more systematic way to compare performance and accuracy between
these systems, rather than relying on vendor reported metrics.
NVIDIA GPUs are some of the most popular hardware targets for machine learning, and newer
families of chips have been introduced to specifically accelerate this task. For example, the Volta
architecture, introduced in 2018, was particularly designed to accelerate AI and incorporates tensor
cores as a new feature, as well as improved FP32 and FP64 support for training in a data center
setting [22]. AMD announced the Vega GPU [24] with new deep learning instruction set operations,
with the goal of obtaining parity with NVIDIA’s high-end Tesla V100 datacenter GPUs. Both
companies have low power GPUs: the AMD Vega mobile GPU [34] and NVIDIA Jetson TX2 [26].
Google introduced its TPU in 2016 [71], which was designed to accelerate Google’s TensorFlow
framework. The first generation supported integer arithmetic with a massively parallel 8-bit matrix
multiply engine. The second generation TPU was announced in May 2017 [41], and the third
generation in May 2018 [80]. These newer chips boast improved memory performance as well as
support for floating point specifically aimed at training.
There are a number of startups introducing custom hardware in this space. Within the cloud
space, there are Graphcore, Cerebras, Groq, and Wave Computing. Within the embedded space,
where the design constraints are even more stringent, we find even more, as listed in Table 5.
Most are secretive about the details of their designs, and this landscape is rapidly changing. Intel is
investigating several custom accelerators including Nervana and Movidius. Fathom [7] is Movidius’
ultra low power Neural Compute Stick which operates at about 1 Watt. At the extreme, binarized
neural networks, which offer very high throughput at extremely low power, are exploited in the
following ASICs: BinarEye [58], BNN Custom Fabric [5], Stripes Bitserial ASIC [42], and IBM AI
Accelerator [39]. Others exploit sparse computing engines, such as EIE and its successor ESE [31],
SCNN [66], Cnvlutin [2], Cambricon-S and Cambricon-X [92].
FPGAs are an extremely popular platform for machine learning. As they are highly flexible and
can be used in a variety of different configurations and support any arithmetic format, they can be
fully customized towards specific neural network topologies, thereby achieving high performance
and efficiency. However, for the same reason, they are extremely difficult to characterize in general.
FPGAs are available in the cloud, such as the Xilinx Ultrascale+ VU9P available as part of the
public Amazon Web Services (AWS) cloud infrastructure. Within the embedded space, we have
pioneered the first binarized neural network accelerators [27, 84] and provided many proof points
for customized reduced precision implementations [8]. Umuroglu et al. [86] demonstrate that
run-time programmable precision can be achieved with a bitserial approach, providing highly
attractive performance on FPGAs, with little overhead. Intel FPGAs have also been successfully
applied to machine learning applications using a range of different numerical representations [63].
The Microsoft Brainwave project [14] aims at applying FPGAs at datacenter scale using their own
custom floating point representation. Focusing on the IoT market, Lattice has announced binarized
neural network libraries targeting low power FPGAs and achieving 1TOPS/Watt [46].
5 CHARACTERISTICS & CHALLENGES IN BENCHMARKING
5.1 Key Components of a Benchmark
A benchmark can be dened as a set of standards used for evaluating performance or level of
quality. A more practical denition implies that the “set of standards” is supplied in the form of
a well-dened set of executable tests and measured regarding a specic set of gures of merit.
Sometimes additional items are included such as performance analysis or proling tools which
³ These tables form part of level 0 of our benchmark suite and can be used as a basis for performance estimation.
can help shed light on system bottlenecks. Test infrastructure or a testbed can be provided to
ensure reproducibility. This makes particular sense when specialized and not easily available
hardware systems are involved. Data management can be handled together with the benchmark
suite and stored in an accessible location as for example with DAWNBench [16], MIT’s Eyeriss
project [25] and the Request tournaments online score card [68]. In this article we differentiate
profiling tools, test infrastructure, and measurements from the actual benchmark test suite (see
Fig. 6). Somewhat related to benchmarking are model zoos, such as OpenAI Gym [9] and rllab [21],
which are selections of sample code. They are not necessarily aiming to be representative, and
typically include simplified implementations to teach concepts. QuTiBench focuses initially on the
benchmark suite and measurements.
5.2 Characteristics
Fig. 6. Benchmarking collateral
Benchmarking can bring many insights. For end-users and system designers, it helps to estimate
expected system-level performance and provides an understanding of what algorithms work best on
which hardware platform. For hardware designers, benchmarks provide design perspectives and
clear-cut guidelines regarding what figures of merit matter and what workloads look like. Neural
networks are pushing the limits of what is possible; therefore careful system level co-design of
hardware and algorithms, and realistic expectations of what is achievable given the design choices
using benchmarking, are crucial. To bring maximum benefit, the following characteristics, which
are discussed in greater detail below, are essential:
• representative of common workloads
• supportive of algorithmic modifications
• objective and reproducible
• portable to heterogeneous hardware systems
• complexity vs accuracy tradeoff
• adaptive “living” benchmark supported by industry and academia
Representative Benchmarks need to be representative of real world workloads. In machine
learning, this requires breadth across a spectrum of applications, algorithms and computational
patterns. Computational patterns are important to maximize insights into different hardware
architectures. Application coverage is essential as it provides more holistic insights into system
level performance which can be hard to predict given the emerging complexity of increasingly
heterogeneous hardware systems.
Support for algorithmicmodicationAlgorithmic modications are inevitable to extract best
possible performance out of diverse hardware systems, for example to take advantage of caching
and parallel hardware resources. Within machine learning, soware and hardware co-design
are compulsory [29] for energy constrained compute environments. To support this algorithmic
freedom within the benchmark suite, application coverage is essential, as we correlate hardware
performance independent of the algorithm back to application performance, which is equivalent
to accuracy in this context. However, optimized performance alone is not sucient, as not every
system designer may be able to achieve it. We also need to reect the out-of-the-box, naive
performance. Both optimized and naive are representative of a specic hardware platform, and the
dierence gives a good indication of the development eort involved. We believe both should be part
of the benchmarks and be captured together with development time or lines of code. Specically
for neural networks, quantization, compression, topological changes and pruning techniques are
important optimization techniques that need to be considered.
Objective & Reproducible To provide clear differentiation between marketing and scientific
efforts, reproducible and objective results that do not favour any particular system configuration or
hardware architecture are needed. Reproducible results are a key ingredient in the move towards
Open Science. However, what does reproducibility actually entail? In the context of the plethora
of esoteric AI accelerators, is it sufficient that an objective third party has validated the results?
Or does it imply that everyone on the planet should be in a position to reproduce the results if
they had access to the system at a reasonable cost? Some hardware systems are too expensive;
for example, an NVIDIA V100 may be beyond someone’s budget. Other hardware choices are only
available for rent, such as Google’s TPU versions as part of Google cloud.
Portability is a challenging subject as specialized hardware architectures come with their own
design entry languages and compiler tool stacks. The community is fragmented by a huge choice
of frameworks including Caffe, TensorFlow, MXNet, Theano, PyTorch and Darknet. What is more,
the prediction accuracy of a network depends on the choice of framework, since training data is
passed through different preprocessing stages and numerical inaccuracies accumulate and manifest
themselves as discrepancies. These inaccuracies are exacerbated by the characteristics of floating
point arithmetic [28]. As a result, models and frameworks are inherently tied together. There
are three basic choices. The first is to constrain ourselves to exactly one framework as was done
with Fathom [1]. Second, we could support all frameworks. However, given that we are dealing
with different hardware backends, this causes an explosion in test infrastructure, as the number of
tests multiplies with the number of frameworks. The final choice, and probably the cleanest, is to
support one of the intermediate neural network representations such as ONNX [65], NNEF [62]
or TVM [83], which provide translation between all popular frameworks. However, this requires
hardware vendor support, which is currently limited.
Complexity vs Speed vs Accuracy Speed of result is essential, as the key purpose of a benchmark
is to provide faster insights than developing the full end-system. There is a trade-off between
speed, benchmark complexity and the accuracy of the results. Benchmarks which provide application
and algorithmic breadth may require a large number of tests, thus making the benchmark suite
inherently complex and limiting its usefulness. Sometimes it is important to have
less accurate predictions at a faster rate, and, for different users, different tradeoffs are acceptable.
Adaptive As machine learning is a highly active research field where algorithms change fast,
the benchmark suite should be adaptive and able to incorporate emerging popular algorithms,
compute patterns and end applications.
6 RELATED WORK: EXISTING BENCHMARKING
In this section we take a look at existing benchmarks, and compare them regarding algorithmic
scope and figures of merit. QuTiBench differs from these efforts in a number of ways:
• Existing benchmarks do not address the fact that heterogeneous hardware platforms typically
require co-designed algorithms, and offer flexibility in precision for datatypes specifically, although
MLPerf has open models for training. We introduce correlation of application and architecture
figures of merit to compare different combinations of algorithms and architectures at the application
level.
• We offer full visualization of the design space, rather than comparing performance for fixed
levels of accuracy. Thus, interesting trade-offs can be highlighted.
• None of the existing benchmarks offer the same level of tiering, including theoretical level, and
stacks of microbenchmarks that can help isolate problematic data movement patterns and tensor
dimensionalities.
• Finally, there is a difference in scope. Most benchmarks currently focus foremost on training.
Fig. 7. Categories of Benchmarks and corresponding Figures of Merit
In the following, we expand and elaborate on the differences in greater detail. For this, we
differentiate between ML benchmarks, performance benchmarks and NN system benchmarks.
ML benchmarks exclusively focus on application performance, which is accuracy. There is no
consideration of compute effort required or resulting execution time. Performance benchmarks
record hardware performance only, specifically throughput (measured in processed inputs per
time or TOP/s), latency or response time in milliseconds (ms), and power consumption in Watts.
Performance benchmarks only look at hardware performance and are agnostic of the application.
NN system benchmarks, as shown in Figure 7, lie at the intersection and are at the heart of what
we are striving for. They combine all figures of merit; both system performance and accuracy
are correlated. In addition, functional correctness even during performance testing needs to be
ensured.
6.1 NN System Benchmarks
QuTiBench falls into this family of benchmarking suites which are unique in that they combine
representative machine learning workloads with figures of merit from hardware performance
benchmarks. BenchIP [79] is a benchmarking suite which has a broad set of machine learning tasks.
Similar to QuTiBench, BenchIP adopts a multi-tiered approach with micro- and macro-benchmarks.
However, BenchIP does not support the theoretical layer, which we use to cover compute efficiency
and track benchmarking results. BenchIP also doesn’t cover level 2, namely stacks of layers, which
we believe bring great merit in isolating bottlenecks in data movement and highlighting problematic
dimensionality in tensors. Finally, BenchIP does not offer the concept of comparison via pareto
curves which is essential to a) visualize the full scope of potential solutions within the design
spectrum, and b) provide the necessary scope for algorithm optimizations matching the specifics of
various accelerators. Fathom [1] is probably the first attempt to provide a representative workload
for benchmarking that has algorithmic breadth beyond convolutional neural network inference
and includes example training and unsupervised learning such as reinforcement learning and
recurrent models. However, Fathom does not address the spectrum of numerical representations.
It also does not support heterogeneous hardware platforms. In regards to framework strategy,
Fathom advocates a unified software package, relying on compatible software stacks to emerge,
and therefore only supports one framework, TensorFlow. With a primary focus on benchmarking
for training and achieving application coverage rather than algorithmic breadth, TBD [95] adopts
some of the concepts introduced in Fathom. It supports more frameworks and datasets and covers a
range of applications, including image classification, machine translation, object detection, speech
recognition, adversarial and deep reinforcement learning. MLPerf [57] is a promising approach at
providing system level benchmarks. Similarly to Fathom and TBD, it covers a representative range
of applications adding sentiment analysis and recommendation as target applications. It currently
considers only training but inference is in process. MLPerf is created by a consortium of industry
partners and universities, which should address objectivity criteria. Its key strengths are explicitly
defining figures of merit and its strong industrial support. It provides the concept of open models,
which allow for algorithmic optimizations that facilitate performance improvements for specific
architectures. However, it does not explicitly support quantization.
DAWNBench [16] exclusively looks at ImageNet classification for training and inference. The
benchmark sets very clear figures of merit such as “Time taken to train an image classification
model to a top-5 test accuracy of 93% or greater” and “Latency required to classify one ImageNet
image using a model with a top-5 test accuracy of 93% or greater” and as such supports the concept
of algorithmic optimizations by tying hardware performance to accuracy achieved at the application
level but falls short of visualizing the full design space. Finally, DAWNBench does not provide
further insights beyond the specified figures of merit, and is limited in application scope.
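A time-to-accuracy figure of merit of the kind quoted above can be sketched as follows. This is a minimal illustration and not DAWNBench code: the function name `time_to_accuracy`, the log format of (elapsed seconds, accuracy) pairs and the sample values are all assumptions.

```python
def time_to_accuracy(log, target):
    """Return the elapsed time at which accuracy first reaches `target`.

    `log` is a chronologically ordered iterable of (elapsed_seconds, accuracy)
    pairs; None is returned if the target is never reached.
    """
    for elapsed_seconds, accuracy in log:
        if accuracy >= target:
            return elapsed_seconds
    return None


# Hypothetical training log: (seconds since start, top-5 validation accuracy).
training_log = [(600, 0.80), (1200, 0.91), (1800, 0.93), (2400, 0.94)]
print(time_to_accuracy(training_log, 0.93))  # 1800
```

Because the metric fixes the accuracy threshold, any algorithmic change that still reaches the threshold is admissible, but design points above or below it are not visible in the result.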
The Collective Knowledge Framework [15] in conjunction with the ASPLOS Request Tournament
[68], while narrow in scope (limited to ImageNet Classification inference), opens up the design
space for different hardware accelerators, facilitating architecture specific algorithmic transformations
and correlation between accuracy and performance and power within a larger design space.
This is essential to support heterogeneous hardware architectures. ASPLOS excels in reproducibility,
leveraging ACM artifact evaluation technology, and providing insight into hardware performance
and error rate trade-offs, through an online scorecard.
6.2 ML Benchmarks
The Machine Learning community has defined its own benchmarks which have an exclusive
focus on achieved accuracy independent of the required compute, employing ensemble techniques
and multi-crop which, in essence, linearly scale up the compute load per input data. The most
popular of these is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [69]. The
associated compute requirements are unrealistic, particularly when deployed in energy-constrained
environments. CortexSuite [81] and BenchNN [11] are limited to measuring accuracy, where
CortexSuite is constrained to perception and cognition while BenchNN shows the value of machine
learning for approximate computing, based on 5 out of the 12 recognition, mining and synthesis
applications from the PARSEC benchmark suite. DjiNN and Tonic [35] focus on deep learning
tasks for warehouse scale computers including image, speech processing and natural language
processing. While Kaggle (www.kaggle.com) isn’t specifically a benchmark, it hosts a portfolio of
data science challenges where the machine learning community competes with the latest topologies
and algorithms for highest accuracy. MLBench [48] compares human derived learning algorithms
against machine learning services from Amazon and Microsoft Azure.
6.3 Performance Benchmarks
DeepBench [20] is probably the most successful suite of microbenchmarks for neural network
performance that measures and compares basic compute operations. It individually benchmarks
direct convolutions, matrix multiply, and a specific LSTM layer for single precision, half precision
floating point and, for some operations, 8b fixed point integer datatypes on hardware architectures.
It currently features cloud deployment and some embedded data points on Raspberry Pi and
iPhone. It captures the most popular compute patterns, however lacks support for lower precision
Table 6. Benchmarks, Applications, Datasets and Models
Application MLPerf Fathom TBD BenchIP
Domain - Task Dataset - Model Dataset - Model Dataset - Model Dataset - Model
Supervised Learning
Vision - Image Classication ImageNet - ResNet ImageNet - ResNet ImageNet1k - ResNet50 ImageNet - ResNet
ImageNet - VGG, AlexNet ImageNet1k - InceptionV3 ImageNet - VGG, AlexNet
Vision - Image Classication MNIST - LeNet-5
Vision - Object Detection COCO - Pascal VOC 2007 - Faster R-CNN Pascal VOC 2012 - Faster R-CNN
Vision - Semantic Segmentation Mask R-CNN - - Pascal VOC 2012 - DeconvNet
Vision - Image Captioning - - - Visual Gnome - FCLN
Vision - Video Captioning - - - MSVD - S2VT
Vision - Face Recognition - - - LFW - Deep Face Recog
NLP - Machine Translation WMT Eng-German - Transformer WMT-15 - Seq2Seq IWSLT15 - Seq2Seq English WSJ - SyntxNet
NLP - Machine Translation IWSLT15 - Transformer
NLP - Speech Recognition Librispeech - DeepSpeech2 TIMIT - DeepSpeech Librispeech - DeepSpeech2 RNN - WSJ
NLP - Sentiment Analysis IMDB - Seq-CNN - - -
NLP - Language Modeling - babI - Memory Networks - -
Recommendation - Movies MovieLens-20M - NCF - -
Unsupervised Learning
Vision - Feature Extraction - MNIST - Autoencoder - -
Vision - Adversarial Learning - - Downsampled ImageNet - WGAN -
Recommendation - - - -
Deep Reinforcement Learning
Game - Go Go - Mini-Go
Learning - Atari ALE Atari ALE - Deep Q Atari2000 - A3C
datatypes, and exclusively investigates performance. As such it does not provide the mechanisms
to tie algorithmic modifications back to the application level, nor provide insights into compute
performance for reduced precision representations. DeepBench also doesn’t cover data movement
bottlenecks between layers, as well as potential bottlenecks around buffering state, as required for
LSTMs for example, where capacity and access latency crucially impact overall speed.
There are more general, machine learning agnostic, hardware benchmarks such as TPC [82]
for the data processing community, SHOC [19], SPEC [72] and STREAM [52]. SHOC looks
specifically at how to benchmark heterogeneous hardware systems using OpenCL as design entry.
Similar to QuTiBench, SHOC deploys microbenchmarks combined with application benchmarks
and is multi-tiered. SPEC includes a broad range of applications including graphics, MPI, mail
servers, virtualization, and storage, and STREAM exclusively focuses on memory bandwidth. None
are specifically designed for machine learning or address the challenges of this application
domain. gemmlowp [23], while it is not a benchmark, is specifically designed for matrix multiply
operations; it includes low precision operations which may be suitable as a basis for implementation
of part of our benchmark suite.
Summary Overall, support for algorithmic optimization is limited across the whole spectrum
of benchmarks, in particular in regards to quantization and pruning. None of the benchmarks
above provide a multi-tiered approach in the same way we do. Such tiers can provide understanding
of compute and data movement bottlenecks within the system, or offer theoretical levels with
efficiency tracking. None of the benchmarks offer a fair comparison for co-designed algorithms and
full design space visualization. In Tables 6 and 7, we summarize the application scope of existing
and our proposed benchmark, as well as the key differentiators between existing benchmarks and
our proposal, and discuss in Sec. 7 how we address these characteristics.
7 THE BENCHMARK PROPOSAL
The targeted design space is vast and comprised of a multidimensional spectrum of algorithmic
and architectural co-designed end solutions. The aim of the benchmark is to expose the spectrum of
possibilities and accurately reflect the capabilities of the different hardware platforms. QuTiBench
has the following key characteristics: We take a multi-tiered approach which is one of our key
contributions (Fig. 8). We tier the benchmark suite with respect to abstraction levels as well as
Table 7. Feature Comparison of Existing Benchmarks and QuTiBench
Criteria MLPerf DeepBench DawnBench Fathom TBD BenchIP QuTiBench
Machine Learning Task
Training yes micro yes micro yes yes planned
Inference planned micro yes micro yes yes (Sec. 8)
Coverage - see Table 6
Applications broad narrow broad broad broad
Compute Paers broad medium narrow broad broad broad broad
Data Movements broad
Support for Algorithmic Optimizations limited limited yes (Sec. 8)
Full Design Space Representation yes yes yes (Sec. 8)
Deployment Target
Cloud yes yes yes yes yes yes planned
Embedded yes yes yes yes (Sec. 8)
Benchmark Abstraction
Theoretical yes
Microbenchmarks Compute yes yes yes yes (Sec. 8)
Microbenchmarks Data Movement yes (Sec. 8)
Full Applications yes yes yes yes yes (Sec. 8)
Speed vs Accuracy Tradeoff limited yes (Sec. 8)
Bottleneck Insights yes yes (Sec. 8)
Reproducibility yes yes yes yes yes planned planned
Fig. 8. A Multi-Layered Approach with Precision Support
numerical representations for both training and inference tasks. This provides not only attractive
compromises in regards to speed versus minimal discrepancy with target workloads, but also brings
advantages such as additional system level insights.
e second key dierentiator of our approach is the support for algorithmic optimization by
coupling hardware performance with accuracy at the application level. In particular, this allows for
objective comparison between oating point implementations and reduced precision models that
can achieve much higher performance at a signicantly reduced energy cost, among many other
possible optimization strategies. Results are visualized via pareto graphs (accuracy versus latency,
throughput and throughput/power) and optimal solutions can be found along the pareto frontier.
ird, we include a theoretical level as a baseline for benchmarking and performance estimation.
The unique characteristics of QuTiBench include test suites at various abstraction levels,
algorithmic optimizations and quantization, in particular considerations in regards to datasets,
hyperparameters and framework challenges, such as reproducibility and adaptability (see [53]).
Multiple Tiers - Abstraction Levels We defined 4 levels of abstraction (Fig. 8) discussed below.
Level 0 - Theoretical Records for all target hardware backends theoretically possible peak
performance (TOps or GOps), external memory bandwidth (GBps), thermal design power (Watts)
and cost ($), and for all models their compute and memory requirements; datapoints are shown
in Sec. 3 and 4. Combining application requirements with hardware platform characteristics
can be leveraged for performance predictions using roofline models [88]. Level 0 is a base layer,
with results that are available instantly. It provides a target point of reference and guidance for
optimization efforts, and allows metrics such as achievable compute efficiency to be computed. At level
0, we already introduce the notion of performance per datatype operation which is essential to
support quantization as an algorithmic optimization.
Two tables are presented in the appendix, one for hardware characteristics and one for neural
networks. The hardware table has one row per hardware platform and supported native datatype;
a minimum of Half Precision (FP16), Single Precision (FP32) and INT8 are recorded.⁴ In the
second table, for each CNN, we record four values: total number of compute operations for a single
input, the model size, the size of the state and the total amount of tensors in between layers that
require buffering. These values can be used as a basis to derive memory requirements and compute
requirements for both inference and training; examples are shown in Figure 5.
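As a rough illustration of how entries of the network table could be derived, the sketch below counts the operations, weight storage and output activations of a single convolutional layer. The layer shape, the convention of counting one multiply-accumulate as two operations, and the name `conv2d_requirements` are assumptions made for illustration, not definitions taken from the benchmark.

```python
def conv2d_requirements(in_ch, out_ch, kernel, out_h, out_w, bits_per_weight=32):
    """Return (operations, weight_bytes, output_activations) for one conv layer.

    Assumes a square kernel, no bias, and that one multiply-accumulate
    counts as two operations (an assumed convention).
    """
    weights = in_ch * out_ch * kernel * kernel
    macs = weights * out_h * out_w          # one MAC per weight per output pixel
    operations = 2 * macs
    weight_bytes = weights * bits_per_weight / 8
    output_activations = out_ch * out_h * out_w
    return operations, weight_bytes, output_activations


# Hypothetical first layer: 3 -> 64 channels, 3x3 kernel, 224x224 output.
ops, fp32_bytes, activations = conv2d_requirements(3, 64, 3, 224, 224)
_, int8_bytes, _ = conv2d_requirements(3, 64, 3, 224, 224, bits_per_weight=8)
print(ops, fp32_bytes, int8_bytes, activations)
```

Summing these values over all layers gives the per-network totals; changing `bits_per_weight` shows directly how quantization shrinks the model size while the operation count stays the same.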
Level 0 - Rooine Analysis Using assumptions for where weights, tensors, gradients, weight
updates and state of a neural network are stored, combined with the size of the datatypes used,
allow us to derive the arithmetic intensity of a neural network during training and inference.
Combined with the rooine for a given hardware platform, we can provide insight as to whether a
neural network will be memory or compute bound and guidance for what is theoretically possible
(Fig. 9) .
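The roofline reasoning can be sketched in a few lines: attainable performance is the smaller of the peak compute rate and the memory bandwidth multiplied by the arithmetic intensity. The function names and the platform numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def arithmetic_intensity(operations, bytes_moved):
    """Operations performed per byte transferred from external memory."""
    return operations / bytes_moved


def attainable_performance(peak_ops_per_s, mem_bw_bytes_per_s, intensity):
    """Roofline bound: limited by either peak compute or memory bandwidth."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * intensity)


def is_memory_bound(peak_ops_per_s, mem_bw_bytes_per_s, intensity):
    """True if the bandwidth roof lies below the compute roof at this intensity."""
    return mem_bw_bytes_per_s * intensity < peak_ops_per_s


# Hypothetical platform: 4 TOp/s peak and 20 GB/s external memory bandwidth.
PEAK = 4e12
BANDWIDTH = 20e9

# Hypothetical workload: 5e9 operations while moving 1e8 bytes (50 Op/byte).
intensity = arithmetic_intensity(5e9, 1e8)
print(attainable_performance(PEAK, BANDWIDTH, intensity))
print(is_memory_bound(PEAK, BANDWIDTH, intensity))
```

With these assumed numbers the ridge point sits at 200 operations per byte, so a workload at 50 operations per byte is memory bound. Halving the datatype width halves the bytes moved and doubles the intensity, which is one way reduced precision moves a network towards the compute roof.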
Level 1 - Compute Patterns Level 1 exposes achievable compute performance for typical
compute patterns encountered within neural networks, which equates to popular layers including
convolutions, fully connected layers, recurrent layers, residual layers, and squeeze layers, over
a range of dimensions and with different numerical representations (Sec. 2). These tests are
comparable to DeepBench [20], with the significant difference that we provide much broader
support for specialized numerical representations. For each of these compute patterns, and for both
inference and training, we record the following figures of merit: measured performance (TOps or
GOps), latency (ms), power consumption (Watts) of the full platform in the embedded space, and
of the board excluding the host system in the cloud.⁵ While level 1 does not capture application
level accuracy, the tests will include verification of functional correctness. The results should
reflect achievable compute performance, excluding potential bottlenecks for moving data which
are addressed in level 2. While requiring execution, the tests at level 1 are relatively rapid. We
include a sweep over batch and thread sizes.
Level 2 - Compute & Data Movement Level 2 is comprised of simple combinations of level 1
tests, and can thereby effectively capture potential bottlenecks such as tensor movement between
layers, as well as storage requirements. It considers stacks of level 1 layers and only includes a
subset of all possible combinations to keep test time to a minimum. We include mixed precision
between layers in these small template stacks for both inference and training. Figures of merit are
identical to level 1. In particular, the latency variation between level 1 with single fused layers and
level 2 with layer stacks will bring insight into data movement and buffering bottlenecks.
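One possible way to quantify this latency variation is to compare the measured latency of a layer stack with the sum of the latencies of its layers measured in isolation. The sketch below is an assumed interpretation rather than a metric defined by the benchmark; the function name and all latency values are hypothetical.

```python
def data_movement_overhead(layer_latencies_ms, stack_latency_ms):
    """Return (overhead_ms, overhead_fraction) of a level 2 stack.

    `layer_latencies_ms` holds level 1 latencies of the individual layers;
    `stack_latency_ms` is the level 2 latency of the same layers stacked.
    """
    isolated_ms = sum(layer_latencies_ms)
    overhead_ms = stack_latency_ms - isolated_ms
    return overhead_ms, overhead_ms / stack_latency_ms


# Hypothetical measurements: three layers in isolation and as one stack.
overhead_ms, overhead_fraction = data_movement_overhead([1.0, 2.0, 1.5], 6.0)
print(overhead_ms, overhead_fraction)  # 1.5 0.25
```

A large positive overhead would point at tensor movement or buffering between layers, while a negative value would suggest that the stacked execution overlaps or fuses work that the isolated tests pay for separately.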
Level 3 - Applications Application coverage is essential to offer space for algorithmic innovation
which can achieve superior system-level performance and can only be validated when combined
with application results. As such, achieved accuracy becomes the bar for normalizing results, and
is independent of the neural network. We include the initially planned datasets and models (Table 6),
taken from existing benchmarks and complement these with models that have been explored to
⁴ If INT8 is not natively supported, it can be embedded inside FP16.
⁵ Power measurements might not always be available and might require specialized test infrastructures and testbeds.
work well with pruning and quantization optimizations. Furthermore, contributors are welcome to
provide different models for given machine learning tasks. See Appendix for complete list.
For inference, we include performance measurements for a single image. The error rate is the
reported test error over the whole test dataset. For training, we report throughput, training time
(latency), and power for a single image as well (including correctness tests). We also provide
measurements over longer training sequences with specific accuracy targets, for example, the
complete training time to 90% top5 error for ImageNet classification with a ResNet50. Finally, we offer
the option to optimize the training algorithm and network and record all possible data points in a
multi-dimensional graph; for those it is essential to include development time. Similar concepts are
being applied in MLPerf and Request [15, 57]. There is no single criterion that decides whether one
solution is optimal, as for different use cases, different figures of merit apply. All combinations yield
different trade-offs within the multidimensional design space. As such, we present all solutions
and measurements within multi-dimensional figures, whereby the pareto frontier represents the
best possible compromises (Fig. 1).
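Extracting the pareto frontier from a set of measured design points can be sketched as below for two figures of merit, accuracy and throughput, where higher is better for both. The function name, the tuple layout and every data point are hypothetical and serve only to illustrate the dominance test.

```python
def pareto_frontier(points):
    """Return the non-dominated (name, accuracy, throughput) design points.

    A point is dominated if another point is at least as good in both
    figures of merit and strictly better in at least one of them.
    """
    frontier = []
    for name, accuracy, throughput in points:
        dominated = any(
            other_acc >= accuracy
            and other_thr >= throughput
            and (other_acc > accuracy or other_thr > throughput)
            for _, other_acc, other_thr in points
        )
        if not dominated:
            frontier.append((name, accuracy, throughput))
    return sorted(frontier, key=lambda point: point[2])


# Hypothetical design points: (label, top-5 accuracy, images per second).
design_points = [
    ("fp32", 0.76, 100),
    ("int8", 0.75, 400),
    ("int8_alt", 0.70, 300),
    ("bin", 0.60, 2000),
]
for point in pareto_frontier(design_points):
    print(point)
```

In this made-up example `int8_alt` is dropped because `int8` is both more accurate and faster, while the three remaining points each represent a different compromise that may suit a different use case.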
Algorithmic Optimizations including Quantization This benchmarking proposal opens up
the opportunity for algorithmic innovations. We include in this pruning and topological changes,
while initially focusing on quantization and numerical representations. For this, we include, on
every level of the benchmark, several numerical representations, including FP32, FP16, INT8, BIN,
TERN, and allow for arbitrary choices to be included, for example Microsoft’s custom floating
point [40]. Training each neural network with different quantization approaches and different and
potentially esoteric numerical representations is highly time-intensive. Therefore, careful logging
of trained quantized models is a high priority for level 3.
Frameworks & Datasets Datasets are a key input to the benchmark and impact accuracy
results. We rely on open source datasets exclusively. Framework support is expected to be one
of the biggest challenges, since each framework is directly connected with a neural network and
datasets within an application context, and models are not necessarily portable. Therefore, we
need operational hardware backends for a diverse set of AI accelerators, which may or may not be
available. Furthermore, quantization is not necessarily mainstream in frameworks. It is not yet
clear to what extent cross-compilation tools such as TVM [83] can help, while exchange formats
such as ONNX [65] are still immature and lack adoption and, very importantly, full quantization support.
Training scripts exposing all hyperparameters, training initializations and so on must be fully
logged, as they can have significant impact on accuracy.
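One possible way to log such training settings is a small structured record that is serialized next to each trained model. The sketch below is an assumption for illustration: the class `TrainingRunRecord`, its field names and the example hyperparameter values are hypothetical and do not describe an existing QuTiBench logging format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TrainingRunRecord:
    """Hypothetical record of everything needed to repeat one training run."""
    model: str
    dataset: str
    numeric_format: str
    framework: str
    random_seed: int
    hyperparameters: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Sorted keys keep the output stable, which simplifies diffing runs.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Placeholder values, for illustration only.
record = TrainingRunRecord(
    model="ResNet50",
    dataset="ImageNet",
    numeric_format="INT8",
    framework="hypothetical-framework",
    random_seed=0,
    hyperparameters={"learning_rate": 0.1, "batch_size": 256, "epochs": 90},
)
print(record.to_json())
```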
Power and Energy To represent power and energy cost, we only report platform power measured
at the socket. While this is not necessarily accurate, there are strong reasons behind this
choice. First, the measurement needs to be fair; we therefore believe that subsystems, specifically
including memory, need to be taken into account. Second, more detailed current sampling
may be available on some platforms, but each platform comes with different interfaces, and may or
may not provide access to all power rails. While the accuracy of typical socket power meters is
around 10%, we found that these results remain representative of the systems. Furthermore, we
average the results over 10 measurements.
Another consideration is whether to consider power or energy per frame. We settled on using
absolute power consumption, since when multithreading or batching is applied, it is hard to derive
a representative number for energy, which would differ depending on whether the end application is
latency or throughput driven. Finally, idle power on these platforms can represent a significant
percentage of the overall power budget and would therefore cloud the observation. In particular,
one FPGA platform is an evaluation board with many peripherals, which is reflected in high idle
power (19.9 Watts) compared to the GPU (between 3.4 and 5.0 Watts depending on operating mode),
ACM Journal on Emerging Technologies in Computing Systems, Vol. 1, No. 1, Article 1. Publication date: September 2016.
while the additional dynamic power consumption is minimal and makes the FPGA the
more efficient platform overall despite the initial load.
Testbeds, Reproducibility, & Recorded Measurements In order to provide useful scientific
results, all experiments and measurements must be validated and reproducible. Specifically:
• All input data to the test suites must be openly accessible.
• Many platforms can be made available through virtualized compute environments, which is
adequate if the cost is not prohibitive. However, some platforms may not be available. Therefore,
an open testbed may be advisable and is considered as an extension to this benchmark.
• As the higher levels of benchmarks may require a long time to run and hardware may not be
available, we advocate recording of results, whereby each entry will be validated by a third party
such that results are guaranteed to be a) reproducible and b) correct.
Our colleagues in the ReQuEST Tournament effort [68] leverage ACM’s rigorous artifact evaluation
technology and the Collective Knowledge Workflow Framework [15] and do an outstanding job
addressing this. We aim to adopt the same principles.
Adaptability Machine Learning is currently a highly dynamic field; specific algorithms
may very quickly become outdated, and new models may emerge and take over rapidly. We plan to
adapt fast and add/retire models as machine learning science matures.
8 EXPERIMENTAL RESULTS & EVALUATION
We present measured results aimed at evaluating the defined benchmarking tests and figures of
merit to ensure that they accurately reflect a system’s capabilities. For test platforms, we used the
Nvidia TX2 GPU and the Xilinx ZCU104 FPGA. For both platforms, we carried out all levels of tests
on one specific Machine Learning task, ImageNet classification, for two different neural networks,
GoogleNetV1 and ResNet50. We use FP32, FP16 (supported by the GPU), and INT8 (supported by the FPGA)
as numerical representations, a form of algorithmic optimization. We run GPU platforms with a
spectrum of batch sizes and different operating modes (MaxN, MaxQ, MaxP), which are optimized
for different performance and power consumption targets⁶. For FPGAs, there is a spectrum of
implementations available. We exercise the Deephi DPU overlay, which uses threads instead of
batch sizes to achieve high system utilization, and therefore exercise a spectrum of thread counts.
For FPGAs we show the theoretical limits of the current implementation (which is clocked at
666MHz), as well as the datasheet peak performance at 750MHz. For GPUs, we use the theoretical
peak as dictated by the clock frequencies defined by the operating mode. Full experimental results
are provided in the appendix. We have currently only exercised inference to validate the
benchmark methodology. In the following, we evaluate each benchmarking level individually and
then provide a first critical review of these early results.
Level 0 Using values for hardware platforms and arithmetic intensity (AI), we created rooflines
for the target platforms and performance predictions for both networks⁷. Fig. 9 shows that both
NNs will be compute bound for INT8, FP16 and FP32. The arithmetic intensity should be higher
for larger batch sizes (a batch size of 1 is shown), but since both networks are already compute
bound, the performance prediction for larger batch sizes will be identical. The theoretical
performance prediction can be derived from this and is
summarized in Table 8. These numbers are used to compute efficiency for levels 1, 2 and 3.
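The roofline prediction used at level 0 follows the standard roofline model [88]: attainable performance is the minimum of the compute peak and the product of arithmetic intensity and memory bandwidth. The sketch below illustrates this in Python. The peak value of 0.667 TOPs is the TX2 FP32-MaxN entry of Table 8, whereas the memory bandwidth and arithmetic intensity values are hypothetical placeholders and not the values used to produce Fig. 9.

```python
def attainable_tops(peak_tops: float, bandwidth_gb_s: float, ai_ops_per_byte: float) -> float:
    """Roofline model: min(compute peak, arithmetic intensity * memory bandwidth)."""
    # ops/byte * GB/s = Gops/s; divide by 1000 to express the result in TOPs.
    memory_bound_tops = ai_ops_per_byte * bandwidth_gb_s / 1000.0
    return min(peak_tops, memory_bound_tops)


def is_compute_bound(peak_tops: float, bandwidth_gb_s: float, ai_ops_per_byte: float) -> bool:
    """True if the memory roof lies at or above the compute peak for this intensity."""
    return ai_ops_per_byte * bandwidth_gb_s / 1000.0 >= peak_tops


PEAK_TOPS_TX2_FP32_MAXN = 0.667  # from Table 8
ASSUMED_BANDWIDTH_GB_S = 50.0    # hypothetical placeholder, not a measured value
ASSUMED_AI_HIGH = 100.0          # hypothetical arithmetic intensity [ops/byte]
ASSUMED_AI_LOW = 5.0             # hypothetical arithmetic intensity [ops/byte]

print(attainable_tops(PEAK_TOPS_TX2_FP32_MAXN, ASSUMED_BANDWIDTH_GB_S, ASSUMED_AI_HIGH))
print(attainable_tops(PEAK_TOPS_TX2_FP32_MAXN, ASSUMED_BANDWIDTH_GB_S, ASSUMED_AI_LOW))
```

Once a workload is compute bound, raising its arithmetic intensity further (for example through larger batch sizes) no longer changes the prediction, which is why a single value per configuration suffices in Table 8.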
Level 1 and Level 2 We restrict the evaluation of level 1 and level 2 to ResNet50, as this is sufficient
to make the key observations. The ResNet50 topology is relatively regular in structure, consisting of a top
convolutional layer with pooling combination, 16 residual blocks, and a fully connected layer. Each residual
block is composed of thresholding layers, convolutions, and elementwise additions. As the convolutions
⁶ We also tested MaxP; however, it never achieved optimal values for any figure of merit.
⁷ We assumed that all weights are kept off-chip and all intermediate results are on-chip. This assumption will be revised in
the future.
Fig. 9. TX2 and ZCU104 Level 0 Rooflines with GoogleNet and ResNet50
Table 8. Level 0 - Performance Predictions [TOPs] (ResNet50 = GoogleNetV1)
Platform | Configuration | Predicted performance [TOPs]
TX2 | FP32-MaxN | 0.667
TX2 | FP16-MaxN | 1.333
TX2 | FP32-MaxQ | 0.437
TX2 | FP16-MaxQ | 0.874
ZCU104 | INT8-666MHz | 4.604
ZCU104 | INT8-750MHz | 5.357
account for the majority of the compute, we focus mainly on the convolutional layers of the network. Since
the platform-specific frameworks perform layer fusion as a network optimization, level 1 represents the smallest
possible fused layer structure. Table 16 shows level 1 and level 2 latency results for one TX2 hardware
configuration (MaxN, FP16) with different batch sizes as well as level 1 results for ZCU104 with different
thread numbers. We restrict level 1 to convolutions of different sizes and select the residual layers res2a, res3a,
res4a and res5a to get an overview of the whole network. Level 2 results are provided for all residual layers
of the network. Due to limited support by the hardware-specific framework, it is not possible to benchmark
level 2 on FPGA platforms. We observe a large discrepancy in execution time for different residual stacks, even
though the compute requirements within each are similar. It is likely that data movement varies significantly
depending on the incoming and outgoing tensor dimensions. Therefore, it is important to include as many
layer types as possible in level 1 and 2 testing. We would expect this to be even more pronounced for other topologies,
as they may be less balanced than ResNet50. We also observe a large discrepancy between the performance of
different convolutional layers (Table 16, level 1). Unlike the residual blocks, this is anticipated, as they come
with very different compute requirements. Furthermore, the differences are more pronounced with larger
batch sizes. It is therefore our plan to include the full spectrum of convolutional layers within level 1.
Fig. 10. Performance comparison of level 0, level 1, level 2 and level 3 for TX2 (MaxN, FP16 configuration)
Multi-Tiered Concept Fig. 10 depicts the performance measurements of the various levels. We restricted
the visualized experiments to the MaxN, FP16 configuration on TX2, and a subset of microbenchmarks on level
1 and level 2, for a spectrum of batch sizes. Note that the theoretical peak performance is significantly
higher than measured performance, and only within reach of individual layers that fit the hardware architecture
well. The system (level 3) achieves from 41.1 to 60.7% efficiency, where larger batch sizes achieve higher
performance. Level 2 results are on average more pessimistic than achieved performance (level 3) and a fairly
good approximation, within 16% of the achievable level 3 system performance, but far off level 3 compute
performance. Level 1 results usually show better performance than the level 2 results. This makes intuitive
sense, as only a limited number of bottlenecks are exposed during execution of the benchmark. In particular,
less weight storage is required, which is most likely contained on-chip, thereby alleviating any potential
memory bottlenecks. The averaged level 1 results also provide a good estimate of the
possible compute performance on level 3. As already mentioned, for level 1 and 2 results, we observe large
variations in performance ranges for different dimensions of convolutions. The insight is that to provide a good
projection from level 1 or level 2 to level 3, we need to provide full coverage of convolutional layers. Another
challenge is that many backend tools perform automated layer fusion, such as merging batch normalization
with convolutions, which makes testing in isolation inaccurate.
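The efficiency figures quoted above relate measured performance to the level 0 prediction. As a worked example, the sketch below divides a measured value by the Table 8 prediction. It assumes that the 809 GOPs system-level GPU result reported for level 3 was obtained in the FP16-MaxN configuration (predicted at 1.333 TOPs); under that assumption the result reproduces the upper end of the 41.1 to 60.7% range. The function name is hypothetical.

```python
def efficiency_percent(measured_gops: float, predicted_tops: float) -> float:
    """Measured performance as a percentage of the level 0 prediction."""
    return 100.0 * measured_gops / (predicted_tops * 1000.0)


PREDICTED_TOPS_TX2_FP16_MAXN = 1.333  # Table 8
MEASURED_GOPS_TX2_SYSTEM = 809.0      # level 3 system-level GPU result; configuration assumed

gpu_system_efficiency = efficiency_percent(MEASURED_GOPS_TX2_SYSTEM, PREDICTED_TOPS_TX2_FP16_MAXN)
print(round(gpu_system_efficiency, 1))
```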
Level 3 - Full system level performance evaluation The aim of level 3 is to explore optimal solutions
within the design space regarding application performance independent of model topology and algorithmic
optimizations. We include results for both platforms (TX2, ZCU104), for INT8, FP16, FP32, across the spectrum
of batch sizes and thread numbers for both GoogleNetV1 and ResNet50. See plots of Pareto points (Fig. 11)
and results in the Appendix. We made the following key observations: Firstly, the ZCU104 FPGA provides
the highest system level (948GOPs) and compute level performance (1067GOPs) compared to the GPU
platform (809GOPs and 1011GOPs respectively) for both GoogleNetV1 and ResNet50 (Fig. 11, top left). For
GoogleNetV1, the FPGA provides better performance and accuracy. For ResNet50, the FPGA provides better
performance but lower accuracy compared to the GPU platform. Further, the GoogleNetV1 topology provides
more than 2x the performance compared to ResNet50, due to the significantly lower compute per frame
required as part of the neural network topology, while ResNet50 provides the best accuracy across the platforms.
The accuracy difference is 1.59% for the FPGA and 4.27% for the GPU (Fig. 11 top left). Additionally, the
ZCU104 outperforms the TX2 in regards to latency by orders of magnitude and across topologies unless GPUs
operate with small batch sizes, where the performance efficiency drops. GPU latency varies from a minimum
of 8ms to a maximum of 1838.5ms for batch=128. FPGA latency varies from 9.65ms to 65ms. Finally, the GPU
platform is more power efficient, which can be attributed to the GPU platform being more optimized, whereas
the FPGA platform is more general purpose. This is apparent when considering idle power (Sec. 7, 5 Watts for
TX2 and 19.9 Watts for ZCU104).
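The trade-off between batch size, latency and throughput on the GPU can be illustrated with a short calculation. The sketch below assumes that the reported 1838.5ms latency refers to the completion of one full batch of 128 frames; the function name is hypothetical.

```python
def frames_per_second(batch_size: int, batch_latency_ms: float) -> float:
    """Throughput in frames/s, assuming the latency covers the whole batch."""
    if batch_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return batch_size / (batch_latency_ms / 1000.0)


# Values from the text above; the whole-batch interpretation is an assumption.
gpu_fps_batch128 = frames_per_second(128, 1838.5)
print(round(gpu_fps_batch128, 1))
```

Under this assumption, every frame in the batch waits for the full batch latency, which illustrates why a throughput-optimized configuration can be unsuitable for a latency-driven application.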
Fig. 11. Level 3: System Performance Evaluation
In the evaluation above we consider full system-level performance (Fig. 11), including initial data movement,
as well as compute-only performance. Depending on the end application, it may be important to factor out the
initial data movement from the overall time, as the inference engine might be included in a larger compute
data path, where the inputs are streamed directly from on-chip resources. However, when analyzing the
experimental data points for both GPU and FPGA platforms, it appears that the difference is very regular in
nature, and it is not obvious that a distinction within the benchmark is necessary (see Appendix) as long as it
is clearly indicated what is measured. The Pareto curves are an effective means to compare different topologies
and different platforms, leaving space for algorithmic optimizations. We plan to leverage 3- or 4-dimensional
graphs to additionally explore relationships between latency and system-level performance.
9 CONCLUSION & FUTURE WORK
Neural networks are fast gaining popularity across an increasing number of applications. However, they are
accompanied by challenging compute and memory requirements, as shown in Section 3, which seriously
challenge a semiconductor industry that is already facing performance scalability issues. This is of particular
importance for embedded computing environments, where real estate, power and available compute and
memory resources are at a premium. As such, the industry is turning to both algorithmic innovation in the form
of new topologies, quantization, and pruning strategies, as well as architectural innovation with more and
more heterogeneous devices and the emergence of specialized DPUs. To facilitate better insights into the
increasingly complex space of end solutions which involve hardware-software codesign, and to evaluate new
concepts in computer architecture, novel NN system benchmarks are needed.
QuTiBench is a proposed novel benchmarking methodology to help drive hardware innovation and provide
insights for system level designers in understanding possible performance-accuracy trade-offs for newly
devised and fine-tuned algorithms combined with highly customized accelerators. A key contribution is that
we provide concepts that allow benchmarking of highly optimized algorithms by tying hardware characteristics
back to the end application, thereby providing the needed algorithmic freedom. Another key differentiator in
this benchmarking concept is the introduction of the multi-tiered approach, including a theoretical level and
consideration of a spectrum of numerical representations at all levels. As such, the benchmark can provide
insights at various abstraction levels. This brings two key advantages: a) it provides a spectrum of insights, and
users can choose from instant but perhaps crude results to elaborate results which require longer evaluation;
and b) the multi-tiered approach provides insights into system bottlenecks. For example, are the recurrent or
the fully connected layers the challenge? Or is the bottleneck the data movement in between? We present
initial experimental results on two types of neural network topologies aimed at image classification tasks, and
exercise them on two different types of hardware platforms for all levels of the proposed benchmarks. We
present some of the lessons learned while exercising the benchmarks, challenges encountered, and analyze
the quality of the results in regards to real system performance at the various levels.
is eort is just beginning. Future work will focus on rening details and running broader experimentation.
We plan to expand on level 0 results rst and build out test suites targeting FPGAs, GPUs, CPUs and DPUs
within the embedded space. Many concepts regarding reproducibility need to be rened, as well as automated
soware testing infrastructure as proposed by deep500.org. Also collaboration with larger eorts such as
MLPerf will be benecial to gain traction. We invite the research community to contribute to TiBench.
ACKNOWLEDGMENTS
The authors would like to thank the FINN team at Xilinx Research, Prof. Ce Zhang at ETH Zurich, and the
Deephi team for insights and support. Miriam Leeser is supported in part by the National Science Foundation
under Grant No. 1717213.
REFERENCES
[1] Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2016. Fathom: Reference workloads for
modern deep learning methods. In IISWC2016. IEEE, 1–10.
[2] Jorge Albericio, Patrick Judd, et al. 2016. Cnvlutin: Ineffectual-neuron-free deep neural network computing. Computer
Architecture News 44, 3 (2016), 1–13.
[3] Hande Alemdar, Vincent Leroy, Adrien Prost-Boucle, and Frédéric Pétrot. 2017. Ternary neural networks for resource-efficient
AI applications. In Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2547–2554.
[4] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, et al. 2016. Deep speech 2: End-to-end speech
recognition in English and Mandarin. In ICML2016. 173–182.
[5] Kota Ando, Kodai Ueyoshi, et al. 2017. BRein memory: A 13-layer 4.2 K neuron/0.8 M synapse binary/ternary
reconfigurable in-memory deep neural network accelerator in 65 nm CMOS. In VLSI Circuits. IEEE, C24–C25.
[6] ARM 2018. ARM’s Project Trillium. https://www.arm.com/products/processors/machine-learning
[7] Lucian Armasu. 2016. Deep Learning On A Stick: Movidius’ ’Fathom’ Neural Compute Stick. Retrieved June 14, 2018
from https://www.tomshardware.com/news/movidius-fathom-neural-compute-stick,31694.html
[8] Michaela Blott, Thomas B Preußer, Nicholas Fraser, et al. 2017. Scaling Neural Network Performance through
Customized Hardware Architectures on Reconfigurable Logic. In ICCD. IEEE, 419–422.
[9] Greg Brockman et al. 2016. Openai gym. arXiv preprint arXiv:1606.01540 (2016).
[10] Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. 2017. Deep learning with low precision by half-wave
Gaussian quantization. arXiv preprint arXiv:1702.00953 (2017).
[11] Tianshi Chen, Yunji Chen, et al. 2012. BenchNN: On the broad potential application scope of hardware neural network
accelerators. In Workload Characterization (IISWC). IEEE, 36–45.
[12] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, et al. 2014. Diannao: A small-footprint high-
throughput accelerator for ubiquitous machine-learning. In ACM Sigplan Notices, Vol. 49. ACM, 269–284.
[13] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. 2017. Eyeriss: An energy-efficient reconfigurable
accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127–138.
[14] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, et al. 2018. Serving DNNs in Real Time at
Datacenter Scale with Project Brainwave. IEEE Micro 38, 2 (2018), 8–20.
[15] ckframework 2017. Collective Knowledge. Retrieved June 14, 2018 from http://cknowledge.org/
[16] Cody Coleman, Daniel Kang, et al. 2018. Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance
Benchmark. arXiv preprint arXiv:1806.01427 (2018).
[17] Mahieu Courbariaux, Itay Hubara, Daniel Soudry, et al. 2016. Binarized neural networks: Training deep neural
networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830 (2016).
[18] Bala´zs Csana´d Csa´ji. 2001. Approximation with articial neural networks. Etvs Lornd University, Hungary 24 (2001),
48.
[19] Anthony Danalis, Gabriel Marin, et al. 2010. e scalable heterogeneous computing (SHOC) benchmark suite. In
General-Purpose Computation on Graphics Processing Units. ACM, 63–74.
[20] DeepBench 2018. DeepBench. Retrieved June 13, 2018 from hps://svail.github.io/DeepBench/
[21] Yan Duan, Xi Chen, Rein Houthoo, John Schulman, and Pieter Abbeel. 2016. Benchmarking deep reinforcement
learning for continuous control. In International Conference on Machine Learning. 1329–1338.
[22] Luke Durant, Olivier Giroux, Mark Harris, and Nick Stam. 2017. Inside Volta: e World’s Most Advanced Data Center
GPU. Retrieved June 12, 2018 from hps://devblogs.nvidia.com/inside-volta/
[23] Benoit Jacob et al. 2017. gemmlowp: a small self-contained low-precision GEMM library. Retrieved June 14, 2018
from hps://github.com/google/gemmlowp/
[24] Exxactcorp 2017. Taking A Deeper Look at the AMD Radeon Instinct GPUs for Deep Learning. Retrieved June 12,
2018 from hp://blog.exxactcorp.com/taking-deeper-look-amd-radeon-instinct-gpus-deep-learning/
[25] Eyeriss 2018. Benchmarking DNN Processors. hp://eyeriss.mit.edu/benchmarking.html
[26] Dustin Franklin. 2017. NVIDIA Jetson TX2 Delivers Twice the Intelligence to the Edge. Retrieved June 14, 2018 from
hps://devblogs.nvidia.com/jetson-tx2-delivers-twice-intelligence-edge/
[27] Nicholas J Fraser, Yaman Umuroglu, Giulio Gambardella, Michaela Blo, Philip Leong, Magnus Jahre, and Kees Vissers.
2017. Scaling binarized neural networks on recongurable logic. In PARMA DITAM2017. 25–30.
[28] Yijia Gu, omas Wahl, Mahsa Bayati, and Miriam Leeser. 2015. Behavioral non-portability in scientic numeric
computing. In European Conference on Parallel Processing. Springer, 558–569.
[29] Kaiyuan Guo, Song Han, Song Yao, Yu Wang, Yuan Xie, and Huazhong Yang. 2017. Soware-Hardware Codesign for
Ecient Neural Network Acceleration. IEEE Micro 37, 2 (2017), 18–25.
[30] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited
numerical precision. In ICML 2015. 1737–1746.
[31] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et al.
2017. Ese: Ecient speech recognition engine with sparse lstm on fpga. In FPGA 2017. ACM, 75–84.
[32] Song Han, Xingyu Liu, et al. 2016. EIE: ecient inference engine on compressed deep neural network. In International
Symposium on Computer Architecture (ISCA). IEEE, 243–254.
[33] Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning,
trained quantization and human coding. arXiv preprint arXiv:1510.00149 (2015).
[34] Devindra Hardawar. 2018. AMD’s Radeon Vega GPU is headed everywhere, even to machine learning. Retrieved
June 14, 2018 from hps://www.engadget.com/2018/01/08/amd-radeon-vega-mobile/
[35] Johann Hauswald, Yiping Kang, et al. 2015. DjiNN and Tonic: DNN as a service and its implications for future
warehouse scale computers. 43, 3 (2015), 27–40.
[36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In
European conference on computer vision. Springer, 630–645.
[37] Forrest Iandola, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. 2014. Densenet:
Implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869 (2014).
[38] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. 2016.
Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360
(2016).
[39] IBM 2018. Blog: Unlocking the Promise of Approximate Computing for On-Chip AI Accelerator. https://www.ibm.
com/blogs/research/2018/06/approximate-computing-ai-acceleration/
[40] Intel 2016. Efficient Implementation of Neural Network Systems Built on FPGAs and Programmed with
OpenCL. https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/solution-sheets/efficient_
neural_networks.pdf
[41] Norman P Jouppi, Cliff Young, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In
International Symposium on Computer Architecture. ACM, 1–12.
[42] Patrick Judd, Jorge Albericio, Tayler Hetherington, et al. 2016. Stripes: Bit-serial deep neural network computing. In
MICRO 2016. 1–12.
[43] Kalray 2018. Whitepaper: Deep Learning on MPPA Manycore Processor. Retrieved June 14, 2018 from http:
//www.kalrayinc.com/resources/
[44] Minje Kim and Paris Smaragdis. 2016. Bitwise neural networks. arXiv preprint arXiv:1601.06071 (2016).
[45] Urs Köster, Tristan Webb, et al. 2017. Flexpoint: An adaptive numerical format for efficient training of deep neural
networks. In NIPS 2017. 1742–1752.
[46] Lattice 2018. Binarized Neural Network (BNN) Accelerator IP. Retrieved June 17, 2018 from http://www.latticesemi.
com/Products/DesignSoftwareAndIP/IntellectualProperty/IPCore/IPCores04/BNN
[47] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document
recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[48] Yu Liu, Hantian Zhang, Luyuan Zeng, Wentao Wu, and Ce Zhang. 2018. MLBench: Benchmarking Machine Learning
Services Against Human Experts. Retrieved June 12, 2018 from https://www.microsoft.com/en-us/research/publication/
mlbench-benchmarking-machine-learning-services-human-experts/
[49] Liang Lu. 2017. Toward Computation and Memory Efficient Neural Network Acoustic Models with Binary Weights
and Activations. Vol. arXiv:1706.09453.
[50] Chao Ma, Wei An, Yinjie Lei, and Yulan Guo. 2017. BV-CNNs: Binary Volumetric Convolutional Networks for 3D
Object Recognition. In Proceedings of the British Machine Vision Conference 2017, BMVC.
[51] A Cristiano I Malossi, Michael Schaffner, et al. 2018. The transprecision computing paradigm: Concept, design, and
applications. In DATE. IEEE, 1105–1110.
[52] John D. McCalpin. 2018. STREAM: Sustainable Memory Bandwidth in High Performance Computers. Retrieved June
14, 2018 from https://www.cs.virginia.edu/stream/
[53] Linda Doyle, Michaela Blott, and Miriam Leeser. 2018. QuTiBench: Benchmarking Neural Networks on Heterogeneous
Hardware. Retrieved June 14, 2018 from https://github.com/michaelablott/QuTiBench
[54] Paulius Micikevicius, Sharan Narang, et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017).
[55] Asit Mishra and Debbie Marr. 2017. Apprentice: Using knowledge distillation techniques to improve low-precision
network accuracy. arXiv preprint arXiv:1711.05852 (2017).
[56] Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. 2017. WRPN: Wide Reduced-Precision Networks.
arXiv preprint arXiv:1709.01134 (2017).
[57] MLPerf 2018. MLPerf: A broad ML benchmark suite for measuring performance of ML software frameworks, ML
hardware accelerators, and ML cloud platforms. Retrieved June 14, 2018 from https://mlperf.org/
[58] Bert Moons, Daniel Bankman, Lita Yang, et al. 2018. BinarEye: An always-on energy-accuracy-scalable binary CNN
processor with all memory on chip in 28nm CMOS. In Custom Integrated Circuits Conference (CICC). IEEE.
[59] Duncan JM Moss, Srivatsan Krishnan, et al. 2018. A Customizable Matrix Multiplication Framework for the Intel
HARPv2 Xeon+ FPGA Platform: A Deep Learning Case Study. In FPGA. ACM, 107–116.
[60] Movidius 2017. Product Brief: MyriadX: Enhanced Visual Intelligence at the Network Edge. Retrieved June 14, 2018
from https://uploads.movidius.com/1503874473-MyriadXVPU_ProductBriefaug25.pdf
[61] John Murphy. 2017. Deep Learning Benchmarks of NVIDIA Tesla P100 PCIe,
Tesla K80, and Tesla M40 GPUs. https://www.microway.com/hpc-tech-tips/
deep-learning-benchmarks-nvidia-tesla-p100-16gb-pcie-tesla-k80-tesla-m40-gpus/
[62] NNEF 2018. NNEF. Retrieved June 14, 2018 from https://www.khronos.org/nnef
[63] Eriko Nurvitadhi et al. 2017. Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?. In
FPGA.
[64] NVIDIA. 2016. Tesla P40 Inferencing Accelerator. Retrieved June 12, 2018 from http://images.nvidia.com/content/pdf/
tesla/184427-Tesla-P40-Datasheet-NV-Final-Letter-Web.pdf
[65] ONNX 2018. Open Neural Network Exchange Format. Retrieved June 13, 2018 from https://onnx.ai
[66] Angshuman Parashar, Minsoo Rhu, et al. 2017. Scnn: An accelerator for compressed-sparse convolutional neural
networks. In International Symposium on Computer Architecture (ISCA). IEEE, 27–40.
[67] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. Xnor-net: Imagenet classification
using binary convolutional neural networks. In European Conference on Computer Vision. Springer, 525–542.
[68] ReQuEST 2018. ReQuEST ’18: Proceedings of the 1st on Reproducible Quality-Efficient Systems Tournament on Co-designing
Pareto-efficient Deep Learning. ACM, New York, NY, USA.
[69] Olga Russakovsky, Jia Deng, et al. 2015. ImageNet Large Scale Visual Recognition Challenge. IJCV 115, 3 (2015),
211–252. https://doi.org/10.1007/s11263-015-0816-y
[70] Vladimir Rybalkin et al. 2018. FINN-L: Library Extensions and Design Trade-off Analysis for Variable Precision LSTM
Networks on FPGA. arXiv preprint https://arxiv.org/abs/1807.04093 (2018).
[71] K Sato et al. 2017. An In-Depth Look at Googles First Tensor Processing Unit (TPU). Retrieved June 12, 2018 from
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
[72] SPEC 2018. Standard Performance Evaluation Corporation. Retrieved June 14, 2018 from http://spec.org
[73] Wenyun Sun, Haitao Zhao, and Zhong Jin. 2017. An Efficient Unconstrained Facial Expression Recognition Algorithm
Based on Stack-Binarized Auto-Encoders and Binarized Neural Networks. Neurocomputing 267 (2017), 385–395.
[74] Wonyong Sung, Sungho Shin, and Kyuyeon Hwang. 2015. Resiliency of deep neural networks under quantization.
arXiv preprint arXiv:1511.06488 (2015).
[75] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. 2017. Efficient processing of deep neural networks: A
tutorial and survey. Proc. IEEE 105, 12 (2017), 2295–2329.
[76] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In CVPR 2015. 1–9.
[77] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception
architecture for computer vision. In CVPR 2016. 2818–2826.
[78] Giuseppe Tagliavini, Stefan Mach, Davide Rossi, Andrea Marongiu, and Luca Benini. 2018. A transprecision floating-point
platform for ultra-low power computing. In DATE 2018. IEEE, 1051–1056.
[79] Jin-Hua Tao, Zi-Dong Du, Qi Guo, et al. 2018. BenchIP: Benchmarking Intelligence Processors. Journal of Computer
Science and Technology 33, 1 (2018), 1–23.
[80] Paul Teich. 2018. Tearing Apart Google’s TPU 3.0 AI Coprocessor. Retrieved June 12, 2018 from https://www.
nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/
[81] Shelby Thomas, Chetan Gohkale, Enrico Tanuwidjaja, et al. 2014. CortexSuite: A synthetic brain benchmark suite. In
Workload Characterization (IISWC). IEEE, 76–79.
[82] TPC 2018. Active TPC Benchmarks. http://www.tpc.org/information/benchmarks.asp
[83] TVM 2018. End to End Deep Learning Compiler Stack. Retrieved June 13, 2018 from https://tvm.ai
[84] Yaman Umuroglu, Nicholas J Fraser, et al. 2017. Finn: A framework for fast, scalable binarized neural network inference.
In FPGA. ACM, 65–74.
[85] Yaman Umuroglu and Magnus Jahre. 2017. Streamlined Deployment for Quantized Neural Networks. arXiv preprint
arXiv:1709.04060 (2017).
[86] Yaman Umuroglu, Lahiru Rasnayake, and Magnus Sjalander. 2018. BISMO: A Scalable Bit-Serial Matrix Multiplication
Overlay for Reconfigurable Computing. arXiv preprint arXiv:1806.08862 (2018).
[87] Wikipedia. 2018. AMD RX Vega series. https://en.wikipedia.org/wiki/AMD_RX_Vega_series
[88] S. Williams, A. Waterman, and D. Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore
Architectures. Commun. ACM 52, 4 (2009), 65–76.
[89] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. 2018. Training and inference with integers in deep neural networks.
arXiv preprint arXiv:1802.04680 (2018).
[90] Yonghui Wu, Mike Schuster, Zhifeng Chen, et al. 2016. Google’s neural machine translation system: Bridging the gap
between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[91] Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146 (2016).
[92] Shijin Zhang, Zidong Du, et al. 2016. Cambricon-x: An accelerator for sparse neural networks. In International
Symposium on Microarchitecture. IEEE Press, 20.
ACM Journal on Emerging Technologies in Computing Systems, Vol. 1, No. 1, Article 1. Publication date: September 2016.
[93] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016. DoReFa-Net: Training low
bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016).
[94] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. 2016. Trained ternary quantization. arXiv preprint
arXiv:1612.01064 (2016).
[95] Hongyu Zhu, Mohamed Akrout, Bojian Zheng, et al. 2018. TBD: Benchmarking and Analyzing Deep Neural Network
Training. arXiv preprint arXiv:1803.06905 (2018).
[96] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. 2017. Towards Effective Low-bitwidth
Convolutional Neural Networks. In other words 2 (2017), 2.
A APPENDIX: TABLE OF RESULTS
Table 9. Planned Applications, Datasets and Models
Learning Technique Application QuTiBench
Dataset Model
Supervised Vision Image Classification ImageNet, MNIST ResNet50, MobileNet(V1), GoogleNet, MLP
Vision Object Detection Pascal VOC SSD-ResNet34, YoloV2
Vision Semantic Segmentation Pascal VOC Mask R-CNN, SSD-MobileNet
NLP Machine Translation WMT’14 English-to-French&German GNMT [90]
NLP Speech Recognition Librispeech DeepSpeech2
NLP Sentiment Analysis SST, IMDB, SemEval2018 Multiplicative LSTM
NLP Language Modeling bAbI Memory Network
Recommendation Movies Movielens 20M NCF
Unsupervised Vision Feature Extraction MNIST autoencoder
Vision Adversarial Learning ImageNet WGAN
Deep Reinforcement Learning Game Go Go MiniGo
Atari ALE Atari ALE DeepQ
Table 10. Level 0 - Hardware Platforms & Neural Network Model
Hardware Platform datatype Figures of Merit (theo.) Model Figures of Merit (theo.)
[TOPs] [GBps] [Watts] [$] [GOP] Size [ME] AI [OP:Byte]
Nvidia Jetson TX2 MaxN FP32 0.67 59.7 NA 469 ResNet50 (b=1, INT8) 7.72 25.50 303
Nvidia Jetson TX2 MaxP FP32 0.57 59.7 15.0 469 ResNet50 (b=8, INT8) 7.72 25.50 2422
Nvidia Jetson TX2 MaxQ FP32 0.44 59.7 7.5 469 ResNet50 (b=1, FP16) 7.72 25.50 151
Nvidia Jetson TX2 MaxN FP16 1.33 59.7 NA 469 ResNet50 (b=8, FP16) 7.72 25.50 1211
Nvidia Jetson TX2 MaxP FP16 1.15 59.7 15.0 469 GoogleNetV1 (b=1, INT8) 3.13 5.98 523
Nvidia Jetson TX2 MaxQ FP16 0.87 59.7 7.5 469 GoogleNetV1 (b=8, INT8) 3.13 5.98 4188
Xilinx ZCU104 DPU 666MHz INT8 4.60 19.2 NA 895 GoogleNetV1 (b=1, FP16) 3.13 5.98 262
Xilinx ZCU104 DPU 775MHz INT8 5.36 19.2 NA 895 GoogleNetV1 (b=8, FP16) 3.13 5.98 2094
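The arithmetic intensity (AI) column of Table 10 can be reproduced from the [GOP] and Size [ME] columns. The Python sketch below assumes that ME denotes millions of elements, that AI counts weight traffic only, that weights are fetched once per batch, and that INT8 and FP16 occupy 1 and 2 bytes per element. These assumptions are inferred from the table values, and the function names are illustrative. The second helper applies the roofline bound of [88] to a platform row.

```python
def arithmetic_intensity(gop, size_me, batch, bytes_per_element):
    """Operations per byte of weights, assuming weights are read once per batch."""
    total_ops = gop * 1e9 * batch
    weight_bytes = size_me * 1e6 * bytes_per_element
    return total_ops / weight_bytes


def roofline_gops(peak_tops, bandwidth_gbps, ai):
    """Attainable GOPs: the lower of the compute roof and the memory roof."""
    compute_roof = peak_tops * 1e3       # TOPs -> GOPs
    memory_roof = bandwidth_gbps * ai    # GBps * OP/Byte -> GOPs
    return min(compute_roof, memory_roof)


# ResNet50 row of Table 10: 7.72 GOP, 25.50 ME
resnet50_int8_b1 = arithmetic_intensity(7.72, 25.50, batch=1, bytes_per_element=1)
resnet50_fp16_b8 = arithmetic_intensity(7.72, 25.50, batch=8, bytes_per_element=2)
print(round(resnet50_int8_b1), round(resnet50_fp16_b8))

# TX2 MaxN FP16 row of Table 10: 1.33 TOPs peak, 59.7 GBps
resnet50_fp16_b1 = arithmetic_intensity(7.72, 25.50, batch=1, bytes_per_element=2)
print(roofline_gops(1.33, 59.7, resnet50_fp16_b1))
```

Under these assumptions the ResNet50 entries round to the tabulated 303 and 1211. The GoogleNetV1 entries agree to within one unit, presumably because the tabulated GOP and size values are themselves rounded.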
Fig. 12. System versus Compute Performance
Table 11. Level 1 - ZCU104 Inference Results ResNet50 Individual Convolutional Layers
ZCU104 Network Parameters Figures of Merit
thread=1 thread=8
Layer [MOP] in dim in ch filter stride out ch Latency Throughput (E) Latency Throughput (E)
[ms] [GOPs] [ms] [GOPs]
res2a branch2a 25.7 56 64 1 1 64 0.060 428.05 (0.09) 0.082 630.21 (0.14)
res2a branch2b 231.2 56 64 3 1 64 0.190 1216.84 (0.26) 0.190 2428.70 (0.53)
res2a branch2c 102.8 56 64 1 1 256 0.220 467.23 (0.10) 0.258 798.45 (0.17)
res2a branch1 102.8 56 64 1 1 256 0.430 239.08 (0.05) 0.464 443.34 (0.10)
res2b branch2a 102.8 56 256 1 1 64 0.142 725.63 (0.16) 0.196 1049.57 (0.23)
res2b branch2b 231.2 56 64 3 1 64 0.190 1216.84 (0.26) 0.190 2428.19 (0.53)
res2b branch2c 102.8 56 64 1 1 256 0.429 239.64 (0.05) 0.463 443.98 (0.10)
res2c branch2a 102.8 56 256 1 1 64 0.140 734.13 (0.16) 0.193 1063.14 (0.23)
res2c branch2b 231.2 56 64 3 1 64 0.190 1216.84 (0.26) 0.190 2428.32 (0.53)
res2c branch2c 102.8 56 64 1 1 256 0.435 236.20 (0.05) 0.462 444.69 (0.10)
res3a branch2a 51.4 28 256 1 2 128 0.090 571.05 (0.12) 0.128 800.12 (0.17)
res3a branch2b 231.2 28 128 3 1 128 0.210 1100.95 (0.24) 0.214 2159.44 (0.47)
res3a branch2c 102.8 28 128 1 1 512 0.210 489.52 (0.11) 0.247 832.79 (0.18)
res3a branch1 205.5 28 256 1 2 512 0.330 622.71 (0.14) 0.390 1052.87 (0.23)
res3b branch2a 102.8 28 512 1 1 128 0.120 856.45 (0.19) 0.148 1391.07 (0.30)
res3b branch2b 231.2 28 128 3 1 128 0.210 1100.95 (0.24) 0.214 2165.20 (0.47)
res3b branch2c 102.8 28 128 1 1 512 0.320 321.24 (0.07) 0.353 582.88 (0.13)
res3c branch2a 102.8 28 512 1 1 128 0.120 856.60 (0.19) 0.151 1361.41 (0.30)
res3c branch2b 231.2 28 128 3 1 128 0.210 1100.95 (0.24) 0.215 2154.81 (0.47)
res3c branch2c 102.8 28 128 1 1 512 0.303 339.02 (0.07) 0.354 580.14 (0.13)
res3d branch2a 102.8 28 512 1 1 128 0.120 856.52 (0.19) 0.149 1383.86 (0.30)
res3d branch2b 231.2 28 128 3 1 128 0.210 1100.95 (0.24) 0.214 2165.50 (0.47)
res3d branch2c 102.8 28 128 1 1 512 0.301 341.20 (0.07) 0.353 582.72 (0.13)
res4a branch2a 51.4 14 512 1 2 256 0.120 428.80 (0.09) 0.133 774.21 (0.17)
res4a branch2b 231.2 14 256 3 1 256 0.210 1100.95 (0.24) 0.230 2011.48 (0.44)
res4a branch2c 102.8 14 256 1 1 1024 0.290 354.46 (0.08) 0.379 541.92 (0.12)
res4a branch1 205.5 14 512 1 2 1024 0.430 477.87 (0.10) 0.500 821.34 (0.18)
res4b branch2a 102.8 14 1024 1 1 256 0.130 790.71 (0.17) 0.162 1271.41 (0.28)
res4b branch2b 231.2 14 256 3 1 256 0.210 1100.95 (0.24) 0.229 2015.61 (0.44)
res4b branch2c 102.8 14 256 1 1 1024 0.350 293.69 (0.06) 0.436 471.20 (0.10)
res4c branch2a 102.8 14 1024 1 1 256 0.130 790.71 (0.17) 0.163 1263.60 (0.27)
res4c branch2b 231.2 14 256 3 1 256 0.210 1100.95 (0.24) 0.231 2002.86 (0.44)
res4c branch2c 102.8 14 256 1 1 1024 0.360 285.52 (0.06) 0.438 469.22 (0.10)
res4d branch2a 102.8 14 1024 1 1 256 0.130 790.65 (0.17) 0.164 1251.14 (0.27)
res4d branch2b 231.2 14 256 3 1 256 0.210 1100.95 (0.24) 0.229 2019.57 (0.44)
res4d branch2c 102.8 14 256 1 1 1024 0.350 293.76 (0.06) 0.425 484.02 (0.11)
res4e branch2a 102.8 14 1024 1 1 256 0.130 790.71 (0.17) 0.162 1267.18 (0.28)
res4e branch2b 231.2 14 256 3 1 256 0.210 1100.90 (0.24) 0.230 2014.73 (0.44)
res4e branch2c 102.8 14 256 1 1 1024 0.350 293.68 (0.06) 0.438 469.19 (0.10)
res4f branch2a 102.8 14 1024 1 1 256 0.130 790.53 (0.17) 0.162 1265.78 (0.27)
res4f branch2b 231.2 14 256 3 1 256 0.210 1100.95 (0.24) 0.230 2007.12 (0.44)
res4f branch2c 102.8 14 256 1 1 1024 0.360 285.49 (0.06) 0.421 488.23 (0.11)
res5a branch2a 51.4 7 1024 1 2 512 0.120 427.94 (0.09) 0.188 546.52 (0.12)
res5a branch2b 231.2 7 512 3 1 512 0.330 699.93 (0.15) 0.493 937.72 (0.20)
res5a branch2c 102.8 7 512 1 1 2048 0.470 218.66 (0.05) 0.600 342.79 (0.07)
res5a branch1 205.5 7 1024 1 2 2048 0.517 397.60 (0.09) 0.691 594.55 (0.13)
res5b branch2a 102.8 7 2048 1 1 512 0.170 604.28 (0.13) 0.272 755.16 (0.16)
res5b branch2b 231.2 7 512 3 1 512 0.331 698.07 (0.15) 0.499 926.34 (0.20)
res5b branch2c 102.8 7 512 1 1 2048 0.500 205.56 (0.04) 0.628 327.34 (0.07)
res5c branch2a 102.8 7 2048 1 1 512 0.170 604.49 (0.13) 0.265 775.00 (0.17)
res5c branch2b 231.2 7 512 3 1 512 0.340 679.68 (0.15) 0.503 918.94 (0.20)
res5c branch2c 102.8 7 512 1 1 2048 0.500 205.55 (0.04) 0.632 325.22 (0.07)
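The thread=1 throughput and efficiency columns of Table 11 appear to follow from the per-layer operation count, the measured latency, and the theoretical peak in Table 10. The sketch below assumes throughput = operations / latency and E = throughput / peak, using the 4.60 TOPs peak of the ZCU104 DPU at 666 MHz. The function names are illustrative. The relation is checked only against the thread=1 columns, because the thread=8 throughput does not follow from a single latency value in this way.

```python
def layer_throughput_gops(mop, latency_ms, batch=1):
    """Throughput in GOPs from operations per input [MOP] and latency [ms]."""
    # MOP / ms = 1e6 ops / 1e-3 s = 1e9 ops/s = GOPs
    return batch * mop / latency_ms


def efficiency(throughput_gops, peak_tops):
    """Fraction of the theoretical peak that is actually achieved."""
    return throughput_gops / (peak_tops * 1e3)


# res2a branch2b on the ZCU104, thread=1: 231.2 MOP in 0.190 ms
gops = layer_throughput_gops(231.2, 0.190)
print(f"{gops:.2f} GOPs, E = {efficiency(gops, 4.60):.2f}")
```

With batch=128 the same helper approximates the batched columns of Table 12: 128 x 25.7 MOP in 5.05 ms gives about 651 GOPs. The small deviation from the tabulated value is consistent with the rounding of the [MOP] column.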
Table 12. Level 1 - TX2 (MaxN, FP16) Inference Results ResNet50 Individual Convolutional Layers
TX2 Network Parameters Figures of Merit
MaxN, FP16, batch=1 MaxN, FP16, batch=128
Layer [MOP] in dim in ch filter stride out ch Latency Throughput (E) Latency Throughput (E)
[ms] [GOPs] [ms] [GOPs]
res2a branch2a 25.7 56 64 1 1 64 0.06 414.52 (0.31) 5.05 651.15 (0.49)
res2a branch2b 231.2 56 64 3 1 64 0.19 1197.93 (0.90) 22.78 1299.39 (0.97)
res2a branch2c 102.8 56 64 1 1 256 0.18 577.53 (0.43) 20.15 653.15 (0.49)
res2a branch1 102.8 56 64 1 1 256 0.21 487.20 (0.37) 23.66 556.19 (0.42)
res2b branch2a 102.8 56 256 1 1 64 0.13 778.79 (0.58) 13.74 957.60 (0.72)
res2b branch2b 231.2 56 64 3 1 64 0.19 1210.47 (0.91) 22.87 1293.82 (0.97)
res2b branch2c 102.8 56 64 1 1 256 0.21 489.52 (0.37) 23.68 555.77 (0.42)
res2c branch2a 102.8 56 256 1 1 64 0.13 784.73 (0.59) 13.74 957.88 (0.72)
res2c branch2b 231.2 56 64 3 1 64 0.19 1210.47 (0.91) 22.85 1295.01 (0.97)
res2c branch2c 102.8 56 64 1 1 256 0.21 489.52 (0.37) 23.67 555.82 (0.42)
res3a branch2a 51.4 28 256 1 2 128 0.09 584.09 (0.44) 7.19 915.30 (0.69)
res3a branch2b 231.2 28 128 3 1 128 0.21 1095.73 (0.82) 24.63 1201.33 (0.90)
res3a branch2c 102.8 28 128 1 1 512 0.15 694.59 (0.52) 15.18 866.82 (0.65)
res3a branch1 205.5 28 256 1 2 512 0.29 718.53 (0.54) 30.35 866.60 (0.65)
res3b branch2a 102.8 28 512 1 1 128 0.13 767.16 (0.58) 12.07 1090.45 (0.82)
res3b branch2b 231.2 28 128 3 1 128 0.21 1106.22 (0.83) 24.66 1199.92 (0.90)
res3b branch2c 102.8 28 128 1 1 512 0.16 634.57 (0.48) 16.68 788.87 (0.59)
res3c branch2a 102.8 28 512 1 1 128 0.13 767.16 (0.58) 12.10 1087.11 (0.82)
res3c branch2b 231.2 28 128 3 1 128 0.21 1100.95 (0.83) 24.47 1209.19 (0.91)
res3c branch2c 102.8 28 128 1 1 512 0.16 634.57 (0.48) 16.71 787.65 (0.59)
res3d branch2a 102.8 28 512 1 1 128 0.13 767.16 (0.58) 12.12 1085.95 (0.81)
res3d branch2b 231.2 28 128 3 1 128 0.21 1106.22 (0.83) 24.69 1198.56 (0.90)
res3d branch2c 102.8 28 128 1 1 512 0.16 630.67 (0.47) 16.67 789.35 (0.59)
res4a branch2a 51.4 14 512 1 2 256 0.08 642.50 (0.48) 7.10 926.26 (0.69)
res4a branch2b 231.2 14 256 3 1 256 0.20 1185.64 (0.89) 23.12 1279.89 (0.96)
res4a branch2c 102.8 14 256 1 1 1024 0.15 708.97 (0.53) 13.01 1011.64 (0.76)
res4a branch1 205.5 14 512 1 2 1024 0.28 728.72 (0.55) 29.23 899.87 (0.68)
res4b branch2a 102.8 14 1024 1 1 256 0.13 784.73 (0.59) 11.55 1139.45 (0.85)
res4b branch2b 231.2 14 256 3 1 256 0.20 1179.59 (0.88) 22.33 1325.28 (0.99)
res4b branch2c 102.8 14 256 1 1 1024 0.15 680.79 (0.51) 13.75 957.18 (0.72)
res4c branch2a 102.8 14 1024 1 1 256 0.13 778.79 (0.58) 11.62 1132.10 (0.85)
res4c branch2b 231.2 14 256 3 1 256 0.20 1173.60 (0.88) 22.99 1287.35 (0.97)
res4c branch2c 102.8 14 256 1 1 1024 0.15 680.79 (0.51) 13.76 956.14 (0.72)
res4d branch2a 102.8 14 1024 1 1 256 0.13 778.79 (0.58) 11.57 1137.09 (0.85)
res4d branch2b 231.2 14 256 3 1 256 0.20 1185.64 (0.89) 22.92 1291.17 (0.97)
res4d branch2c 102.8 14 256 1 1 1024 0.15 680.79 (0.51) 13.76 956.00 (0.72)
res4e branch2a 102.8 14 1024 1 1 256 0.13 778.79 (0.58) 11.59 1135.32 (0.85)
res4e branch2b 231.2 14 256 3 1 256 0.20 1185.64 (0.89) 22.85 1295.41 (0.97)
res4e branch2c 102.8 14 256 1 1 1024 0.15 680.79 (0.51) 13.78 954.89 (0.72)
res4f branch2a 102.8 14 1024 1 1 256 0.13 784.73 (0.59) 11.65 1129.96 (0.85)
res4f branch2b 231.2 14 256 3 1 256 0.20 1179.59 (0.88) 22.40 1321.26 (0.99)
res4f branch2c 102.8 14 256 1 1 1024 0.15 680.79 (0.51) 13.78 955.17 (0.72)
res5a branch2a 51.4 7 1024 1 2 512 0.14 372.46 (0.28) 7.61 864.77 (0.65)
res5a branch2b 231.2 7 512 3 1 512 0.31 748.22 (0.56) 24.90 1188.59 (0.89)
res5a branch2c 102.8 7 512 1 1 2048 0.27 386.47 (0.29) 12.53 1049.90 (0.79)
res5a branch1 205.5 7 1024 1 2 2048 0.51 406.93 (0.31) 30.92 850.85 (0.64)
res5b branch2a 102.8 7 2048 1 1 512 0.22 475.93 (0.36) 11.35 1159.74 (0.87)
res5b branch2b 231.2 7 512 3 1 512 0.30 763.04 (0.57) 24.91 1188.26 (0.89)
res5b branch2c 102.8 7 512 1 1 2048 0.27 382.16 (0.29) 13.21 995.87 (0.75)
res5c branch2a 102.8 7 2048 1 1 512 0.22 473.73 (0.36) 11.39 1155.36 (0.87)
res5c branch2b 231.2 7 512 3 1 512 0.31 753.09 (0.56) 24.91 1187.88 (0.89)
res5c branch2c 102.8 7 512 1 1 2048 0.28 371.12 (0.28) 13.09 1005.53 (0.75)
Table 13. Level 2 - Inference Results ResNet50 Residual Layers
MaxN MaxQ MaxP
HW Layer Parameters Lat Throughput (E) Lat Throughput (E) Lat Throughput (E)
[ms] [GOPs] ([%]) [ms] [GOPs] ([%]) [ms] [GOPs] ([%])
TX2 res2a FP16, b=1 1.37 431.27 (0.32) 1.90 292.13 (0.25) 1.58 371.40 (0.42)
TX2 res2a FP16, b=2 2.17 464.25 (0.35) 3.08 314.23 (0.27) 2.50 401.64 (0.46)
TX2 res2a FP16, b=4 3.97 481.73 (0.36) 5.83 325.90 (0.28) 4.61 418.69 (0.48)
TX2 res2a FP16, b=8 7.69 491.73 (0.37) 11.30 330.22 (0.29) 8.91 426.23 (0.49)
TX2 res2a FP16, b=16 15.11 495.85 (0.37) 22.39 333.24 (0.29) 17.48 428.54 (0.49)
TX2 res2a FP16, b=32 30.39 436.04 (0.33) 44.20 333.95 (0.29) 34.49 430.49 (0.49)
TX2 res2a FP16, b=64 60.41 492.24 (0.37) 88.98 334.18 (0.29) 68.93 430.10 (0.49)
TX2 res2a FP16, b=128 119.12 495.33 (0.37) 177.37 333.95 (0.29) 137.24 430.10 (0.49)
TX2 res2b FP16, b=1 1.12 443.23 (0.33) 1.57 303.00 (0.26) 1.32 382.90 (0.44)
TX2 res2b FP16, b=2 1.95 481.92 (0.36) 2.77 333.25 (0.29) 2.29 416.41 (0.48)
TX2 res2b FP16, b=4 3.63 502.50 (0.38) 5.24 343.22 (0.30) 4.19 437.17 (0.50)
TX2 res2b FP16, b=8 6.95 509.95 (0.38) 10.06 352.14 (0.31) 7.99 445.00 (0.51)
TX2 res2b FP16, b=16 13.62 515.82 (0.39) 19.90 354.66 (0.31) 15.86 448.12 (0.51)
TX2 res2b FP16, b=32 27.19 518.82 (0.39) 39.57 356.64 (0.31) 31.05 451.28 (0.52)
TX2 res2b FP16, b=64 54.32 515.82 (0.39) 78.80 356.35 (0.31) 62.09 452.20 (0.52)
TX2 res2b FP16, b=128 108.24 517.02 (0.39) 158.53 355.50 (0.31) 124.29 450.38 (0.52)
TX2 res2c FP16, b=1 1.12 446.33 (0.33) 1.59 301.77 (0.26) 1.32 380.94 (0.44)
TX2 res2c FP16, b=2 1.96 483.48 (0.36) 2.75 333.75 (0.29) 2.27 419.53 (0.48)
TX2 res2c FP16, b=4 3.60 504.19 (0.38) 5.16 346.15 (0.30) 4.17 438.88 (0.50)
TX2 res2c FP16, b=8 6.89 512.87 (0.38) 10.12 347.49 (0.30) 8.04 446.33 (0.51)
TX2 res2c FP16, b=16 13.59 516.42 (0.39) 19.86 355.50 (0.31) 15.70 451.28 (0.52)
TX2 res2c FP16, b=32 27.01 518.22 (0.39) 39.22 356.92 (0.31) 31.14 451.74 (0.52)
TX2 res2c FP16, b=64 54.36 515.82 (0.39) 78.66 356.64 (0.31) 62.06 448.57 (0.51)
TX2 res2c FP16, b=128 108.07 516.42 (0.39) 158.11 355.22 (0.31) 123.99 449.92 (0.51)
TX2 res3a FP16, b=1 1.39 475.68 (0.36) 1.96 323.22 (0.28) 1.66 406.90 (0.47)
TX2 res3a FP16, b=2 2.38 523.41 (0.39) 3.45 356.13 (0.31) 2.81 449.19 (0.51)
TX2 res3a FP16, b=4 4.39 555.61 (0.42) 6.43 374.19 (0.33) 5.18 476.43 (0.55)
TX2 res3a FP16, b=8 8.46 563.90 (0.42) 12.39 385.63 (0.34) 9.80 490.72 (0.56)
TX2 res3a FP16, b=16 16.53 575.15 (0.43) 24.45 390.61 (0.34) 19.22 495.95 (0.57)
TX2 res3a FP16, b=32 32.86 576.80 (0.43) 48.62 391.37 (0.34) 38.07 498.40 (0.57)
TX2 res3a FP16, b=64 65.46 577.35 (0.43) 96.24 393.92 (0.34) 75.66 500.05 (0.57)
TX2 res3a FP16, b=128 133.09 568.13 (0.43) 194.49 389.61 (0.34) 151.72 496.77 (0.57)
TX2 res3b FP16, b=1 1.03 511.11 (0.38) 1.41 347.49 (0.30) 1.21 438.45 (0.50)
TX2 res3b FP16, b=2 1.69 562.54 (0.42) 2.40 385.54 (0.34) 1.96 488.23 (0.56)
TX2 res3b FP16, b=4 3.05 597.89 (0.45) 4.43 407.31 (0.35) 3.58 517.02 (0.59)
TX2 res3b FP16, b=8 5.80 616.86 (0.46) 8.45 420.72 (0.37) 6.75 533.04 (0.61)
TX2 res3b FP16, b=16 11.26 629.89 (0.47) 16.50 426.74 (0.37) 12.95 544.73 (0.62)
TX2 res3b FP16, b=32 22.24 630.78 (0.47) 33.51 428.78 (0.37) 25.61 547.40 (0.63)
TX2 res3b FP16, b=64 44.38 628.12 (0.47) 65.79 434.20 (0.38) 51.07 546.72 (0.63)
TX2 res3b FP16, b=128 87.72 636.16 (0.48) 129.87 434.62 (0.38) 101.77 548.74 (0.63)
TX2 res3c FP16, b=1 1.05 506.48 (0.38) 1.42 343.22 (0.30) 1.22 436.74 (0.50)
TX2 res3c FP16, b=2 1.70 561.84 (0.42) 2.43 380.61 (0.33) 1.96 486.64 (0.56)
TX2 res3c FP16, b=4 3.06 598.69 (0.45) 4.46 406.57 (0.35) 3.56 515.82 (0.59)
TX2 res3c FP16, b=8 5.81 613.47 (0.46) 8.47 419.93 (0.37) 6.70 536.24 (0.61)
TX2 res3c FP16, b=16 11.19 630.78 (0.47) 16.40 428.78 (0.37) 12.97 541.43 (0.62)
TX2 res3c FP16, b=32 22.05 632.56 (0.47) 32.55 430.43 (0.37) 25.68 546.72 (0.63)
TX2 res3c FP16, b=64 43.94 637.07 (0.48) 64.34 434.20 (0.38) 50.46 552.13 (0.63)
TX2 res3c FP16, b=128 87.87 636.16 (0.48) 129.02 431.68 (0.38) 101.34 550.09 (0.63)
TX2 res3d FP16, b=1 1.04 504.19 (0.38) 1.40 347.76 (0.30) 1.22 438.45 (0.50)
TX2 res3d FP16, b=2 1.70 561.13 (0.42) 2.40 385.20 (0.34) 1.95 488.77 (0.56)
TX2 res3d FP16, b=4 3.08 592.35 (0.44) 4.43 408.05 (0.36) 3.55 515.23 (0.59)
TX2 res3d FP16, b=8 5.84 613.47 (0.46) 8.41 423.10 (0.37) 6.71 536.88 (0.61)
TX2 res3d FP16, b=16 11.25 624.61 (0.47) 16.48 427.15 (0.37) 12.93 546.72 (0.63)
TX2 res3d FP16, b=32 22.04 631.67 (0.47) 32.74 430.85 (0.37) 25.51 551.45 (0.63)
TX2 res3d FP16, b=64 44.27 633.46 (0.48) 64.21 435.89 (0.38) 50.51 553.49 (0.63)
TX2 res3d FP16, b=128 88.66 630.78 (0.47) 129.91 431.27 (0.38) 101.91 549.41 (0.63)
TX2 res4a FP16, b=1 1.20 575.15 (0.43) 1.66 386.37 (0.34) 1.39 496.77 (0.57)
TX2 res4a FP16, b=2 2.09 604.46 (0.45) 3.01 411.33 (0.36) 2.41 524.32 (0.60)
TX2 res4a FP16, b=4 3.74 654.12 (0.49) 5.46 446.87 (0.39) 4.34 569.20 (0.65)
TX2 res4a FP16, b=8 7.00 688.35 (0.52) 10.16 472.33 (0.41) 8.03 600.26 (0.69)
TX2 res4a FP16, b=16 13.40 706.02 (0.53) 19.71 480.59 (0.42) 15.37 621.85 (0.71)
TX2 res4a FP16, b=32 26.40 713.52 (0.54) 38.60 493.12 (0.43) 30.30 627.01 (0.72)
TX2 res4a FP16, b=64 52.32 722.89 (0.54) 76.46 494.33 (0.43) 60.23 628.96 (0.72)
TX2 res4a FP16, b=128 104.66 723.76 (0.54) 152.47 496.77 (0.43) 119.12 633.57 (0.72)
TX2 res4b FP16, b=1 1.03 599.49 (0.45) 1.40 407.31 (0.35) 1.16 514.64 (0.59)
TX2 res4b FP16, b=2 1.53 644.41 (0.48) 2.11 437.60 (0.38) 1.74 562.54 (0.64)
TX2 res4b FP16, b=4 2.60 706.51 (0.53) 3.81 474.76 (0.41) 3.03 610.96 (0.70)
TX2 res4b FP16, b=8 4.87 740.43 (0.56) 7.16 497.46 (0.43) 5.58 648.15 (0.74)
TX2 res4b FP16, b=16 9.33 763.18 (0.57) 13.65 520.03 (0.45) 10.82 660.60 (0.76)
TX2 res4b FP16, b=32 18.13 769.75 (0.58) 26.64 526.15 (0.46) 20.84 674.54 (0.77)
TX2 res4b FP16, b=64 36.31 776.43 (0.58) 52.95 530.51 (0.46) 41.50 675.56 (0.77)
TX2 res4b FP16, b=128 71.33 780.49 (0.59) 105.55 529.88 (0.46) 82.15 680.70 (0.78)
TX2 res4c FP16, b=1 1.03 598.69 (0.45) 1.40 408.42 (0.36) 1.16 511.11 (0.58)
TX2 res4c FP16, b=2 1.52 647.21 (0.49) 2.12 437.60 (0.38) 1.74 557.63 (0.64)
TX2 res4c FP16, b=4 2.63 695.53 (0.52) 3.79 474.76 (0.41) 3.06 602.72 (0.69)
TX2 res4c FP16, b=8 5.17 697.69 (0.52) 7.11 499.69 (0.43) 5.61 644.41 (0.74)
TX2 res4c FP16, b=16 9.29 764.48 (0.57) 13.64 518.82 (0.45) 10.74 660.60 (0.76)
TX2 res4c FP16, b=32 18.15 773.74 (0.58) 26.63 528.63 (0.46) 21.05 666.50 (0.76)
TX2 res4c FP16, b=64 35.81 780.49 (0.59) 52.86 530.51 (0.46) 41.59 674.54 (0.77)
TX2 res4c FP16, b=128 72.52 773.74 (0.58) 105.65 527.39 (0.46) 82.21 680.70 (0.78)
TX2 res4d FP16, b=1 1.02 599.49 (0.45) 1.40 408.05 (0.36) 1.16 514.64 (0.59)
TX2 res4d FP16, b=2 1.52 650.98 (0.49) 2.12 438.02 (0.38) 1.74 558.33 (0.64)
TX2 res4d FP16, b=4 2.62 703.18 (0.53) 3.82 474.26 (0.41) 3.02 613.47 (0.70)
TX2 res4d FP16, b=8 4.84 745.37 (0.56) 7.13 503.06 (0.44) 5.60 644.41 (0.74)
TX2 res4d FP16, b=16 9.29 763.18 (0.57) 13.58 521.24 (0.45) 10.72 661.57 (0.76)
TX2 res4d FP16, b=32 18.32 769.75 (0.58) 26.49 529.88 (0.46) 20.92 673.53 (0.77)
TX2 res4d FP16, b=64 36.17 775.08 (0.58) 53.11 528.63 (0.46) 41.41 678.64 (0.78)
TX2 res4d FP16, b=128 72.01 779.13 (0.58) 105.92 526.15 (0.46) 82.10 679.67 (0.78)
TX2 res4e FP16, b=1 1.02 601.10 (0.45) 1.40 407.31 (0.35) 1.16 514.64 (0.59)
TX2 res4e FP16, b=2 1.53 649.09 (0.49) 2.12 437.17 (0.38) 1.75 560.43 (0.64)
TX2 res4e FP16, b=4 2.64 696.61 (0.52) 3.81 474.26 (0.41) 3.03 610.13 (0.70)
TX2 res4e FP16, b=8 4.85 740.43 (0.56) 7.19 500.81 (0.44) 5.63 644.41 (0.74)
TX2 res4e FP16, b=16 9.33 758.00 (0.57) 13.67 518.22 (0.45) 10.68 663.53 (0.76)
TX2 res4e FP16, b=32 18.10 771.07 (0.58) 26.59 526.15 (0.46) 20.95 671.51 (0.77)
TX2 res4e FP16, b=64 35.83 780.49 (0.59) 52.81 531.14 (0.46) 41.24 676.58 (0.77)
TX2 res4e FP16, b=128 72.39 775.08 (0.58) 105.40 529.25 (0.46) 82.50 679.67 (0.78)
TX2 res4f FP16, b=1 1.02 599.49 (0.45) 1.39 408.42 (0.36) 1.17 514.05 (0.59)
TX2 res4f FP16, b=2 1.51 650.03 (0.49) 2.12 437.17 (0.38) 1.75 559.03 (0.64)
TX2 res4f FP16, b=4 2.64 699.88 (0.53) 3.80 474.76 (0.41) 3.02 610.96 (0.70)
TX2 res4f FP16, b=8 4.83 741.66 (0.56) 7.10 500.81 (0.44) 5.60 642.56 (0.74)
TX2 res4f FP16, b=16 9.27 763.18 (0.57) 13.57 521.85 (0.45) 10.76 664.52 (0.76)
TX2 res4f FP16, b=32 18.24 767.10 (0.58) 26.66 525.52 (0.46) 21.02 669.49 (0.77)
TX2 res4f FP16, b=64 36.10 771.07 (0.58) 52.77 532.41 (0.46) 41.20 675.56 (0.77)
TX2 res4f FP16, b=128 71.76 776.43 (0.58) 105.63 533.68 (0.46) 82.10 677.61 (0.78)
TX2 res5a FP16, b=1 1.73 413.29 (0.31) 2.43 281.55 (0.25) 1.98 357.60 (0.41)
TX2 res5a FP16, b=2 2.05 634.24 (0.48) 2.88 430.04 (0.37) 2.34 552.57 (0.63)
TX2 res5a FP16, b=4 3.68 664.17 (0.50) 5.44 447.20 (0.39) 4.29 576.80 (0.66)
TX2 res5a FP16, b=8 7.30 659.82 (0.49) 10.75 445.88 (0.39) 8.45 572.43 (0.65)
TX2 res5a FP16, b=16 13.52 707.67 (0.53) 19.64 484.05 (0.42) 15.51 615.53 (0.70)
TX2 res5a FP16, b=32 25.37 746.07 (0.56) 36.80 514.95 (0.45) 29.04 656.96 (0.75)
TX2 res5a FP16, b=64 48.45 778.71 (0.58) 71.37 532.16 (0.46) 56.06 675.29 (0.77)
TX2 res5a FP16, b=128 95.61 791.96 (0.59) 140.12 540.72 (0.47) 110.13 686.01 (0.78)
TX2 res5b FP16, b=1 1.24 456.82 (0.34) 1.71 307.16 (0.27) 1.41 391.96 (0.45)
TX2 res5b FP16, b=2 1.60 666.50 (0.50) 2.29 447.22 (0.39) 1.84 572.63 (0.66)
TX2 res5b FP16, b=4 2.66 704.29 (0.53) 3.85 472.75 (0.41) 3.12 607.64 (0.70)
TX2 res5b FP16, b=8 4.98 723.66 (0.54) 7.40 481.40 (0.42) 5.79 626.36 (0.72)
TX2 res5b FP16, b=16 8.79 810.18 (0.61) 13.01 540.78 (0.47) 10.17 704.29 (0.81)
TX2 res5b FP16, b=32 16.33 861.70 (0.65) 24.27 577.06 (0.50) 18.76 752.90 (0.86)
TX2 res5b FP16, b=64 31.47 892.66 (0.67) 47.00 596.29 (0.52) 36.27 775.08 (0.89)
TX2 res5b FP16, b=128 61.55 907.14 (0.68) 92.43 606.82 (0.53) 71.35 787.36 (0.90)
TX2 res5c FP16, b=1 1.23 460.58 (0.35) 1.71 312.74 (0.27) 1.38 398.95 (0.46)
TX2 res5c FP16, b=2 1.59 665.51 (0.50) 2.28 448.57 (0.39) 1.84 571.89 (0.65)
TX2 res5c FP16, b=4 2.66 705.40 (0.53) 3.85 469.77 (0.41) 3.10 610.13 (0.70)
TX2 res5c FP16, b=8 4.99 720.16 (0.54) 7.48 480.37 (0.42) 5.80 626.36 (0.72)
TX2 res5c FP16, b=16 8.79 811.66 (0.61) 13.08 540.78 (0.47) 10.11 703.18 (0.80)
TX2 res5c FP16, b=32 16.42 863.36 (0.65) 24.29 576.32 (0.50) 18.80 746.62 (0.85)
TX2 res5c FP16, b=64 31.34 892.66 (0.67) 47.02 596.29 (0.52) 36.12 777.78 (0.89)
TX2 res5c FP16, b=128 61.58 907.14 (0.68) 92.32 603.54 (0.53) 71.16 787.36 (0.90)
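Tables 12 and 13 can be combined to compare a residual layer with the sum of its constituent convolutions. The sketch below uses the res2a values for the TX2 (MaxN, FP16, b=1). The dictionary layout and function name are illustrative, and summing the latencies assumes the convolutions execute one after another. A residual layer also contains non-convolutional operations such as batch normalization, ReLU and the element-wise addition, which may account for part of the gap.

```python
# Level 1 latencies [ms] of the res2a convolutions (Table 12, MaxN, FP16, batch=1)
RES2A_CONV_LATENCY_MS = {
    "res2a branch2a": 0.06,
    "res2a branch2b": 0.19,
    "res2a branch2c": 0.18,
    "res2a branch1": 0.21,
}
# Level 2 latency [ms] of the complete res2a layer (Table 13, MaxN, FP16, b=1)
RES2A_LAYER_LATENCY_MS = 1.37


def latency_gap(conv_latencies_ms, layer_latency_ms):
    """Return (summed conv latency, ratio of layer latency to that sum)."""
    conv_sum = sum(conv_latencies_ms.values())
    return conv_sum, layer_latency_ms / conv_sum


conv_sum, ratio = latency_gap(RES2A_CONV_LATENCY_MS, RES2A_LAYER_LATENCY_MS)
print(f"convolutions: {conv_sum:.2f} ms, layer/convolution ratio: {ratio:.2f}")
```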
Table 14. Level 3 - Inference Results ResNet50
ResNet50 Top-5 (Top-1) Acc Latency [ms] Throughput [GOPs] (Efficiency [%]) Power [W]
Platform Parameters [%] system compute system compute
ZCU104 INT8, t=1 90.85 (72.53) 17.96 14.96 324.91 (0.08) 516.07 (0.13) 21.41
ZCU104 INT8, t=2 90.85 (72.53) 19.69 16.63 786.49 (0.19) 931.12 (0.23) 25.78
ZCU104 INT8, t=3 90.85 (72.53) 25.37 22.46 906.78 (0.22) 1029.32 (0.25) 26.39
ZCU104 INT8, t=4 90.85 (72.53) 33.46 30.48 918.25 (0.23) 1017.81 (0.25) 26.53
ZCU104 INT8, t=5 90.85 (72.53) 40.28 37.26 943.85 (0.23) 1067.31 (0.26) 26.82
ZCU104 INT8, t=6 90.85 (72.53) 47.95 45.27 946.97 (0.23) 1056.38 (0.26) 26.83
ZCU104 INT8, t=7 90.85 (72.53) 56.74 53.61 943.37 (0.23) 1047.84 (0.26) 26.86
ZCU104 INT8, t=8 90.85 (72.53) 64.88 62.30 948.05 (0.23) 1044.11 (0.26) 26.89
TX2, MaxN FP16, b=1 92.12 (75.11) 13.99 10.68 547.84 (0.41) 725.23 (0.54) 12.57
TX2, MaxN FP16, b=2 92.12 (75.11) 23.65 18.25 651.39 (0.49) 855.59 (0.64) 13.57
TX2, MaxN FP16, b=4 92.12 (75.11) 43.79 34.23 703.76 (0.53) 911.67 (0.68) 13.76
TX2, MaxN FP16, b=8 92.12 (75.11) 84.79 65.86 728.19 (0.55) 937.28 (0.70) 13.92
TX2, MaxN FP16, b=16 92.12 (75.11) 162.58 126.23 759.76 (0.57) 976.60 (0.73) 13.69
TX2, MaxN FP16, b=32 92.12 (75.11) 317.87 247.34 779.00 (0.58) 999.84 (0.75) 13.70
TX2, MaxN FP16, b=64 92.12 (75.11) 620.08 490.37 799.24 (0.60) 1006.85 (0.76) 13.76
TX2, MaxN FP16, b=128 92.12 (75.11) 1211.85 975.98 809.47 (0.61) 1011.95 (0.76) 13.82
TX2, MaxN FP32, b=1 92.11 (75.15) 22.32 18.97 344.88 (0.26) 407.51 (0.31) 14.58
TX2, MaxN FP32, b=2 92.11 (75.15) 38.46 32.96 401.14 (0.30) 470.87 (0.35) 15.02
TX2, MaxN FP32, b=4 92.11 (75.15) 72.96 62.96 423.04 (0.32) 491.63 (0.37) 15.05
TX2, MaxN FP32, b=8 92.11 (75.15) 141.13 122.18 437.53 (0.33) 506.07 (0.38) 15.19
TX2, MaxN FP32, b=16 92.11 (75.15) 272.41 235.85 453.44 (0.34) 523.42 (0.39) 15.34
TX2, MaxN FP32, b=32 92.11 (75.15) 531.12 460.67 465.45 (0.35) 536.20 (0.40) 15.39
TX2, MaxN FP32, b=64 92.11 (75.15) 1042.67 913.42 473.23 (0.36) 539.88 (0.41) 15.51
TX2, MaxN FP32, b=128 92.11 (75.15) 2115.54 1810.90 462.78 (0.35) 544.51 (0.41) 15.21
TX2, MaxQ FP16, b=1 92.12 (75.11) 20.46 15.86 376.05 (0.28) 496.32 (0.37) 6.83
TX2, MaxQ FP16, b=2 92.12 (75.11) 34.65 26.69 444.47 (0.33) 582.64 (0.44) 6.94
TX2, MaxQ FP16, b=4 92.12 (75.11) 64.53 50.01 477.72 (0.36) 618.73 (0.46) 7.00
TX2, MaxQ FP16, b=8 92.12 (75.11) 124.75 96.69 494.70 (0.37) 638.71 (0.48) 7.06
TX2, MaxQ FP16, b=16 92.12 (75.11) 239.00 185.17 516.18 (0.39) 666.23 (0.50) 7.13
TX2, MaxQ FP16, b=32 92.12 (75.11) 466.49 362.02 529.31 (0.40) 682.18 (0.51) 7.12
TX2, MaxQ FP16, b=64 92.12 (75.11) 924.53 717.12 534.18 (0.40) 687.93 (0.52) 7.13
TX2, MaxQ FP16, b=128 92.12 (75.11) 1838.48 1429.00 536.31 (0.40) 691.03 (0.52) 7.15
TX2, MaxQ FP32, b=1 92.11 (75.15) 32.66 27.94 235.37 (0.18) 279.20 (0.21) 7.60
TX2, MaxQ FP32, b=2 92.11 (75.15) 56.36 48.32 273.37 (0.21) 320.87 (0.24) 7.79
TX2, MaxQ FP32, b=4 92.11 (75.15) 106.84 92.09 288.78 (0.22) 335.67 (0.25) 7.77
TX2, MaxQ FP32, b=8 92.11 (75.15) 207.45 179.45 297.69 (0.22) 344.35 (0.26) 7.85
TX2, MaxQ FP32, b=16 92.11 (75.15) 398.74 344.35 309.54 (0.23) 358.82 (0.27) 7.97
TX2, MaxQ FP32, b=32 92.11 (75.15) 779.69 673.93 316.63 (0.24) 366.42 (0.27) 7.99
TX2, MaxQ FP32, b=64 92.11 (75.15) 1540.33 1333.24 320.57 (0.24) 370.36 (0.28) 8.03
TX2, MaxQ FP32, b=128 92.11 (75.15) 3118.09 2650.98 315.38 (0.24) 372.08 (0.28) 7.93
TX2, MaxP FP16, b=1 92.12 (75.11) 16.52 12.46 464.14 (0.35) 632.05 (0.47) 9.38
TX2, MaxP FP16, b=2 92.12 (75.11) 28.10 20.92 547.57 (0.41) 745.09 (0.56) 9.59
TX2, MaxP FP16, b=4 92.12 (75.11) 52.22 39.14 590.78 (0.44) 790.58 (0.59) 9.71
TX2, MaxP FP16, b=8 92.12 (75.11) 100.21 75.54 615.58 (0.46) 818.10 (0.61) 9.81
TX2, MaxP FP16, b=16 92.12 (75.11) 193.89 145.36 637.01 (0.48) 849.79 (0.64) 9.79
TX2, MaxP FP16, b=32 92.12 (75.11) 375.86 283.47 657.79 (0.49) 870.94 (0.65) 9.79
TX2, MaxP FP16, b=64 92.12 (75.11) 733.74 562.74 673.19 (0.51) 879.15 (0.66) 9.81
TX2, MaxP FP16, b=128 92.12 (75.11) 1453.87 1121.27 677.63 (0.51) 879.65 (0.66) 9.86
TX2, MaxP FP32, b=1 92.11 (75.15) 26.16 22.00 293.60 (0.22) 355.65 (0.27) 10.54
TX2, MaxP FP32, b=2 92.11 (75.15) 45.07 37.86 341.72 (0.26) 410.11 (0.31) 10.83
TX2, MaxP FP32, b=4 92.11 (75.15) 85.31 72.25 361.93 (0.27) 428.79 (0.32) 10.86
TX2, MaxP FP32, b=8 92.11 (75.15) 165.16 140.36 373.86 (0.28) 440.83 (0.33) 11.01
TX2, MaxP FP32, b=16 92.11 (75.15) 318.38 270.54 387.72 (0.29) 456.71 (0.34) 11.13
TX2, MaxP FP32, b=32 92.11 (75.15) 621.30 528.54 397.67 (0.30) 467.21 (0.35) 11.15
TX2, MaxP FP32, b=64 92.11 (75.15) 1219.10 1046.82 404.38 (0.30) 471.56 (0.35) 11.24
TX2, MaxP FP32, b=128 92.11 (75.15) 2495.53 2076.54 393.22 (0.29) 474.95 (0.36) 10.88
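Table 14 reports both system and compute figures together with power. Two derived quantities, sketched below, can help when reading it: the share of system latency spent outside compute, and system throughput per watt. Both definitions and the function names are assumptions for illustration, not metrics defined in the table.

```python
def system_overhead_fraction(system_ms, compute_ms):
    """Share of system latency not spent in the compute portion."""
    return (system_ms - compute_ms) / system_ms


def gops_per_watt(throughput_gops, power_w):
    """Throughput normalized by measured power."""
    return throughput_gops / power_w


# (system latency [ms], compute latency [ms], system throughput [GOPs], power [W])
# copied from two rows of Table 14
rows = {
    "ZCU104 INT8, t=1": (17.96, 14.96, 324.91, 21.41),
    "TX2 MaxN FP16, b=1": (13.99, 10.68, 547.84, 12.57),
}
for name, (sys_ms, comp_ms, gops, watts) in rows.items():
    print(f"{name}: overhead {system_overhead_fraction(sys_ms, comp_ms):.1%}, "
          f"{gops_per_watt(gops, watts):.1f} GOPs/W")
```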
Table 15. Level 3 - Inference Results GoogleNetV1
GoogLeNet Top-5 (Top-1) Acc Latency [ms] Throughput [GOPs] (Efficiency [%]) Power [W]
Platform Parameters [%] system compute system compute
ZCU104 INT8, t=1 89.26 (69.49) 9.65 6.68 323.50 (0.08) 468.64 (0.12) 21.49
ZCU104 INT8, t=2 89.26 (69.49) 9.99 7.06 499.97 (0.12) 895.19 (0.22) 24.60
ZCU104 INT8, t=3 89.26 (69.49) 11.88 9.30 784.16 (0.19) 1050.36 (0.26) 25.41
ZCU104 INT8, t=4 89.26 (69.49) 15.85 12.95 782.99 (0.19) 971.12 (0.24) 25.50
ZCU104 INT8, t=5 89.26 (69.49) 18.24 15.27 848.31 (0.21) 1122.66 (0.28) 25.96
ZCU104 INT8, t=6 89.26 (69.49) 21.88 18.35 851.75 (0.21) 1116.67 (0.27) 26.00
ZCU104 INT8, t=7 89.26 (69.49) 25.60 22.78 853.81 (0.21) 1081.32 (0.27) 25.93
ZCU104 INT8, t=8 89.26 (69.49) 28.91 25.94 846.70 (0.21) 1062.84 (0.26) 25.90
TX2, MaxN FP16, b=1 87.85 (66.94) 7.98 5.17 388.92 (0.29) 612.39 (0.46) 11.52
TX2, MaxN FP16, b=2 87.85 (66.94) 14.68 9.23 438.58 (0.33) 693.35 (0.52) 11.40
TX2, MaxN FP16, b=4 87.85 (66.94) 26.31 16.36 475.75 (0.36) 770.88 (0.58) 12.00
TX2, MaxN FP16, b=8 87.85 (66.94) 49.83 30.96 508.99 (0.38) 811.73 (0.61) 12.21
TX2, MaxN FP16, b=16 87.85 (66.94) 95.81 59.47 544.72 (0.41) 843.35 (0.63) 12.34
TX2, MaxN FP16, b=32 87.85 (66.94) 185.96 116.62 538.49 (0.40) 859.48 (0.64) 11.85
TX2, MaxN FP16, b=64 87.85 (66.94) 389.12 231.45 511.92 (0.38) 862.22 (0.65) 11.62
TX2, MaxN FP16, b=128 87.85 (66.94) 810.86 459.85 493.02 (0.37) 871.35 (0.65) 11.29
TX2, MaxN FP32, b=1 87.84 (66.94) 12.01 8.71 259.52 (0.19) 361.81 (0.27) 12.79
TX2, MaxN FP32, b=2 87.84 (66.94) 21.01 15.62 297.38 (0.22) 406.52 (0.30) 13.06
TX2, MaxN FP32, b=4 87.84 (66.94) 38.37 28.47 325.96 (0.24) 441.59 (0.33) 13.35
TX2, MaxN FP32, b=8 87.84 (66.94) 72.78 53.96 320.83 (0.24) 464.59 (0.35) 13.56
TX2, MaxN FP32, b=16 87.84 (66.94) 141.92 104.99 353.06 (0.26) 477.28 (0.36) 13.70
TX2, MaxN FP32, b=32 87.84 (66.94) 279.20 206.82 361.42 (0.27) 484.49 (0.36) 13.77
TX2, MaxN FP32, b=64 87.84 (66.94) 571.52 409.48 347.85 (0.26) 488.15 (0.37) 13.53
TX2, MaxN FP32, b=128 87.84 (66.94) 1165.91 811.57 342.21 (0.26) 490.41 (0.37) 13.17
TX2, MaxQ FP16, b=1 87.85 (66.94) 12.42 7.71 249.66 (0.19) 420.01 (0.32) 5.88
TX2, MaxQ FP16, b=2 87.85 (66.94) 21.48 13.39 290.21 (0.22) 474.86 (0.36) 6.01
TX2, MaxQ FP16, b=4 87.85 (66.94) 38.55 23.99 323.91 (0.24) 524.72 (0.39) 6.13
TX2, MaxQ FP16, b=8 87.85 (66.94) 73.30 45.40 340.72 (0.26) 552.66 (0.41) 6.20
TX2, MaxQ FP16, b=16 87.85 (66.94) 141.55 87.55 353.38 (0.27) 571.79 (0.43) 6.24
TX2, MaxQ FP16, b=32 87.85 (66.94) 277.12 172.07 362.00 (0.27) 582.64 (0.44) 6.24
TX2, MaxQ FP16, b=64 87.85 (66.94) 574.23 340.85 346.43 (0.26) 588.26 (0.44) 6.09
TX2, MaxQ FP16, b=128 87.85 (66.94) 1182.94 680.24 335.61 (0.25) 589.83 (0.44) 6.03
TX2, MaxQ FP32, b=1 87.84 (66.94) 17.54 12.82 177.36 (0.13) 249.47 (0.19) 6.70
TX2, MaxQ FP32, b=2 87.84 (66.94) 30.80 22.77 202.46 (0.15) 277.85 (0.21) 6.89
TX2, MaxQ FP32, b=4 87.84 (66.94) 56.49 41.81 220.94 (0.17) 300.29 (0.23) 7.02
TX2, MaxQ FP32, b=8 87.84 (66.94) 107.28 79.34 233.39 (0.18) 316.27 (0.24) 7.11
TX2, MaxQ FP32, b=16 87.84 (66.94) 208.77 154.43 239.76 (0.18) 324.46 (0.24) 7.16
TX2, MaxQ FP32, b=32 87.84 (66.94) 410.81 303.79 244.50 (0.18) 330.06 (0.25) 7.18
TX2, MaxQ FP32, b=64 87.84 (66.94) 835.95 602.25 238.58 (0.18) 332.87 (0.25) 7.08
TX2, MaxQ FP32, b=128 87.84 (66.94) 1702.86 1196.52 232.88 (0.17) 333.96 (0.25) 7.02
TX2, MaxP FP16, b=1 87.85 (66.94) 9.69 6.09 318.13 (0.24) 532.93 (0.40) 8.07
TX2, MaxP FP16, b=2 87.85 (66.94) 16.66 10.53 374.39 (0.28) 606.07 (0.45) 8.28
TX2, MaxP FP16, b=4 87.85 (66.94) 31.93 18.85 415.17 (0.31) 668.51 (0.50) 8.47
TX2, MaxP FP16, b=8 87.85 (66.94) 60.46 35.59 428.38 (0.32) 705.81 (0.53) 8.61
TX2, MaxP FP16, b=16 87.85 (66.94) 108.66 68.56 459.58 (0.34) 731.15 (0.55) 8.69
TX2, MaxP FP16, b=32 87.85 (66.94) 226.68 134.27 443.19 (0.33) 745.48 (0.56) 8.35
TX2, MaxP FP16, b=64 87.85 (66.94) 478.58 266.78 409.86 (0.31) 749.71 (0.56) 8.07
TX2, MaxP FP16, b=128 87.85 (66.94) 1019.48 530.68 392.21 (0.29) 753.68 (0.57) 7.74
TX2, MaxP FP32, b=1 87.84 (66.94) 14.35 10.17 217.30 (0.16) 315.68 (0.24) 9.15
TX2, MaxP FP32, b=2 87.84 (66.94) 25.01 17.89 249.21 (0.19) 354.04 (0.27) 9.37
TX2, MaxP FP32, b=4 87.84 (66.94) 45.84 32.72 272.61 (0.20) 382.77 (0.29) 9.56
TX2, MaxP FP32, b=8 87.84 (66.94) 86.85 61.95 290.61 (0.22) 404.57 (0.30) 9.71
TX2, MaxP FP32, b=16 87.84 (66.94) 169.26 120.61 295.84 (0.22) 415.80 (0.31) 9.75
TX2, MaxP FP32, b=32 87.84 (66.94) 330.80 237.45 302.80 (0.23) 421.99 (0.32) 9.77
TX2, MaxP FP32, b=64 87.84 (66.94) 687.11 470.42 288.46 (0.22) 425.71 (0.32) 9.50
TX2, MaxP FP32, b=128 87.84 (66.94) 1426.28 932.76 279.73 (0.21) 427.71 (0.32) 9.16
Table 16. Level 1 and 2 - Discrepancy between latency of different convolutions and residual layers

Level 2: residual layers (TX2, MaxN, FP16)
Layer   [MOP]    b=1 [ms]   b=128 [ms]
res2a   462.44   1.37       119.12
res2b   436.74   1.12       108.24
res2c   436.74   1.12       108.07
res3a   590.88   1.39       133.09
res3b   436.74   1.03       87.72
res3c   436.74   1.05       87.87
res3d   436.74   1.04       88.66
res4a   590.88   1.20       104.66
res4b   436.74   1.03       71.33
res4c   436.74   1.03       72.52
res4d   436.74   1.02       72.01
res4e   436.74   1.02       72.39
res4f   436.74   1.02       71.76
res5a   590.88   1.73       95.61
res5b   436.74   1.24       61.55
res5c   436.74   1.23       61.58
Min     -        1.02       61.55
Max     -        1.73       133.09
Var     -        0.04       454.94

Level 1: convolution layers (TX2, MaxN, FP16 and ZCU104, INT8)
Layer                 [MOP]    TX2 b=1 [ms]   TX2 b=128 [ms]   ZCU104 t=1 [ms]   ZCU104 t=8 [ms]
res2a branch2a, 1x1   25.70    0.06           5.05             0.06              0.08
res2a branch2b, 3x3   231.20   0.19           22.78            0.19              0.19
res2a branch2c, 1x1   102.80   0.18           20.15            0.22              0.26
res2a branch1, 1x1    102.80   0.21           23.66            0.43              0.46
res3a branch2a, 1x1   51.40    0.09           7.19             0.09              0.13
res3a branch2b, 3x3   231.20   0.21           24.63            0.21              0.21
res3a branch2c, 1x1   102.80   0.15           15.18            0.21              0.25
res3a branch1, 1x1    205.50   0.29           30.35            0.33              0.39
res4a branch2a, 1x1   51.40    0.08           7.10             0.12              0.13
res4a branch2b, 3x3   231.20   0.20           23.12            0.21              0.23
res4a branch2c, 1x1   102.80   0.15           13.01            0.29              0.38
res4a branch1, 1x1    205.50   0.28           29.23            0.43              0.50
res5a branch2a, 1x1   51.40    0.14           7.61             0.12              0.19
res5a branch2b, 3x3   231.20   0.31           24.90            0.33              0.49
res5a branch2c, 1x1   102.80   0.27           12.53            0.47              0.60
res5a branch1, 1x1    205.50   0.51           30.92            0.52              0.69
Min                   -        0.06           5.05             0.06              0.08
Max                   -        0.51           30.92            0.52              0.69
Var                   -        0.01           79.42            0.02              0.03
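The Min, Max and Var rows of Table 16 can be recomputed from the per-layer latencies. The following minimal Python sketch does this for the Level 2 (residual layer) columns. The helper name `summarize` is illustrative and not part of the benchmark. The text does not state which variance estimator was used; the sketch assumes the sample variance (n-1 denominator), which appears consistent with the printed values after rounding to two decimals.

```python
from statistics import variance

# Level 2 residual-layer latencies in ms on TX2, MaxN, FP16,
# copied from Table 16 in layer order res2a ... res5c.
LATENCY_B1 = [1.37, 1.12, 1.12, 1.39, 1.03, 1.05, 1.04, 1.20,
              1.03, 1.03, 1.02, 1.02, 1.02, 1.73, 1.24, 1.23]
LATENCY_B128 = [119.12, 108.24, 108.07, 133.09, 87.72, 87.87, 88.66, 104.66,
                71.33, 72.52, 72.01, 72.39, 71.76, 95.61, 61.55, 61.58]


def summarize(latencies_ms):
    """Return min, max and sample variance of a latency column.

    Assumption: the n-1 (sample) estimator is used for the variance.
    """
    return {
        "min": min(latencies_ms),
        "max": max(latencies_ms),
        "var": variance(latencies_ms),
    }


if __name__ == "__main__":
    for label, column in (("b=1", LATENCY_B1), ("b=128", LATENCY_B128)):
        s = summarize(column)
        print(f"{label}: min={s['min']:.2f} max={s['max']:.2f} var={s['var']:.2f}")
```

Rounded to two decimals, the sketch yields a variance of 0.04 for b=1 and 454.94 for b=128, in line with the Var row. A population variance (n denominator) would give roughly 426.5 for the b=128 column instead, which is why the sample estimator is assumed here.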
