The Pitfall of Evaluating Performance on Emerging AI Accelerators by Jiang, Zihan et al.
The Pitfall of Evaluating Performance on
Emerging AI Accelerators
Zihan Jiang1,2*, Jiansong Li1,2*, and Jianfeng Zhan1,2
1 University of Chinese Academy of Sciences, Beijing
2 Institute of Computing Technology, Chinese Academy of Sciences, Beijing
{jiangzihan, lijiansong, zhanjianfeng}@ict.ac.cn
Abstract. In recent years, domain-specific hardware has brought significant
performance improvements in deep learning (DL). Both industry and academia
focus only on throughput when evaluating these AI accelerators, which are
usually custom ASICs deployed in datacenters to speed up the inference phase
of DL workloads. Pursuing higher hardware throughput, such as OPS (Operations
Per Second), through various optimizations seems to be their main design
target. However, they ignore the importance of accuracy, which is inherent to
DL. Motivated by this, this paper argues that a single throughput metric
cannot comprehensively reflect the real-world performance of AI accelerators.
To reveal this pitfall, we evaluate several frequently-used optimizations on a
typical AI accelerator and quantify their impact on accuracy and throughput
under representative DL inference workloads. Based on our experimental
results, we find that some optimizations cause significant accuracy loss in
some workloads, although they can improve throughput. Furthermore, our results
show the importance of end-to-end evaluation in DL.
Keywords: Deep Learning · Domain-specific Hardware · Performance
Evaluation.
1 Introduction
Deep learning (DL) has revolutionized many challenging AI domains, such as
image recognition [12,20] and natural language processing [25,28]. At the same
time, progressively larger DL models and datasets are becoming more
computationally expensive for both training and inference. To keep up with the
growing computational demand of modern DL workloads, hardware specialization
has become an important approach [13,19,3,6,22]. However, perhaps because of
the gap between the architecture and DL domains, these AI accelerators usually
aim to provide higher throughput (e.g., OPS) and ignore accuracy, which is the
most important metric in DL. The performance numbers of mainstream AI
accelerators are summarized in Table 1. Based on our investigation, throughput
is the main concern in industry. INT8 quantization is the main optimization
technique, bringing higher throughput while saving power and memory. In
addition, inference is the most significant application of AI accelerators so
far.
* Equal contribution
arXiv:1911.02987v1 [cs.PF] 8 Nov 2019
Table 1: The performance numbers of mainstream AI accelerators
AI Accelerator    | Producer  | Performance Numbers      | Memory     | Power     | Application
TPU V1 [7]        | Google    | 92 TOPS (INT8)           | 8GB        | 75W       | Inference
Hanguang 800 [1]  | Alibaba   | 78563 IPS (INT8)         | Unknown    | 500 IPS/W | Inference
MLU100-C [2]      | Cambricon | 128 TOPS (INT8)          | 8GB        | 75W       | Inference
Atlas 300 [15]    | Huawei    | 64 TOPS (INT8)           | 32GB       | 67W       | Inference
MLU270-S4/F4 [2]  | Cambricon | 128 TOPS (INT8)          | 16GB       | 70W/150W  | Inference
TPU V2/V3 [8]     | Google    | 180 TFLOPS/420 TFLOPS    | 64GB/128GB | Unknown   | Training and Inference
Previous work, MLPerf [23] and DawnBench [4], presents time-to-accuracy, a
metric that measures the training time needed to reach a target accuracy, to
emphasize the necessity of accuracy in evaluating DL training. In this paper,
we generalize the evaluation of accuracy to the DL inference phase by
measuring end-to-end throughput under an accuracy constraint. From these two
evaluation perspectives, we conduct a series of experiments on a typical AI
accelerator, called ACC-1. Under representative DL inference workloads, we
quantify the impact of several frequently-used optimizations on accuracy and
throughput, and find that some optimizations cause significant accuracy loss
in some DL workloads, although they can improve throughput. On the same
dataset, different DL models suffer different degrees of accuracy loss.
Furthermore, our results show the importance of end-to-end evaluation in DL.
The main contribution of this paper is revealing the pitfall of evaluating
emerging AI accelerators. Specifically, a single throughput metric cannot
comprehensively reflect the real-world performance of AI accelerators, because
accuracy is not negligible.
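The idea of measuring throughput under an accuracy constraint can be sketched as follows. This is a minimal illustration, not the evaluation harness used in this paper: the `Config` type, the function name and the numbers in the usage note are hypothetical.

```python
from typing import NamedTuple, Optional


class Config(NamedTuple):
    """One evaluated configuration of a model on an accelerator (hypothetical)."""
    name: str
    top1_accuracy: float   # in percent
    end_to_end_fps: float  # frames per second


def best_under_accuracy_constraint(configs, min_accuracy) -> Optional[Config]:
    """Return the fastest configuration that still meets the accuracy target.

    Configurations below the target are discarded first, so a faster but less
    accurate variant can never win on throughput alone.
    """
    eligible = [c for c in configs if c.top1_accuracy >= min_accuracy]
    return max(eligible, key=lambda c: c.end_to_end_fps, default=None)
```

For instance, given an FP16 variant with 70.0% accuracy at 900 FPS and an INT8 variant with 68.5% at 1100 FPS (made-up values), a target of 69.5% selects the FP16 variant, while a target of 68.0% selects the INT8 variant.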
2 Background
2.1 Hardware Characteristics
Our experiment platform is a custom ASIC deployed in datacenters to accelerate
the inference phase. The general architecture is shown in Fig. 1. This
accelerator is based on a multi-core architecture. It includes four channels
connected via a network on chip (NoC). Each channel contains one DDR memory
controller and eight computational cores. For example, Channel0 contains one
DDR memory controller (DDR0) and eight computational cores, namely C0, C1,
..., C7. DDR memory stores the DNN model as well as the input and output of DL
workloads, while the computational cores execute the DNN computation tasks.
Fig. 1: The architecture of ACC-1
2.2 Software Stack
Fig. 2 shows the software stack of ACC-1. Caffe [18] is an open-source
software framework used for DNN training and inference. It is written in C++
and widely adopted in research experiments and industry deployments. ACC-1
provides Caffe as its high-level programming framework. Application
programmers can simply deploy their applications via ACC-1 Caffe. CNRT is the
runtime toolkit of ACC-1. It provides common low-level utility APIs, such as
device and memory management, kernel launch, and task queue scheduling. CNML
is a wrapper of CNRT. It provides helper functions for DNN model loading and
execution, as well as common highly-tuned DNN operators, e.g., convolution and
pooling operators. The driver and kernel are responsible for memory management
and interrupt handling.
Fig. 2: Software Stacks
2.3 Frequently-used Optimizations
The software stack provides some common utilities for optimization, such as
data parallelism, model parallelism and data pipeline. These approaches are
not mutually exclusive.
Data Parallelism. In the inference phase of DL workloads, data parallelism
means that, given a CNN model, the input data is partitioned and assigned to
different computational cores. As shown in Fig. 3(a), each core holds a
complete copy of the DNN model. Each core simply gets a different part of the
input data, and the results from all cores are combined to produce the final
output. Data parallelism can greatly improve throughput, since different parts
of the input data can be processed concurrently.
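The partition, replicate and combine steps of data parallelism can be sketched in a few lines of Python. This is only an illustration under stated assumptions and does not use the ACC-1 API: worker threads stand in for computational cores, and `toy_model`, `split_batch` and `NUM_CORES` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_CORES = 4  # hypothetical core count, used only for this illustration


def toy_model(x):
    """Stand-in for a full DNN replica: every core runs the same function."""
    return 2 * x + 1


def split_batch(batch, num_parts):
    """Partition the batch into num_parts contiguous, near-equal chunks."""
    size, extra = divmod(len(batch), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(batch[start:end])
        start = end
    return chunks


def run_on_core(chunk):
    """Each core processes only its own slice of the input."""
    return [toy_model(x) for x in chunk]


def data_parallel_infer(batch, num_cores=NUM_CORES):
    chunks = split_batch(batch, num_cores)
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        # pool.map preserves the order of the chunks.
        partial_results = list(pool.map(run_on_core, chunks))
    # Combine step: concatenate the per-core outputs in input order.
    return [y for part in partial_results for y in part]
```

The result is identical to running the model over the whole batch on one core; only the work distribution changes.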
Model Parallelism. As shown in Fig. 3(b), model parallelism means that
different cores are responsible for the computations of different parts of a
single network. For example, each layer in the neural network may be assigned
to a different core. In the DL domain, we can divide a neural network into
several subnets and then place each subnet on a different core of ACC-1. Model
parallelism can also improve throughput, since for a single input, different
parts of the DNN model can be executed concurrently.
Fig. 3: Illustration of data parallelism and model parallelism.
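As a small sketch of model parallelism for a single input, the code below splits the output neurons of one fully connected layer across cores and concatenates the partial outputs. The row-wise partition is one possible way to divide a network, chosen here as an assumption for illustration; it is not necessarily how ACC-1 partitions models, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_rows(weight_rows, x):
    """Compute the outputs of the neurons whose weight rows are given."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


def model_parallel_layer(weights, x, num_cores):
    """Split one layer's output neurons across cores, then concatenate."""
    # Ceiling division; at least one row per shard avoids a zero step below.
    rows_per_core = max(1, -(-len(weights) // num_cores))
    shards = [weights[i:i + rows_per_core]
              for i in range(0, len(weights), rows_per_core)]
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        # Every core sees the same input x but holds a different weight shard.
        parts = list(pool.map(lambda shard: matvec_rows(shard, x), shards))
    return [y for part in parts for y in part]
```

In contrast to data parallelism, no core holds the complete model here: each one stores and evaluates only its own shard of the weights.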
Data Pipeline. In the inference phase of DL workloads, the input data is first
fetched from disk into host memory and then transferred into the device memory
of ACC-1. Finally, it is fed to the computational cores of ACC-1. In this
case, a data pipeline can improve the load balance among data prefetching,
transferring and feeding.
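A data pipeline of this kind is commonly built as a producer and a consumer connected by a bounded queue. The sketch below assumes one host-side loader thread and one device-side thread; the arithmetic inside `loader` and `device` is a placeholder for real preprocessing and inference, and none of the names belong to CNRT or CNML.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the input stream


def loader(items, out_q):
    """Host-side stage: fetch and preprocess inputs, then enqueue them."""
    for item in items:
        out_q.put(item * 2)  # placeholder for decode/resize work
    out_q.put(SENTINEL)


def device(in_q, results):
    """Device-side stage: dequeue inputs in FIFO order and run inference."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item + 1)  # placeholder for the accelerator kernel


def run_pipeline(items, depth=4):
    q = queue.Queue(maxsize=depth)  # bounded queue limits the prefetch depth
    results = []
    producer = threading.Thread(target=loader, args=(items, q))
    consumer = threading.Thread(target=device, args=(q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```

Because the two stages run concurrently, loading the next input overlaps with inference on the current one; the queue depth controls how far the loader may run ahead.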
Weights Pruning and Quantization. As the large number of synaptic weights
incurs intensive computation and memory accesses in the inference phase of DL
workloads, researchers have proposed a number of effective techniques to
exploit the sparsity of DNNs, including weight pruning, model compression and
quantization [29,9,10,16]. ACC-1 tries to exploit the sparsity and
irregularity of DNN models for performance and power efficiency. It provides
tools to prune the weights of an input DNN model and to quantize the DNN
weights into low-precision fixed-point numbers, e.g., INT8. We discuss the
effects of these optimization techniques on performance and accuracy in
Section 3.
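To make the two techniques concrete, the following sketch shows symmetric per-tensor INT8 quantization and magnitude-based pruning in plain Python. These are common textbook variants chosen as assumptions for illustration; the paper does not describe the algorithms inside the ACC-1 tools, so this should not be read as their implementation.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w ~= scale * q, with q in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]


def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    num_pruned = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_pruned]:
        pruned[i] = 0.0
    return pruned
```

Quantization introduces a rounding error of at most half a quantization step per weight, and pruning discards small weights entirely; both are potential sources of the accuracy loss that a throughput-only evaluation does not capture.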
3 Performance Numbers
This section presents our performance numbers. As this work focuses on the
evaluation of accuracy in DL inference, we only analyze the optimizations that
influence accuracy. Specifically, INT8 quantization and weight pruning are our
main evaluation targets. We do not provide a specific analysis of other
optimizations, such as parallelism and data pipeline. In the following
sections, all experiments use the same configuration for these other
optimization techniques. Please note that all of our experiments are
implemented on the customized Caffe provided by the ACC-1 producers.
Optimization details, such as the implementation of quantization or pruning,
are beyond our scope.
Fig. 4: Illustration of data pipeline. Note that to improve the throughput, we
can launch multiple threads to read data from disk into CPU memory and then
dispatch the computational tasks into a queue that will be executed
asynchronously. Tasks within the same queue are executed in FIFO dispatch
order, while tasks in different queues are executed concurrently.
3.1 Metrics
Our metrics are divided into two categories, namely throughput and accuracy.
Throughput Metrics. We adopt hardware FPS and end-to-end FPS to measure
throughput. FPS means frames per second. In our experiments, a frame is an
image.
TOP-1 Accuracy. This metric means that the model answer (the one with the
highest probability) must be exactly the expected answer.
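These metrics can be sketched as follows. The split between the two FPS values is an assumption made for illustration: hardware FPS is taken to count only the time spent in the device call, while end-to-end FPS also includes host-side loading. The callbacks `load_batch` and `run_on_device` are hypothetical placeholders.

```python
import time


def top1_accuracy(probabilities, labels):
    """Fraction of samples whose highest-probability class equals the label."""
    correct = 0
    for probs, label in zip(probabilities, labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


def frames_per_second(num_frames, elapsed_seconds):
    return num_frames / elapsed_seconds


def measure_fps(load_batch, run_on_device, num_batches, batch_size):
    """Return (hardware FPS, end-to-end FPS) for a simple serial loop."""
    device_time = 0.0
    start = time.perf_counter()
    for _ in range(num_batches):
        batch = load_batch()              # host-side loading and preprocessing
        t0 = time.perf_counter()
        run_on_device(batch)              # only this call counts as device time
        device_time += time.perf_counter() - t0
    total_time = time.perf_counter() - start
    frames = num_batches * batch_size
    return (frames_per_second(frames, device_time),
            frames_per_second(frames, total_time))
```

Since the device time is part of the total time, the hardware FPS measured this way is never lower than the end-to-end FPS.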
3.2 Benchmark
Our benchmark is summarized in Table 2. In addition to the models and
datasets, we also list the expected accuracy of the FP16 version. Note that
the accuracy of these pre-trained models is not state-of-the-art and does not
represent the capability of the corresponding models. We mainly use it as a
baseline for comparing the accuracy of the INT8 version.
Our quantization experiments are based on the ImageNet dataset [5] and cover
all of the models listed in Table 2. The weight pruning experiments are
currently performed only with ResNet-50 on the CIFAR10 dataset.
Table 2: The benchmark specification.
Datasets         | Models            | FP16 Accuracy
ImageNet [5]     | AlexNet [21]      | 55.816%
ImageNet         | GoogleNet [26]    | 68.002%
ImageNet         | Inception-V3 [27] | 71.570%
ImageNet         | MobileNet [14]    | 67.140%
ImageNet         | ResNet-18 [11]    | 64.710%
ImageNet         | ResNet-34         | 71.100%
ImageNet         | ResNet-101        | 72.988%
ImageNet         | ResNet-152        | 72.972%
ImageNet         | SqueezeNet [17]   | 57.076%
ImageNet         | VGG16 [24]        | 67.560%
ImageNet         | VGG19             | 70.234%
CIFAR10/ImageNet | ResNet-50         | 73.024%/84.390%
3.3 INT8 Quantization
Accuracy. As shown in Fig. 5, the inference accuracy of all workloads is
reduced. The accuracy loss of AlexNet, ResNet-18 and SqueezeNet is greater
than 0.6%, which cannot be ignored in the DL field. Strikingly, the loss of
GoogleNet and MobileNet reaches 1.388% and 1.254%, respectively. Although the
loss of some workloads (e.g., ResNet-101) is negligible, such workloads
account for only about 60 percent of the benchmark.
Fig. 5: The accuracy loss after INT8 quantization compared to FP16 version.
Throughput. As shown in Fig. 6a, compared with FP16, INT8 quantization can
significantly improve hardware throughput because of its lower memory storage
overhead. However, higher hardware throughput does not mean higher end-to-end
throughput, because the load balance of data feeding between host CPUs and AI
accelerators directly affects the end-to-end throughput. Therefore, as shown
in Fig. 6b, INT8 quantization can even degrade the end-to-end FPS of some DNN
models, e.g., Inception-V3 and SqueezeNet.
(a) The impact of INT8 quantization on
hardware FPS.
(b) The impact of INT8 quantization on
end-to-end FPS.
Fig. 6: The impact of INT8 quantization on throughput.
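One way to see why a hardware speedup may not carry over end to end is a simplified two-stage pipeline model, in which steady-state throughput is set by the slower of host-side data feeding and the accelerator. This model and all FPS values below are assumptions for illustration, not measurements on ACC-1. It accounts for end-to-end gains that saturate; it does not by itself account for the degradation seen for some models.

```python
def end_to_end_fps(host_feed_fps, hardware_fps):
    """Steady-state throughput of a two-stage pipeline is set by its slower stage."""
    return min(host_feed_fps, hardware_fps)


def speedup(before, after):
    return after / before


# Hypothetical numbers chosen only for illustration.
HOST_FEED_FPS = 1000.0
FP16_HW_FPS = 800.0
INT8_HW_FPS = 1600.0

hw_gain = speedup(FP16_HW_FPS, INT8_HW_FPS)
e2e_gain = speedup(end_to_end_fps(HOST_FEED_FPS, FP16_HW_FPS),
                   end_to_end_fps(HOST_FEED_FPS, INT8_HW_FPS))
```

With these made-up values, the hardware throughput doubles while the end-to-end throughput improves by only 25 percent, because data feeding becomes the bottleneck once the accelerator is faster than the host.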
3.4 Weights Pruning
As shown in Fig. 7, weight pruning can significantly affect the end-to-end and
hardware throughput. In Fig. 7b, the hardware FPS increases as the weight
sparsity increases, since higher sparsity means more zeros in the weight data
and thus higher throughput of the accelerator. However, in Fig. 7a, as the
weight sparsity increases, the end-to-end FPS improvement slows down because
of the load imbalance of data feeding between host CPUs and AI accelerators.
(a) The impact of sparsity on accuracy and
End-to-end FPS.
(b) The impact of sparsity on accuracy and
hardware FPS.
Fig. 7: The impact of weight pruning.
4 Future Work
4.1 Workloads Variability
Our experiments are preliminary and limited to CNN models for image
recognition. To cover the most representative AI applications, our benchmark
should include more DL workloads, such as RNNs and the Transformer [28].
4.2 Cross-platform Evaluation
So far, we have only evaluated the optimizations on ACC-1. Cross-platform
evaluation helps find the most suitable platform for the models of interest.
Furthermore, it also helps producers find their design deficiencies.
4.3 The Quality of the Pre-trained Models
Intuitively, the quality of the pre-trained models influences our evaluation
results, especially accuracy, to some extent. However, training
state-of-the-art models is not easy. Even though there are many
state-of-the-art pre-trained models on the internet, porting them to the
customized platforms of various AI accelerators is still engineering-heavy
work.
5 Conclusion
We reveal a pitfall in evaluating the performance of emerging AI accelerators:
a single throughput metric such as OPS cannot comprehensively reflect the
real-world performance of AI accelerators, because accuracy is ignored. On a
typical AI accelerator platform, we quantify the impact of INT8 quantization
and weight pruning, two frequently-used optimizations for improving
throughput, on accuracy. Our results show that INT8 quantization causes
significant accuracy loss in some representative DL inference workloads. We
highlight the importance of end-to-end evaluation in DL. In particular, high
hardware throughput does not mean high end-to-end throughput, due to the
influence of data feeding between host CPUs and AI accelerators.
References
1. Alibaba: Announcing Hanguang 800: Alibaba’s First AI-Inference Chip.
https://www.alibabacloud.com/blog/announcing-hanguang-800-alibabas-first-ai-inference-chip_595482
2. Cambricon: Cambricon MLU100. http://www.cambricon.com/index.php?c=page&id=20
3. Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., Temam, O.: Diannao: A
small-footprint high-throughput accelerator for ubiquitous machine-learning. In:
ACM Sigplan Notices. vol. 49, pp. 269–284. ACM (2014)
4. Coleman, C., Narayanan, D., Kang, D., Zhao, T., Zhang, J., Nardi, L., Bailis,
P., Olukotun, K., Ré, C., Zaharia, M.: Dawnbench: An end-to-end deep learning
benchmark and competition. Training 100(101), 102 (2017)
5. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-
scale hierarchical image database. In: 2009 IEEE conference on computer vision
and pattern recognition. pp. 248–255. Ieee (2009)
6. Du, Z., Fasthuber, R., Chen, T., Ienne, P., Li, L., Luo, T., Feng, X., Chen, Y.,
Temam, O.: Shidiannao: Shifting vision processing closer to the sensor. In: ACM
SIGARCH Computer Architecture News. vol. 43, pp. 92–104. ACM (2015)
7. Google: An in-depth look at Google's first Tensor Processing Unit (TPU).
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
8. Google: What Makes TPU Fine Tuned to Deep Learning.
https://cloud.google.com/blog/products/ai-machine-learning/what-makes-tpus-fine-tuned-for-deep-learning
9. Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M.A., Dally, W.J.: Eie:
Efficient inference engine on compressed deep neural network. International Con-
ference on Computer Architecture (ISCA) (2016)
10. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural net-
works with pruning, trained quantization and huffman coding. International Con-
ference on Learning Representations (ICLR) (2016)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
CoRR abs/1512.03385 (2015), http://arxiv.org/abs/1512.03385
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
13. Hennessy, J.L., Patterson, D.A.: A new golden age for computer architecture. Com-
mun. ACM 62(2), 48–60 (2019)
14. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T.,
Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for
mobile vision applications. CoRR abs/1704.04861 (2017),
http://arxiv.org/abs/1704.04861
15. Huawei: Atlas 300 AI Accelerator Card.
https://e.huawei.com/en/products/cloud-computing-dc/atlas/atlas-300-ai
16. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized
neural networks: Training neural networks with low precision weights and
activations. J. Mach. Learn. Res. 18(1), 6869–6898 (Jan 2017),
http://dl.acm.org/citation.cfm?id=3122009.3242044
17. Iandola, F.N., Moskewicz, M.W., Ashraf, K., Han, S., Dally, W.J., Keutzer, K.:
Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model
size. CoRR abs/1602.07360 (2016), http://arxiv.org/abs/1602.07360
18. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadar-
rama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding.
arXiv preprint arXiv:1408.5093 (2014)
19. Jouppi, N.P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S.,
Bhatia, S., Boden, N., Borchers, A., et al.: In-datacenter performance analysis of a
tensor processing unit. In: 2017 ACM/IEEE 44th Annual International Symposium
on Computer Architecture (ISCA). pp. 1–12. IEEE (2017)
20. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: Advances in neural information processing systems.
pp. 1097–1105 (2012)
21. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep
convolutional neural networks. In: Proceedings of the 25th International
Conference on Neural Information Processing Systems - Volume 1. pp. 1097–1105.
NIPS’12, Curran Associates Inc., USA (2012),
http://dl.acm.org/citation.cfm?id=2999134.2999257
22. Liu, S., Du, Z., Tao, J., Han, D., Luo, T., Xie, Y., Chen, Y., Chen, T.: Cambricon:
An instruction set architecture for neural networks. In: ACM SIGARCH Computer
Architecture News. vol. 44, pp. 393–405. IEEE Press (2016)
23. Mattson, P., Cheng, C., Coleman, C., Diamos, G., Micikevicius, P., Patterson, D.,
Tang, H., Wei, G.Y., Bailis, P., Bittorf, V., et al.: Mlperf training benchmark.
arXiv preprint arXiv:1910.01500 (2019)
24. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition (2014), https://arxiv.org/abs/1409.1556
25. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
networks. In: Advances in neural information processing systems. pp. 3104–3112
(2014)
26. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Er-
han, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. CoRR
abs/1409.4842 (2014), http://arxiv.org/abs/1409.4842
27. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the
inception architecture for computer vision. CoRR abs/1512.00567 (2015),
http://arxiv.org/abs/1512.00567
28. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural
information processing systems. pp. 5998–6008 (2017)
29. Zhang, S., Du, Z., Zhang, L., Lan, H., Liu, S., Li, L., Guo, Q., Chen, T.,
Chen, Y.: Cambricon-x: An accelerator for sparse neural networks. In: The 49th
Annual IEEE/ACM International Symposium on Microarchitecture. pp. 20:1–20:12.
MICRO-49, IEEE Press, Piscataway, NJ, USA (2016),
http://dl.acm.org/citation.cfm?id=3195638.3195662
