In recent years, advances in deep learning have resulted in unprecedented leaps in diverse tasks spanning from speech and object recognition to context awareness and health monitoring. As a result, an increasing number of AI-enabled applications are being developed targeting ubiquitous and mobile devices. While deep neural networks (DNNs) are getting bigger and more complex, they also impose a heavy computational and energy burden on the host devices, which has led to the integration of various specialized processors in commodity devices. Given the broad range of competing DNN architectures and the heterogeneity of the target hardware, there is an emerging need to understand the compatibility between DNN-platform pairs and the expected performance benefits on each platform. This work attempts to demystify this landscape by systematically evaluating a collection of state-of-the-art DNNs on a wide variety of commodity devices. In this respect, we identify potential bottlenecks in each architecture and provide important guidelines that can assist the community in the co-design of more efficient DNNs and accelerators.
INTRODUCTION
With demonstrated state-of-the-art accuracy in a wide range of AI tasks, the popularity of deep neural networks (DNNs) is on the rise. Since 2012 and the introduction of AlexNet [17], a myriad of models have been competing for improved predictive power (Table 1). Nevertheless, accuracy gains have often been achieved at the expense of an increase in model complexity, inference time and resource requirements. With DNN models becoming ubiquitous across multiple scenarios and compute devices, from large-scale cloud services [7] to resource-constrained mobile systems [18], predicting the processing performance of each DNN becomes a challenging task. Given the wide range of competing DNN architectures and the heterogeneity of the target hardware, there is a growing need to gain insights into how and why different design decisions impact the accuracy and performance of these networks upon deployment. EmBench aims to provide a comprehensive analysis of widely used deep neural networks, with a focus on evaluating which models thrive on which target platforms, while identifying their bottlenecks and sources of inefficiency. To this end, we analyze a set of popular DNNs (Table 1) targeting various compute platforms (Table 2). First, we perform a macro analysis of these networks in terms of their complexity, inference latency and throughput for multiple batch sizes. Then, we perform a deeper analysis of the most prominent network operations, with a focus on detecting nontrivial differences in execution across the target platforms.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. EMDL '19, Seoul
More specifically, we provide the following contributions:
• We demonstrate that different networks are handled quite differently by each target platform, making it challenging to design an efficient model in a hardware-agnostic manner.
• We provide insights about how different batch sizes can affect performance on five different hardware architectures.
• We analyze the inference latency of all networks and show that the trade-off between actual processing speed and accuracy depends on the underlying hardware and its optimizations.
• We break down the overall DNN workload into individual operations and unveil opportunities for further improvements on each platform.
RELATED WORK
So far, a few studies have focused on analyzing the system-level properties of DNNs on deployment platforms. Canziani et al. [4] presented a system-level analysis of 14 convolutional neural networks (CNNs) on the NVIDIA Jetson TX1 platform. Although the analysis spanned multiple metrics, the study was conducted over a limited number of networks and targeted a single platform. Bianco et al. [2] extended the covered space by evaluating a wider range of networks and targeting one embedded (Jetson TX1) and one high-end compute platform (NVIDIA Titan X GPU). Both studies conducted an analysis of the selected networks across multiple dimensions, including accuracy, compute speed, memory footprint and power consumption. Nevertheless, with a total of only two platforms, and given the heterogeneity of currently available devices, the presented insights are not directly transferable to platforms with different characteristics. In this work, we expand to a broad range of both high- and low-end devices, spanning from the latest server-grade RTX 2080 Ti GPU and the embedded NVIDIA Jetson AGX down to the mobile-ready Qualcomm Kryo 385 CPU and the low-power Intel Neural Compute Stick 2. Furthermore, we present a microscopic view of how well different layer types are mapped to each hardware architecture, aiming to provide insights for the hardware-aware design of novel DNNs.
In a slightly different setting, Huang et al. [13] concentrated on the task of object detection and evaluated a wide set of CNN-based object detectors in terms of processing performance and detection accuracy. With a focus on the mobile space, Ignatov et al. [15] assembled a benchmark suite of representative AI tasks to assess the processing capabilities of currently available smartphones. In this paper, we adopt a wider scope than [13] by treating network architectures in a task-agnostic manner and target more diverse families of devices compared to [15].
Last, Zhang et al. [28] study the key performance differences among machine learning frameworks across different edge platforms. We treat the framework as an invariant and focus our efforts on the inference behaviour of the devices at hand for a significantly greater variety of models.
HARDWARE PLATFORMS
The large compute and memory demands of modern DNN workloads have led to the emergence of specialized processors that aim to facilitate their high-performance deployment. Depending on the target scenario, each platform has employed different hardware optimizations to satisfy system-level constraints, including latency, throughput, temperature, power dissipation and form factor. In desktop and server environments, high-end devices are typically employed in order to maximize throughput at the penalty of substantial power consumption. In this context, powerful and massively parallel, but power-costly, platforms have been designed. A representative example is the latest NVIDIA GeForce RTX 2080 Ti GPU, which is based on the NVIDIA Turing architecture. By introducing the 2nd generation of Tensor Cores, a set of specialized hardware units tailored for DNN processing, this particular GPU provides hardware support for 16-bit floating-point as well as 8- and 4-bit fixed-point precision and enables the highly optimized execution of matrix operations with mixed-precision arithmetic.
With a large bandwidth of 616 GB/s to an 11 GB off-chip RAM, the RTX 2080 Ti GPU has been optimized for high throughput, reaching its peak performance when processing inputs in batches and yielding up to 13.4 TFLOP/s (FP32) at the cost of a 250-watt peak power.
On the other end of the spectrum, severely power-constrained systems such as IoT and mobile devices are equipped with processing units that comply with the respective thermal and form-factor constraints. In this space, low-power miniaturized SoCs, such as the Qualcomm Snapdragon™ 845 (SDM845), have been explicitly designed to provide the processing support for on-device DNNs while respecting the typical <10 watts power budget of modern consumer and robotic systems. Further towards customization, neural accelerators offer sub-watt solutions for rapid DNN inference in severely constrained embedded systems. To this end, full-custom chips have been designed with the goal of extracting the highest possible performance at minimal power dissipation. A representative and widely available instance is the Intel Neural Compute Stick (NCS) 2 mounting the Movidius Myriad X accelerator, delivering up to 1 TOP/s under 1 watt.
Targeting a mid-range power envelope of a few tens of watts, which is typical in complex embedded systems such as driverless cars and robots, several devices have been designed to reach a configurable trade-off between power consumption and processing speed. A representative example is the latest NVIDIA Jetson Xavier SoC. Jetson Xavier hosts a 512-core NVIDIA Volta GPU with 64 1st generation Tensor Cores. To support different power budgets while sustaining a high performance, the platform can be configured with a range of frequencies up to 1.3 GHz at a peak TDP of 30 watts. Given the diversity of DNN-enabled applications, the variability across deployment scenarios, network architectures and hardware platforms leads to an emerging need for evaluating the compatibility between different network-platform pairs. In this respect, Section 4 examines the efficiency of mapping a comprehensive range of networks (Table 1) on each platform of Table 2, with the potential to guide both novel network and hardware designs.
EVALUATION
In this section, we analyze the data that we have collected by benchmarking the set of networks from Table 1 targeting various commodity compute platforms (Table 2). Each configuration comprises a choice of i) DNN, ii) batch size and iii) target platform. For each configuration, we initially perform a macroscopic analysis with respect to complexity, measured in number of floating-point operations, and achieved latency on the selected device. We further analyze the latency of each operation type within the network to identify processing bottlenecks and compare the computational efficiency across the target devices.
Benchmarking procedure. Following the effort of the Open Neural Network Exchange (ONNX) format to provide a uniform interface among deep-learning frameworks, we adopted Facebook's machine-learning toolchain (PyTorch, Caffe2, FAI-PEP) for the majority of our experiments due to its imperative interface and support for ONNX. First, the pretrained versions of all DNNs were collected in PyTorch format. Specifically, for the experiments on a workstation hosting the Xeon 4116 CPU and RTX 2080 Ti GPU, and on the NVIDIA Jetson Xavier SoC, PyTorch (https://pytorch.org/) v1.0.0 with CUDA 10 and cuDNN 7.5 is used directly. The two platforms run GNU/Linux Ubuntu 18.04 LTS, compiled for x86-64 and 64-bit ARM, respectively. For the Xavier SoC, we set the GPU frequency to its maximum setting to obtain the peak performance. On the mobile side, the evaluated networks were converted to Caffe2 to run on SDM845's Kryo CPU, running Android 9 (Pie). The Caffe2 backend for mobile platforms was configured to employ the highly optimized implementations of the NNPACK package. To systematically measure on-device performance, we employ Facebook's AI Performance Evaluation Platform (FAI-PEP). Finally, when targeting Intel NCS 2, the networks were converted to ONNX and subsequently to the Intel Movidius graph file format through the Intel OpenVINO™ toolkit. To time the execution of DNNs on NCS 2, we use the program counters of the Myriad X chip. Across all platforms, each experiment is run 50 times (with ten warm-up runs for a uniform initial cache state and a cool-off period of 2 seconds between runs to avoid frequency scaling due to heat) and the minimum latency achieved is reported.
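The timing protocol described above can be illustrated with a minimal Python sketch. It is not the FAI-PEP implementation: the function name `benchmark`, its parameters and the callable `run_inference` are hypothetical, and the sketch assumes that the ten warm-up runs are performed in addition to the 50 timed runs, which the text does not state explicitly.

```python
import time
from typing import Callable, List


def benchmark(run_inference: Callable[[], None],
              warmup_runs: int = 10,
              timed_runs: int = 50,
              cooloff_s: float = 2.0) -> float:
    """Return the minimum latency (in seconds) over the timed runs.

    `run_inference` is a hypothetical callable that executes one forward
    pass and blocks until it has finished. On asynchronous backends such
    as CUDA, the caller would need to synchronize inside it.
    """
    # Warm-up runs: bring caches and lazy initializations to a steady state.
    for _ in range(warmup_runs):
        run_inference()

    latencies: List[float] = []
    for i in range(timed_runs):
        start = time.perf_counter()
        run_inference()
        latencies.append(time.perf_counter() - start)
        # Cool-off between runs to limit thermally induced frequency scaling.
        if i < timed_runs - 1:
            time.sleep(cooloff_s)

    # The minimum is reported, as in the procedure described above.
    return min(latencies)


if __name__ == "__main__":
    # Stand-in workload; the cool-off is disabled so the demo finishes quickly.
    best = benchmark(lambda: sum(range(10_000)), cooloff_s=0.0)
    print(f"minimum latency: {best * 1e3:.3f} ms")
```

Reporting the minimum rather than the mean favours the least-disturbed run, which is one common way to reduce the influence of background activity on the measurement.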
FLOPs analysis. Popular DNNs tend to vary quite a lot in terms of their complexity. As seen in Table 1, some networks require up to 3 orders of magnitude more floating-point operations (FLOPs) to perform image recognition over the same dataset (e.g., ImageNet [10]), often with little to no improvement in accuracy, such as in the case of MobileNetV2 vs. VGG16.
We first investigate whether FLOPs are a good indicator of inference time. Figure 1 shows how the actual inference time on the 2080 Ti GPU varies with respect to FLOPs. The amount of FLOPs on the x-axis is the network's FLOPs times the batch size. We used powers of two for the batch sizes up to 512. We can see that for networks with similar FLOPs, the actual GPU time can vary by at least one order of magnitude. Nonetheless, for the same network, the GPU time grows slowly with the batch size up to a point where time growth becomes almost exponential. At this point, the GPU resources are fully utilized and the cost of further batching is no longer amortized.
Impact of batch size. Figure 2 depicts the effect of batch size on the achieved throughput across models and platforms. On one side of the spectrum, given the memory-abundant workstation setup, the highest performing batch sizes for the high-end 2080 Ti GPU typically lie between 128 and 256. At this point, the GPU reaches the peak utilization of its resources and, consequently, after that the overhead of further batching is no longer amortized. In this case, the GPU is able to exploit the intra-batch parallelism of large batches and boost the achieved throughput, while not being limited by the available RAM. On the other hand, although similarly to the high-end GPU there are no significant memory constraints, the Xeon 4116 CPU rarely improves its throughput after a batch size of 32, where typically its resources, which are fewer than those of the resource-rich GPU, are already fully utilized.
On the other side of the spectrum, large batches tend to exhaust the memory-limited devices. In this respect, the mobile Kryo 385 CPU (CPU-SDM845 in Figure 2) shows modest throughput improvement for increasing batch size due to both the reduced memory bandwidth and amount of compute resources compared to its high-end counterparts, and its maximum batch size is typically up to 32 due to the small off-chip memory capacity. Note that using NNPACK's CPU-optimized convolution layers (CPU-SDM845-NNPACK in Figure 2) improves the number of inferences per second by 2.97× on average and up to 4.53× for densenet121.
Moreover, NCS 2 outperforms its mobile counterpart across the evaluated networks, despite the fact that it was designed to operate with a batch size of 1. Finally, the embedded Tegra Xavier GPU demonstrates a similar pattern to RTX 2080 Ti by being able to exploit batch processing to improve its throughput, although its median optimal batch size is around 64. The highest throughput is observed at the point where its resources are fully utilized, after which further batching degrades the achieved throughput due to excessive uncompensated overhead.
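As an illustration of how a throughput-optimal batch size can be read off a set of latency measurements, the following Python sketch converts per-batch latencies into inferences per second and selects the peak. The function names and the latency values are hypothetical placeholders chosen only to show the shape of the computation; they are not measurements from this study.

```python
from typing import Dict, Tuple


def throughput_by_batch(latency_s: Dict[int, float]) -> Dict[int, float]:
    """Map each batch size to throughput in inferences per second."""
    return {batch: batch / latency for batch, latency in latency_s.items()}


def best_batch_size(latency_s: Dict[int, float]) -> Tuple[int, float]:
    """Return the batch size with the highest throughput and that throughput."""
    throughput = throughput_by_batch(latency_s)
    best = max(throughput, key=throughput.get)
    return best, throughput[best]


# Hypothetical latencies (seconds per batch), NOT taken from the paper.
# Latency grows slowly at first and faster once the device saturates.
example_latency_s = {1: 0.002, 32: 0.016, 64: 0.028,
                     128: 0.050, 256: 0.110, 512: 0.240}

batch, inferences_per_s = best_batch_size(example_latency_s)
print(f"best batch size: {batch}, throughput: {inferences_per_s:.0f} inf/s")
```

In this made-up example, throughput rises until the batch size of 128 and falls afterwards, mirroring the pattern in which further batching stops being amortized once the device is fully utilized.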
Accuracy vs. time. Figure 3 shows the accuracy and achieved latency of selected representative networks as measured across the evaluated platforms. To investigate the trade-off between accuracy and achieved processing speed, we analyze the inference latency (i.e., batch size of 1) against the accuracy of these networks. The results (Figure 3) demonstrate that there are significant differences among the evaluated hardware platforms.
In the workstation setting, we observe that the server-grade CPU is sub-optimal when it comes to supporting very deep networks (e.g., the ResNet family) or those with a large number of FLOPs (e.g., VGG). As a result, none of the state-of-the-art networks could sustain more than 15 FPS. Instead, the RTX 2080 Ti achieves a throughput improvement over the Xeon 4116 CPU in the range 7×-50× with an average of 15× across the networks. In particular, networks such as ResNet yield a speedup of ≈ 14×, whereas simpler networks such as AlexNet can handle an impressive 680 FPS. Similarly, with respect to power efficiency, RTX 2080 Ti yields an average improvement of 5.3× in inferences/s/W over the Xeon CPU.
In the <30 watts range, the Tegra Xavier GPU manages to outperform the Xeon CPU with raw speedups between 2×-15× (5.1× average) and yields an average power efficiency gain of 14.4× across the networks. Compared to the RTX 2080 Ti GPU, although the Tegra Xavier GPU reaches between 26-37% (34% average) of its raw performance, it achieves a 2.8× average improvement in power efficiency, demonstrating its suitability for applications where inferences/s/W is the primary metric of interest.
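The relation between raw performance and power efficiency can be sketched as a short calculation. The Python snippet below assumes, purely for illustration, that each device draws its quoted peak power (250 W for the RTX 2080 Ti and 30 W for the Xavier SoC, as given in Section 3); the text does not state how power was actually measured, and the function name is hypothetical.

```python
def relative_power_efficiency(speed_ratio: float,
                              power_a_w: float,
                              power_b_w: float) -> float:
    """Efficiency (inferences/s/W) of device A relative to device B.

    speed_ratio: throughput of A divided by throughput of B.
    power_a_w, power_b_w: power draw of A and B in watts.
    """
    return speed_ratio * power_b_w / power_a_w


# Assumption: both devices run at their quoted peak power, which is a
# simplification and not necessarily the measurement setup of the paper.
xavier_vs_rtx = relative_power_efficiency(speed_ratio=0.34,
                                          power_a_w=30.0,
                                          power_b_w=250.0)
print(round(xavier_vs_rtx, 2))  # 2.83
```

Under this peak-power assumption, reaching 34% of the raw performance at 30 W instead of 250 W gives roughly a 2.8× gain in inferences/s/W, which is in line with the average reported above.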
Due to its resource constraints, the Kryo mobile CPU exhibits substantially lower raw performance when compared to the server-grade CPU and GPU platforms; NCS 2 can handle up to 23 FPS on average whereas the Kryo processor achieves less than 6 FPS (35× slower than the RTX 2080 Ti GPU). Despite the expected degradation of the mobile CPU in terms of raw throughput, NCS 2 achieves a power efficiency gain in the range 3×-88× (41.6× average) over the power-hungry RTX 2080 Ti GPU. In this respect, the NCS 2 platform extracts the maximum performance out of its 1-watt TDP and constitutes a powerful candidate device for performing DNN inference in very-low-power settings.
Interestingly, by observing the accuracy-latency Pareto fronts in Figure 3, mobile devices demonstrate substantially different patterns compared to the server-grade experiments. First, deeper networks with a large number of operations suffer from a big penalty in raw performance. For instance, ResNet on the mobile CPU results in a minimal throughput of 0.2 inferences per second, which is 250× slower than the RTX 2080 Ti GPU. At the same time, networks that are optimized for mobile devices do significantly better: MobileNetV2 is 6.4× faster on NCS 2 and 16.3× faster on the Kryo processor compared to ResNet. Despite the accuracy penalty, the design decision of using depthwise separable convolutions seems to offer a considerable performance benefit in the mobile space. At the same level of accuracy, VGG16 is uniformly slower across devices by a considerable margin (ranging from 2× to 6×). Finally, ShuffleNetV2 is a notable example that achieved significantly improved performance on the Kryo CPU when compared to server-grade runs, demonstrating the benefits of pointwise group convolution and channel shuffling for resource-constrained devices.
Overall, from a high-level view, the Pareto fronts of low-power devices (i.e., sub-figures on the second row of Figure 3) comprise different networks compared to their more power-consuming counterparts. The Pareto front of the Xeon 4116 CPU comprises AlexNet, MobileNetV2 and ResNet152, while the RTX 2080 Ti and Tegra Xavier GPUs contain AlexNet, VGG16 and ResNet152. On the other hand, NCS 2 also includes DenseNet201 on its Pareto front, with VGG16 being excluded due to its very latency-expensive NCS 2 mapping, while ShuffleNetV2 and NASNet-mobile also appear as Pareto-optimal networks for the CPU of SDM845. As a result, the direct use of hardware-agnostic metrics, such as number of FLOPs, can often be misleading and not accurately indicate how efficiently a DNN is mapped on a particular platform.
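For readers less familiar with the notion, an accuracy-latency Pareto front can be computed as in the following Python sketch. A network is kept if no other network is at least as fast and at least as accurate while being strictly better in one of the two. The function name, the network labels and the numbers are hypothetical placeholders, not results from Figure 3.

```python
from typing import Dict, List, Tuple


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of non-dominated networks, sorted by latency.

    points maps a network name to (latency_ms, accuracy);
    lower latency and higher accuracy are better.
    """
    front = []
    for name, (lat, acc) in points.items():
        dominated = any(
            (o_lat <= lat and o_acc >= acc) and (o_lat < lat or o_acc > acc)
            for other, (o_lat, o_acc) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: points[n][0])


# Hypothetical (latency in ms, top-1 accuracy) pairs for one device.
example_points = {
    "net_a": (2.0, 0.57),
    "net_b": (5.0, 0.72),
    "net_c": (9.0, 0.70),   # slower and less accurate than net_b
    "net_d": (20.0, 0.78),
    "net_e": (25.0, 0.76),  # slower and less accurate than net_d
}

print(pareto_front(example_points))  # ['net_a', 'net_b', 'net_d']
```

Because latency is device-specific, running the same procedure with latencies from another platform can yield a different front even though the accuracies are unchanged, which is the effect discussed above.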
Per-layer analysis. To further investigate why some networks are mapped more efficiently on certain platforms, we also look into the breakdown of inference time within each operation (Figure 4). As already demonstrated in the literature, the majority of time is spent on convolution operations, ranging from 65% of the time on the desktop GPU to 89% of the time on mobile processors. This difference further demonstrates the need for optimizing these operations on mobile platforms. For NVIDIA GPU accelerators, the second most time-consuming operation for this workload was Batch Normalization (19% and 15% of time on the RTX 2080 Ti and Xavier GPUs, respectively), whereas for the Xeon CPU Max Pool becomes substantial by occupying 10% of the time. Finally, on the Kryo mobile processor 10% of the time is spent on fully-connected layers. AlexNet fully reveals that fully-connected layers are not a good fit for the characteristics of mobile platforms, taking more than 70% of the computation time compared to below 30% on server-grade GPUs. The primary factor behind the slow execution of fully-connected layers is the limited off-chip memory bandwidth of existing mobile platforms, which determines the processing speed of the inherently memory-bound operations of fully-connected layers.
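The memory-bound nature of fully-connected layers can be made concrete with a rough arithmetic-intensity estimate (FLOPs per byte moved from off-chip memory). The Python sketch below is a simplified model introduced here for illustration: it assumes FP32 values, ignores biases, and assumes every weight and activation is transferred exactly once. The function names are hypothetical, and the convolution shape is an arbitrary example.

```python
BYTES_PER_VALUE = 4  # assumption: FP32 weights and activations


def fc_intensity(in_features: int, out_features: int, batch: int = 1) -> float:
    """FLOPs per byte for a fully-connected layer (biases ignored)."""
    flops = 2 * batch * in_features * out_features  # multiply + add
    values_moved = (in_features * out_features          # weights
                    + batch * (in_features + out_features))  # activations
    return flops / (values_moved * BYTES_PER_VALUE)


def conv_intensity(c_in: int, c_out: int, kernel: int,
                   out_h: int, out_w: int, batch: int = 1) -> float:
    """FLOPs per byte for a 2D convolution (biases ignored).

    Assumes stride 1 with 'same' padding, so the input and output
    feature maps have the same spatial size.
    """
    flops = 2 * batch * kernel * kernel * c_in * c_out * out_h * out_w
    weights = kernel * kernel * c_in * c_out
    inputs = batch * c_in * out_h * out_w
    outputs = batch * c_out * out_h * out_w
    return flops / ((weights + inputs + outputs) * BYTES_PER_VALUE)


# A layer sized like AlexNet's first fully-connected layer (9216 -> 4096).
print(f"FC, batch 1:    {fc_intensity(9216, 4096):.2f} FLOPs/byte")
# An arbitrary 3x3 convolution with 256 input and output channels.
print(f"Conv 3x3:       {conv_intensity(256, 256, 3, 13, 13):.2f} FLOPs/byte")
# Batching reuses the FC weights across inputs and raises the intensity.
print(f"FC, batch 64:   {fc_intensity(9216, 4096, batch=64):.2f} FLOPs/byte")
```

Under these assumptions, the fully-connected layer at a batch size of 1 performs only about half a FLOP per byte fetched, whereas the convolution reuses each weight across all spatial positions. This is consistent with the observation above that off-chip bandwidth, rather than compute, limits fully-connected layers on mobile platforms.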
DISCUSSION
The analysis of Section 4 uncovers a number of notable insights.
Performance variability: We note that the performance of each network varies substantially across platforms depending on the network architecture and the type of operations used. Therefore, an interesting research direction would be to design tools that can automatically select when to use and fuse together these building blocks, depending on the hardware architecture of the target device as well as the latency and throughput requirements [1, 3, 9] .
Mobile-specific optimizations: Our results further demonstrate the importance of mobile-specific operations such as pointwise and depthwise separable convolutions. Furthermore, our benchmarks show that mobile devices are inefficient in handling larger models due to their reduced memory capabilities. Therefore, while compression and quantization techniques might not result in a big performance gain in desktop environments, they do make a big difference on mobile and embedded devices, both compute- and memory-wise. Towards this direction, binary networks [8, 22] seem to offer a promising alternative for maximal compression, but currently require specialized hardware support to exploit the speedup potential. Finally, it is possible to dynamically offload computation from the device to the edge or cloud, in order to facilitate computation and minimize latency [5].
Importance of hardware support: Most of the examined mobile devices either do not come with automated libraries for targeting their accelerator backends (GPU, DSP, NPU), or provide limited support for DNN operators. Our results demonstrate that most of the time is consumed on operations such as convolutions and fully-connected layers. Instead of being limited to the CPUs, software optimizations and support for exploiting the available hardware accelerators of target mobile and embedded chipsets should be prioritized in order to accelerate these common operations in a transparent and homogeneous way [27].
Batch size: While most devices are optimized for larger batch sizes, most real-time mobile applications require smaller batch sizes to minimize latency. We observe that across all platforms the hardware is not fully utilized for smaller batch sizes. One possible direction is to study how multiple networks with small batch sizes can be optimally collocated to overprovision these resources and thus maximize their utilization [16, 21] .
CONCLUSIONS
In this work, we attempted to shed some light on the performance of DNN inference by analyzing more than 25 networks on a wide variety of commodity devices. Our results provide useful insights about the performance and suitability of these models. Furthermore, we identified model design choices that work better on each platform and we uncovered a number of performance bottlenecks. We believe that these results can help the research community in three ways: i) to identify the best possible building blocks when designing models for these platforms, ii) to shed light on the capabilities of these devices and iii) to provide insights about possible future hardware and software optimizations.
