Keywords: convolutional neural networks, deep learning, edge inference, embedded vision, hardware performance, software frameworks
I. Introduction
Deep Learning (DL) [1] has emerged as the reference paradigm for applications demanding accurate inference. In computer vision in particular, Convolutional Neural Networks (CNNs) are employed for multiple tasks, ranging from image recognition to pixel-wise segmentation. This versatility, along with much higher accuracy than classical vision approaches, comes at the cost of notably increased requirements for computational and memory resources [2]. This constitutes a major challenge for the implementation of CNNs in embedded systems [3].
The relevance gained by the DL paradigm in the last few years has driven the development of several software frameworks for prototyping and practical deployment of CNNs. While globally targeting the same functionality, each of these frameworks follows a particular approach and exploits specific libraries to deal with the massive computational load demanded by deep neural networks. For instance, matrix-matrix or matrix-vector operations, which are the backbone of CNNs, can be realized by Basic Linear Algebra Subprograms (BLAS) [4] [5] available in a number of libraries: ATLAS [6], MKL [7], OpenBLAS [8] [9], Eigen [10], cuBLAS [11], etc. This diversity of strategies and tools results in remarkably different inference performance across DL frameworks, even when they run the same CNN model on a common hardware platform. Direct modelling of this heterogeneous performance is unmanageable due to the complexity of the frameworks and their rapid evolution.
In this context, research on CNNs usually focuses on a straightforward assessment. For instance, some works report a throughput comparison of various DNN frameworks [12]-[14], including Caffe, TensorFlow, Torch, CNTK, MXNet, etc. Fewer contributions focus on embedded platforms and evaluate the efficiency of DNN software tools for computer vision at the edge [15]-[18]. All these benchmarks extract direct metrics from CNN inference. Although more customized and specific CNN implementations on CPU-based embedded systems have been reported [19] [20], a generalized study should include popular DL frameworks that can operate on a wide range of embedded devices.
In this paper we explain the performance of embedded DL inference indirectly, through metrics of hardware exploitation that can be easily measured, as an alternative to the direct approach usually followed. The analysis is carried out on a Raspberry Pi 3 Model B [21] (RPi), an inexpensive embedded computer featuring a 4-core ARM Cortex-A53 CPU. We first report the performance achieved by four popular DNN frameworks in terms of throughput and CPU utilization when performing 1000-category image classification. We then correlate these performance figures with hardware events registered during inference, pointing out the critical aspects at both software and hardware level that affect each other. Finally, we show that some of the registered hardware events exhibit a strong correlation with power consumption as well. Overall, we are setting the foundations for the next step in our research, namely the development of a methodology for simple empirical modelling of DL inference on CPU-based embedded platforms.
II. Performance Analysis
A. Hardware Platform
All the experiments reported in this paper refer to the quad-core ARM Cortex-A53 1.2 GHz 64-bit CPU [22] [23] included in the Broadcom BCM2837 System-on-a-Chip (SoC) of the Raspberry Pi 3 Model B. Each core of this CPU is in turn an ARMv8-A processor capable of independently executing instructions.
Cortex-A53 processors feature two memory systems, namely Level 1 (L1) and Level 2 (L2). The L1 memory system includes, per core, separate instruction and data caches (I-cache, D-cache) and a Memory Management Unit (MMU). The MMU in turn features one Translation Lookaside Buffer (TLB), a two-level cache for instructions and data that translates between virtual and physical addresses. Instruction caching and dynamic branch prediction are also supported in order to increase overall performance and reduce power consumption. The L2 memory system contains a unified cache, which is shared between the cores. Specifically, for the SoC of the RPi, L1 amounts to 32 KB whereas L2 comprises 512 KB.
Each ARMv8-A core implements the so-called Advanced Single Instruction Multiple Data architecture (commonly referred to as ARM NEON technology) as well as vector floating-point (VFP) operations for acceleration [24]. In addition to the SoC, the Raspberry Pi features 1 GB of LPDDR2 RAM at 900 MHz, where we load the CNN weights and keep intermediate results while running the networks. The non-volatile storage capacity of the system is provided by an attached micro-SD card. The operating system is Raspbian [25].
B. Software Frameworks
Caffe [26] implements convolutions as an image-to-column transformation (im2col) followed by General Matrix-Matrix Multiplication (GEMM), using Basic Linear Algebra Subprograms (BLAS) as the back-end for GEMM. According to our tests (not reported in this paper), OpenBLAS [9] is the BLAS library, among those supported by the RPi CPU and compatible with Caffe, that best leverages the four cores of the ARM Cortex-A53.
TensorFlow [27] expresses computations as static graphs, which are built just once and run repeatedly for inference. It makes use of the Eigen library [10] to generate efficient parallel code for multicore CPUs. We installed pre-built TensorFlow 1.3.0 for RPi [28]. This version exploits ARM hardware optimizations (NEON and VFP) for computational acceleration.
OpenCV [29] implements a module that allows the use of pre-trained models from other frameworks for inference. We took CNN model files from Caffe. OpenCV version 3.3.1 was compiled to exploit both ARM NEON and VFP optimizations as well.
Caffe2 [30] is designed to be lightweight, modular and mobile-oriented. It also uses static graphs for network definition and the Eigen library for matrix calculation. Caffe2 is optimized for ARM CPUs with NEON.
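The im2col plus GEMM scheme mentioned for Caffe can be illustrated with a minimal NumPy sketch. It is not Caffe's actual code: the function names `im2col` and `conv2d_gemm` are hypothetical, and the sketch assumes stride 1, no padding and a single input image.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix.

    Assumes stride 1 and no padding (illustrative simplification).
    """
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # Each row holds one kernel tap (ci, i, j) at every output position.
                cols[row] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols, out_h, out_w

def conv2d_gemm(x, weights):
    """Convolution layer (cross-correlation) as im2col followed by one GEMM."""
    m, c, kh, kw = weights.shape
    cols, out_h, out_w = im2col(x, kh, kw)
    w_mat = weights.reshape(m, c * kh * kw)  # one filter per row
    out = w_mat @ cols                       # the matrix-matrix product (GEMM)
    return out.reshape(m, out_h, out_w)
```

The unfolded matrix replicates each input value up to kh*kw times, which is one way to see why this formulation trades extra memory traffic for a single large matrix product.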
C. Inference Performance
One of the consequences of the high computational demand of CNN models is that the temperature of the RPi's ARM Cortex-A53 SoC can rapidly increase during inference. This forces the CPU frequency, and thereby the throughput, to go down. To take this aspect into account, we measured the following four performance metrics after each processed image over a long period (6 minutes) of continuous inference:
• Throughput. It is calculated as the inverse of the total time required per image when the batch size is set to 1. This includes the time required to read and pre-process the image, perform the inference, extract the metrics and save the results.
• CPU utilization. It was measured with the Python psutil library.
• CPU frequency and temperature. We used the vcgencmd tool to check the variations of the SoC's temperature and frequency caused by the CNN-based inference. Although the ARM Cortex-A53 CPU ideally operates at a maximum frequency of 1.2 GHz, the chip temperature can alter the instantaneous CPU frequency.
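The per-image measurements listed above could be collected along the following lines. This is a sketch under stated assumptions, not the script used in the paper: `process_image` is a hypothetical callable standing for the whole per-image pipeline, and the helper names are illustrative. The parsers assume the usual `vcgencmd measure_temp` and `vcgencmd measure_clock arm` output formats.

```python
import re
import subprocess
import time

def parse_temp(text):
    """Parse `vcgencmd measure_temp` output, e.g. "temp=48.3'C", into degrees Celsius."""
    match = re.search(r"temp=([0-9.]+)", text)
    if match is None:
        raise ValueError(f"unexpected output: {text!r}")
    return float(match.group(1))

def parse_clock(text):
    """Parse `vcgencmd measure_clock arm` output, e.g. "frequency(45)=1200000000", into Hz."""
    match = re.search(r"=(\d+)", text)
    if match is None:
        raise ValueError(f"unexpected output: {text!r}")
    return int(match.group(1))

def vcgencmd(*args):
    """Run vcgencmd (only available on a Raspberry Pi) and return its raw output."""
    done = subprocess.run(["vcgencmd", *args], capture_output=True, text=True, check=True)
    return done.stdout

def timed_throughput(process_image, image):
    """Return (result, images per second) for one image at batch size 1.

    `process_image` is a hypothetical callable covering reading,
    pre-processing, inference and saving of results.
    """
    start = time.perf_counter()
    result = process_image(image)
    elapsed = time.perf_counter() - start
    return result, 1.0 / elapsed
```

On the board itself, temperature and frequency would be sampled with `parse_temp(vcgencmd("measure_temp"))` and `parse_clock(vcgencmd("measure_clock", "arm"))` after each image; CPU utilization could be read in the same loop, for instance with `psutil.cpu_percent()`, although the exact psutil call used by the authors is not stated.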
To identify performance trends on each DL framework, these metrics were measured for three CNNs with different architectures capable of recognizing 1000 image categories, namely SqueezeNet [40], GoogLeNet [41], and ResNet-50 [42]. We used pre-trained implementations of these models provided by each framework [31]-[39]. Table I summarizes their main architectural and operational aspects. We coded in Python, since it is the language in which the network definitions are available for all of the frameworks. Furthermore, all these DL tools use the single-precision floating-point data format (float32). Fig. 1 depicts the temporal evolution of the metrics defined above when performing image recognition with SqueezeNet. Fig. 2 shows the average values of CPU utilization and throughput for all the cases over the 6-minute inference period. The following aspects must be emphasized:
• CPU utilization is quite stable for each framework over the inference period, but its average varies significantly among frameworks. Caffe reaches the highest value for the three network models tested.
• There are different patterns of temperature evolution. When the temperature approaches 80°C, the processor protects itself by downclocking, which in turn decreases the throughput. This has a great impact on the total number of processed images over the test period.
• Although Caffe is apparently the framework that makes the most of the CPU, its throughput is the lowest for the three CNNs. (We have in fact observed this seeming contradiction for two further models, namely Network-in-Network [43] and MobileNet [44].)
Next, we delve into the details of hardware exploitation in order to elucidate the underlying reasons for this behavior, in particular for the contradiction arising in Caffe between CPU utilization and throughput.
III. Hardware Exploitation Analysis

A. Methodology
We extracted statistics on the processing load and memory usage of the CPU for the four analyzed frameworks when performing inference with SqueezeNet, GoogLeNet and ResNet-50. For the sake of a fair comparison, we set a fixed number of images, N = 50, to be processed in all cases. Otherwise, the resulting metrics would be biased by the different inference pace of each framework. These 50 images were randomly taken from the ImageNet dataset [45]; using different input images does not change the quantitative outcomes of our study. The targeted parameters were obtained with the perf tool [46], which gathers data through counters and event monitors provided by the Performance Monitoring Unit (PMU) included in the Cortex-A53 processor. In particular, we gathered data related to PMU hardware events [47]. Since our analysis is focused on inference processing, we followed a two-phase approach to exclude the statistics related to loading the CNN model weights. Firstly, we collected statistics when running the whole inference script ($s_1$). Secondly, we singled out the counts associated with the creation and load of the model architecture ($s_2$). Thus, the statistics employed to compare hardware performance are derived as $(s_1 - s_2)/N$, i.e., statistics that represent per-image performance. In order to reduce estimation errors (keep in mind that perf provides count estimates), we averaged the values from 5 measurements.
Fig. 3 depicts the most representative parameters among all the gathered statistics. Let us examine them carefully. First, note that the particular coding techniques and libraries making up Caffe yield, for the RPi's CPU, the highest number of instructions (Fig. 3(a)) and demand the highest number of memory accesses (Fig. 3(c)) in all cases. The processor does its best to cope with these requirements. This is why the rates of instructions per second (Fig. 3(b)) and memory accesses per second (Fig. 3(d)) are also the highest for Caffe, which in turn explains why this framework reaches the highest CPU utilization, as mentioned in Section II-C. However, even executing more instructions per second and fetching more data per second than any other framework, Caffe attains the lowest throughput due to its notably greater demand for processing and memory as a whole.
We must also point out that Caffe is the framework for which the CPU applies branch prediction most extensively (Fig. 3(k)). This means that the processor executes instructions before knowing for sure whether they will finally be needed. If the prediction is correct, the result is available sooner, thereby accelerating inference. In the case of Caffe, the performance of the CPU in terms of branch prediction is poor (Fig. 3(l)), which adds uselessly executed instructions. Concerning cache exploitation, Caffe is distinctively good at loading data into L1 (Fig. 3(e)) that will be successfully fetched later on (Fig. 3(f)). The exploitation of L2 and TLB by Caffe is also notable (Figs. 3(g)-3(j)); note that the OpenBLAS library, exploited by Caffe, is highly oriented to this accomplishment [8]. This suggests that the main reason for the poor coupling between Caffe and this CPU could be a poor mapping between the high-level instructions in Caffe's source code and the processor's instruction set.
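The two-phase derivation of per-image statistics, $(s_1 - s_2)/N$ averaged over repeated measurements, can be written as a short Python sketch. The function name `per_image_stat` and the counts in the usage note are hypothetical; the actual counts would come from perf.

```python
from statistics import mean

def per_image_stat(s1_runs, s2_runs, n_images):
    """Average (s1 - s2) / N over repeated measurements.

    s1_runs: event counts for the whole inference script, one per run.
    s2_runs: event counts for model creation and loading only, one per run.
    n_images: number of images N processed by the script.
    """
    if len(s1_runs) != len(s2_runs):
        raise ValueError("s1_runs and s2_runs must have the same length")
    if n_images <= 0:
        raise ValueError("n_images must be positive")
    return mean((s1 - s2) / n_images for s1, s2 in zip(s1_runs, s2_runs))
```

For example, with made-up counts, `per_image_stat([1100, 1200, 1000], [100, 200, 0], 50)` yields 20 events per image. Subtracting the loading-phase counts before dividing keeps one-off model set-up costs out of the per-image figure.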
B. Experimental Results and Discussion
With respect to the other three frameworks, there are also differences to be highlighted. The reduction in instructions with respect to Caffe shown in Fig. 3(a) suggests a better exploitation of the ARM SIMD instruction set. In fact, these frameworks leverage the ARM hardware optimizations through compilation. TensorFlow stands out as the most efficient framework, requiring the minimum number of instructions and memory accesses to complete the inference (Fig. 3(a) and Fig. 3(c), respectively). This characteristic, in conjunction with high rates of instruction execution and memory access (Fig. 3(b) and Fig. 3(d), respectively), enables the highest average throughput, achieved by TensorFlow for GoogLeNet and ResNet-50; OpenCV is the best option for SqueezeNet (see Fig. 2(b)). The most effective framework in terms of branch prediction is Caffe2. Regarding cache memory exploitation, TensorFlow and Caffe2 present a similar performance. OpenCV makes poor use of L1 but is [...] (Figs. 3(e)-3(h)).
Concerning the differences between the three studied CNN architectures, the number of executed instructions in Fig. 3(a) is consistent with the number of MAC operations reported in Table I, although each framework shows a distinctive relationship, as explained above. Likewise, the more weights the network has (Table I), the more data memory accesses it requires (Fig. 3(c)). In addition, the extracted hardware metrics have a remarkable correlation with throughput for all the networks, as highlighted in Fig. 5, where the previously reported values form a scatter plot with a nearly linear pattern.
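To make the link between architecture and workload concrete, the MAC and weight counts of a single convolutional layer can be computed as below. This is a generic textbook calculation, not the procedure behind Table I; the function name `conv_layer_cost` is hypothetical, and bias terms are ignored.

```python
def conv_layer_cost(c_in, c_out, kernel, out_h, out_w):
    """Return (MACs, weights) for one square-kernel convolutional layer.

    Every output value needs c_in * kernel * kernel multiply-accumulate
    operations, and the layer produces c_out * out_h * out_w output values.
    Bias terms are ignored (illustrative simplification).
    """
    weights = c_out * c_in * kernel * kernel
    macs = weights * out_h * out_w
    return macs, weights
```

Because each weight is reused at every output position, the MAC count grows with the feature-map size while the weight count does not, which is why the two quantities are tracked separately when relating a network to instruction counts and memory accesses.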
C. Power Consumption
Besides explaining throughput and CPU usage as discussed, the statistics extracted with the perf tool also exhibit correlation with power consumption. Fig. 4 depicts instantaneous power measured with a Keysight N6705C DC Power Analyzer vs. three hardware metrics simultaneously sampled every 10 milliseconds. This figure corresponds to four consecutive GoogLeNet inferences running on Caffe. Similar results have been obtained for the other frameworks and networks. Table II summarizes the Pearson correlation coefficients between these metrics and power consumption for each framework running GoogLeNet. Note that the coefficients are greater than 0.66 in all cases, reaching a value of 0.95 for L1 D-cache accesses during inference on TensorFlow. Taking into account the importance of power consumption in embedded vision applications, its relevance in optimization loops [48], and how difficult its direct measurement is (supply pins must be accessible and special equipment like the aforementioned power analyzer is required), the proposed hardware metrics constitute a simple way to characterize embedded platforms.
IV. Conclusion
An optimal selection of DL software framework and DNN architecture for a particular embedded hardware platform definitely makes a difference in terms of performance. Specific coding strategies and acceleration libraries implemented by the frameworks exploit the underlying hardware in diverse manners, giving rise to a wide range of inference rates and power profiles even on the same network model. Instead of directly modelling the expected performance and power consumption of DL frameworks and DNNs on a particular CPU-based platform, we propose to carry out such modelling through metrics of hardware exploitation. These metrics can be easily extracted through standard tools. In this paper we present a preliminary study that supports the applicability of our proposed approach. Our next step will be to develop a performance model based on such metrics and insert it into an optimization loop in order to determine the best selection according to prescribed specifications.
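The Pearson coefficient used to relate a sampled hardware metric to instantaneous power can be computed as in the following sketch. The function name `pearson` is illustrative, and the two input sequences are assumed to be sampled at the same instants.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sample series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

Here `x` would hold, for instance, the per-interval counts of one hardware event and `y` the power samples over the same intervals; values close to 1 indicate that the event count tracks power closely.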
