Neural networks have become one of the most successful universal machine-learning algorithms. They play a key role in enabling machine vision and speech recognition and are increasingly adopted in other application domains. Their computational complexity is enormous and comes with equally challenging memory requirements with regard to capacity and access bandwidth, which limits deployment, particularly within energy-constrained, embedded environments. To address these implementation challenges, a broad spectrum of new customized and heterogeneous hardware architectures has emerged, often accompanied by co-designed algorithms to extract maximum benefit out of the hardware. Furthermore, numerous optimization techniques are being explored for neural networks to reduce compute and memory requirements while maintaining accuracy. This results in an abundance of algorithmic and architectural choices, some of which fit specific use cases better than others.
different neural network models and variations and, depending on these factors (as well as numerical representations, learning techniques and hyperparameter selection), can produce different results, the key figure of merit being test error rate or, conversely, accuracy. There are numerous choices with different hardware platforms within the cloud and IoT spaces and everywhere in between. All of the implementation alternatives will deliver different performance in tera or giga operations per second (TOP/s or GOP/s), response time, power consumption, cost, and required development effort.
Within this space there are two main types of benchmarks: machine-learning (ML) benchmarks and performance benchmarks. Machine-learning benchmarks are typically aimed at achieving low test error, independent of the hardware implications, and the resulting solutions are therefore often of limited efficiency. Examples are the ILSVRC ImageNet competition, as well as more sophisticated efforts such as MLBench [48]. Performance benchmarks are agnostic of the target application and measure performance characteristics such as throughput and power for characteristic compute patterns. Even when tailored toward characteristic ML workloads, they do not capture the fact that different hardware architectures call for different compute patterns. Most importantly, they do not correlate their results regarding algorithmic optimization back to the application-level target, which is accuracy, and therefore do not provide the necessary freedom and scope for algorithmic modifications, an essential ingredient in extracting performance out of heterogeneous computing systems.
In this article, we present QuTiBench, a benchmarking suite that lies at the intersection of the machine-learning and hardware communities and spans the full design space. QuTiBench couples neural network performance with hardware performance and as such can provide insights as to what is the best possible combination within this design space for specific use cases. Although there are a number of efforts emerging in this space, such as DeepBench and MLPerf, there is currently no comprehensive benchmarking suite in existence that addresses the scope of what is needed and in particular targets embedded systems. QuTiBench is unique in the way we support quantization (Qu), which is an important optimization technique for neural networks and leveraged by many specialized hardware architectures. Furthermore, QuTiBench provides multiple tiers of tests (Ti) that can provide deep insights for the composition of complex systems and provide tradeoffs between speed and accuracy across a broad range of systems.
The main contribution of this article is the definition of QuTiBench, which has the following unique features:
• It is a multi-tiered approach that supports a range of compromises for benchmarking with regard to quality of prediction and effort. In particular, QuTiBench supports theoretical results as a measuring stick, different computational patterns for different neural networks, and combinations of microbenchmarks and full applications for addressing the end user design space.
• It supports algorithmic optimizations and levels of development effort, including naive and optimized implementations, by correlating everything at the application level's figures of merit.
• In particular, QuTiBench supports different approaches to quantization at all levels, which is essential for efficient, low-power architectures.
• It supports a broad range of applications, both inference and training, and available systems from cloud to IoT.
QuTiBench is still in its early stages. We hope the community will help make this a valuable contribution to the machine-learning field. In this article, we provide the first analysis of theoretical compute and memory requirements for both applications and candidate hardware platforms, which forms level 0 of our benchmark suite. We present initial experimental results to validate the benchmarking methodology, as well as outline plans for the remaining levels.
The remainder of this article is structured as follows: We start with background on neural networks. Section 3 analyses the compute and memory requirements of a broad selection of networks. Section 4 provides details on different hardware architectures and how inference and training workloads can be mapped to them. This provides insights into the spectrum of implementation choices and how they are represented within the benchmark suite. In Section 5, we take a closer look at the key components, characteristics, and challenges of a benchmarking suite in machine learning. Section 6 describes existing efforts in this space, and Section 7 introduces the key concepts of QuTiBench. We evaluate our approach with experimental results in Section 8. Section 9 concludes the article and presents future directions. Full experimental results can be found in the appendix.
There is a large application space for neural networks (see Table 1) with domains ranging from vision to natural language processing (NLP) to gaming and recommendation systems. In each domain, there are numerous tasks that are amenable to neural networks; for example, within the vision processing context: image classification, object detection, and semantic segmentation. Furthermore, these models can be trained using different training techniques. Note that it is not easy to define clear categories, as terms overlap. For example, deep reinforcement learning techniques can be applied to any network. Seq2Seq networks are a full family of networks, while ResNet50, VGG, and InceptionV3 refer to specific topologies. Table 1 shows the pool of candidate neural networks that we plan to use as part of our benchmark, including both inference and training. While there is a large breadth of neural networks, many layer types are common across them, which makes them ideal to form levels 1 and 2 of QuTiBench. These layer types equate to the basic computational patterns and are based on previous analysis [1]. The most popular compute layers are fully connected, convolutional, pooling, normalization, and recurrent layers. These come with very different compute and memory requirements and are briefly discussed here. A more detailed description can be found in Reference [75]. Fully connected layers compute the full cross product between the inputs (for example, input tensors) and a vector of weights; the latter are determined during training. The result is added to a bias and then fed into an activation function. Popular activation functions include the hyperbolic tangent function and the rectified linear unit (ReLU). In convolutional layers, the output receives inputs from a small receptive field of the previous layer. This approach greatly reduces the number of parameters (or weights) involved and allows local features (e.g., edges, corners) to be found [47].
A basic two-dimensional (2D) convolutional layer is similar to a fully connected layer except that (a) each neuron receives an image as input and produces an image as its output (instead of a scalar), (b) each synapse learns a small array of weights that is the size of the convolutional window, and (c) each pixel in the output image is created by the sum of the convolutions between all synapse weights and the corresponding images. Recurrent layers are characterized by the fact that they contain state over a sequence of input data. There are many different options for the implementation of the recurrence within the layer, ranging from simple recurrent layers to GRUs or LSTM layers, which can be uni- or bidirectional, feature different numbers of feedback gates, and may include numerous specializations such as peepholes and CRCs. Beyond these basic layer types, there are many layer combinations emerging, such as inception layers in GoogleNet [76, 77], residual layers in ResNet models [36], and so-called fire modules [38]. During training using backpropagation with stochastic gradient descent, we need to compute the derivative with respect to all inputs of these layers. This works out to be similar in compute patterns to inference with transposed versions of the inputs, whereby a significantly larger amount of compute and memory is required [75]. However, additional compute such as batch normalization needs to be addressed.
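As a minimal sketch of the two basic compute patterns described above, the following pure-Python code implements a fully connected layer (weighted sum plus bias, followed by ReLU) and a single-channel 2D convolution. The function names, the stride of 1, and the absence of padding are assumptions made for illustration; they are not taken from the benchmark suite.

```python
def relu(x):
    """Rectified linear unit: passes positive values, clamps the rest to zero."""
    return x if x > 0.0 else 0.0


def fully_connected(inputs, weights, biases):
    """Fully connected layer: every output sees every input.

    inputs:  list of N values
    weights: M rows of N weights (one row per output neuron)
    biases:  list of M values
    """
    outputs = []
    for row, bias in zip(weights, biases):
        acc = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(relu(acc))
    return outputs


def conv2d_single_channel(image, kernel):
    """2D convolution as commonly used in neural networks (no kernel flip),
    stride 1, no padding: each output pixel only sees a small receptive field."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


fc_out = fully_connected([1.0, 2.0], [[1.0, 1.0], [-1.0, -1.0]], [0.5, 0.0])
conv_out = conv2d_single_channel([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])
```

Note that the convolution reuses the same four kernel weights for every output pixel, whereas the fully connected layer needs one weight per input-output pair; this is the parameter reduction mentioned above.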
Optimization Techniques
As mentioned in the Introduction, the challenge lies within the compute and memory requirements, which can often preclude inference deployment within the IoT context. To alleviate the computational burden and maximize performance, many optimization techniques have been introduced. Particularly successful techniques include pruning, compression, low rank approximations, and quantization [33] . We discuss quantization, a specific focus of this work, and pruning in more detail below. All of these techniques fall under the category of algorithmic optimizations. A representative benchmark supports and measures these, as they are essential for viable deployment solutions.
Quantization and Numerical Representations. Transprecision computing is making strides in many application domains [51, 78] and is highly effective for neural network inference, in particular quantization to reduced-precision datatypes, including 8-bit fixed-point integers and below as well as custom floating point formats. For example, quantized neural networks (QNNs) have been shown to work extremely well. On smaller image classification benchmarks such as MNIST, SVHN, and CIFAR-10, QNNs achieve state-of-the-art accuracy despite the reduction in precision [17, 93], even for partial or full binarization of fully connected and convolutional layers, while for the more challenging ImageNet benchmark, there is a small but noticeable accuracy drop. For instance, XNOR-Net [67] applies convolutional binarized neural networks (BNNs) to the ImageNet dataset with topologies inspired by AlexNet, ResNet, and GoogLeNet, reporting top-1 accuracies of up to 51.2% for full binarization and 65.5% for partial binarization. The resulting solution can run significantly faster in hardware and might still pose an attractive design tradeoff. Furthermore, there is significant evidence that increasing network layer size can recoup this drop in accuracy [27, 44, 56, 74, 91].
New quantization schemes show promising results, using for example Half-wave Gaussian Quantization (HWGQ) [10] to take advantage of the Gaussian-like distribution of batch-normalized activations. Furthermore, new training and optimization techniques [55, 96] work effectively. The current lowest error rates for ImageNet classification have been achieved using ternarization [3, 94], as shown in Table 2. Quantization has been successfully applied to other tasks including 3D object recognition, facial expression recognition [50, 73], optical character recognition, as well as speech [31, 49, 70]. Even in training, research shows that 32 bits are not really needed given the typical value ranges for weight and activation gradients and the weight updates involved. Fixed-point integers, half-precision floating point (FP16), bfloat16, flexpoint, or block floating point representations show state-of-the-art performance [30, 45, 54, 89]. All of these need to be accurately reflected within the tests.
Pruning. This is another popular optimization that has been shown to dramatically reduce memory requirements through either synaptic pruning or filter pruning. When synaptic pruning is leveraged, irregular compute patterns result that impact memory access efficiency, and thus hardware architectures require support for sparse matrix representations to benefit from this [31]. Filter pruning yields regular compute patterns and thereby benefits a broader selection of platforms [33].
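To make the idea of reduced-precision datatypes concrete, here is a minimal sketch of two simple schemes: uniform symmetric quantization to signed integers and full binarization to {-1, +1}. The per-tensor scale, the rounding mode, and the function names are illustrative assumptions; the schemes cited above (for example XNOR-Net or HWGQ) are more elaborate and are not reproduced here.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers in [-(2^(bits-1) - 1), 2^(bits-1) - 1].

    Assumption: one scale factor for the whole tensor, chosen so that the
    largest magnitude maps to the largest integer.
    """
    if bits < 2:
        raise ValueError("use binarize() for 1-bit representations")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


def binarize(values):
    """Full binarization: keep only the sign of each value."""
    return [1 if v >= 0 else -1 for v in values]


weights = [0.9, -0.31, 0.05, -1.2]
q8, scale = quantize_symmetric(weights, bits=8)
approx = dequantize(q8, scale)   # each entry is within scale / 2 of the original
signs = binarize(weights)
```

With this scheme, the storage per weight drops from 32 bits to 8 bits (or to 1 bit for binarization) at the cost of a bounded rounding error, which is the tradeoff between memory footprint and accuracy discussed above.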
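The difference between the two pruning styles can be sketched as follows. The magnitude threshold and the L1-norm ranking used here are common heuristics assumed for illustration only; the text above does not prescribe a specific pruning criterion, and all function names are hypothetical.

```python
def prune_synapses(weights, threshold):
    """Synaptic pruning: zero individual weights whose magnitude is below
    the threshold. The matrix keeps its shape but becomes irregularly sparse."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def to_coordinate_list(weights):
    """Sparse (row, column, value) representation of the remaining weights,
    the kind of format hardware needs to support to profit from sparsity."""
    return [(r, c, w)
            for r, row in enumerate(weights)
            for c, w in enumerate(row)
            if w != 0.0]


def prune_filters(filters, keep):
    """Filter pruning: drop whole filters, keeping the `keep` filters with the
    largest L1 norm. The result is a smaller but still dense weight set."""
    ranked = sorted(range(len(filters)),
                    key=lambda i: sum(abs(w) for w in filters[i]),
                    reverse=True)
    kept_indices = sorted(ranked[:keep])  # preserve the original filter order
    return [filters[i] for i in kept_indices]


sparse = prune_synapses([[0.5, -0.01], [0.02, -0.8]], threshold=0.1)
coords = to_coordinate_list(sparse)
dense_small = prune_filters([[0.1, 0.1], [1.0, -1.0], [0.5, 0.5]], keep=2)
```

The synaptic variant requires index bookkeeping per surviving weight, while the filter variant simply yields a layer with fewer output channels that any dense compute engine can execute.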
NEURAL NETWORKS AND THEIR COMPUTE AND MEMORY REQUIREMENTS
We analyze neural networks with regard to their arithmetic compute, intermediate storage requirement, and memory footprint. While actual hardware requirements depend on numerous attributes, at this point we are characterizing the theoretical requirements in an architecturally independent way. For example, actual on-chip requirements and external memory requirements depend on implementation choices but can be derived directly, so this analysis is useful to categorize the different requirements. The scope of the analysis is currently constrained to the models shown in Figure 5; the planned scope is listed in the appendix.
Inference. Each NN layer ($L_0$, $L_1$, etc.) requires a specific number of arithmetic operations $O_{L0}$, $O_{L1}$, $O_{L2}$ in the form of multiplies, additions, and so on. We measure these in giga or tera operations, respectively (GOPs, TOPs). The overall compute of a network with $n$ layers, $O_{total}$, is the sum of the compute in each individual layer (see Equation (1)). We define the total model size $W_{total}$ as the sum of the weight requirements per layer measured in millions of elements (ME); this is independent of any choice in numerical representation. The real memory footprint can be derived by multiplying with the size of the given datatype (for example, 32b for single-precision floating point). We quantify the intermediate buffer requirement $T_{total}$ in an implementation-neutral fashion. For this, we calculate the sum of the sizes of the tensors $T_i$ that precede each layer. These are derived as the product of feature map dimensions ($w_i$, $h_i$) and number of channels ($ch_i$). Note that all of this applies to non-linear topologies such as DenseNet [37]; however, our models currently do not reflect graph connectivity. We plan to address this in the future.
Training. While training is currently the focus in the cloud, we expect that it will become essential in embedded settings as well, as on-line learning takes off. With regard to requirements, we need to consider backpropagation in addition to inference. As depicted in Figure 4, training requires additional data structures. First, symmetrically to the tensors $T_i$, we need to buffer their gradients $TG_i$. Furthermore, so-called weight gradients $WG_i$ need to be stored, which are the derivative (in relation to the input weights) of the gradient $TG_{i+1}$. Depending on the given optimization strategy, weight updates need to be buffered as well. This results in roughly 3 times the buffer requirements for weights and double the amount for tensors. Regarding compute, backpropagation requires roughly 3 times the inference compute for a single image of the training dataset (plus one update operation per weight parameter). Overall compute needs to be multiplied by the number of iterations and the number of inputs in the training dataset. Note that data dependencies are significantly more intricate and challenging for training. This is currently not reflected within the theoretical analysis.
Summary of Requirements. Figure 5 visualizes initial results, where for Seq2Seq models, we assume a sequence length of 3,000 (based on the LSTM test case in DeepBench [20]). The key observations are as follows: First, the compute and memory requirements are on average very high. The mean model size (71.14MB) is too big to fit into most on-chip low-latency memory, and compute is in the GOPs range for every single input datum. Second, there is a significant variation in all requirements for both training and inference, as summarized in Table 3. No simple generalizations can be made, even within subcategories such as image recognition, as models vary greatly depending on size and complexity of images, number of objects to be recognized, and so on.
The defined parameters, $O_{total}$, $W_{total}$, $T_{total}$, $OT_{total}$, $WU_{total}$, and $TG_{total}$, help describe the compute requirement for inference and training of each individual network. They can be used for baseline computations that take architectural constraints into consideration and can be cross-correlated with roofline models to provide rough performance guidance.
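The per-layer sums described above can be sketched for a small stack of convolutional layers as follows. The layer shapes are invented for illustration, and several simplifications are assumed: stride 1, no padding, biases ignored, and each multiply and each add counted as one operation.

```python
def conv_layer_requirements(w, h, ch, out_ch, k):
    """Theoretical requirements of one convolutional layer.

    w, h, ch : width, height, and channels of the input feature map
    out_ch   : number of output channels
    k        : kernel size (k x k window)
    Returns (operations, weight elements, input tensor elements).
    """
    out_w, out_h = w - k + 1, h - k + 1          # stride 1, no padding (assumed)
    ops = 2 * k * k * ch * out_ch * out_w * out_h  # one multiply + one add per MAC
    weights = k * k * ch * out_ch                  # datatype-independent element count
    tensor = w * h * ch                            # tensor preceding this layer
    return ops, weights, tensor


def network_totals(layers):
    """Sum compute, model size, and intermediate buffers over all layers."""
    per_layer = [conv_layer_requirements(*layer) for layer in layers]
    o_total = sum(p[0] for p in per_layer)
    w_total = sum(p[1] for p in per_layer)
    t_total = sum(p[2] for p in per_layer)
    return o_total, w_total, t_total


# Hypothetical two-layer network: (w, h, ch, out_ch, k) per layer.
layers = [(32, 32, 3, 16, 3), (30, 30, 16, 32, 3)]
o_total, w_total, t_total = network_totals(layers)

# The real memory footprint follows from the datatype size, e.g. 4 bytes for FP32.
footprint_fp32_bytes = w_total * 4
footprint_int8_bytes = w_total * 1
```

Because the weight and tensor counts are expressed in elements rather than bytes, the same totals can be reused to compare numerical representations simply by changing the bytes-per-element factor.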
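The rough training multipliers quoted above can be turned into a back-of-the-envelope estimate. This is only a sketch: it assumes that the factor of 3 covers the complete compute for one training input, that one weight update is performed per input, and that "iterations" means passes over the dataset. None of these interpretations is confirmed by the text, and the function and parameter names are hypothetical.

```python
def training_requirements(o_total, w_total, t_total, num_inputs, num_iterations):
    """Estimate training requirements from inference totals.

    o_total : inference operations per input
    w_total : weight elements
    t_total : intermediate tensor elements
    """
    # Weights, weight gradients, and weight updates: roughly 3x the weights.
    weight_buffer_elems = 3 * w_total
    # Tensors and their gradients: roughly 2x the tensors.
    tensor_buffer_elems = 2 * t_total
    # Roughly 3x inference compute, plus one update operation per weight.
    ops_per_input = 3 * o_total + w_total
    total_ops = ops_per_input * num_inputs * num_iterations
    return {
        "weight_buffer_elems": weight_buffer_elems,
        "tensor_buffer_elems": tensor_buffer_elems,
        "ops_per_input": ops_per_input,
        "total_ops": total_ops,
    }


# Hypothetical numbers purely for illustration.
estimate = training_requirements(
    o_total=8_002_944, w_total=5_040, t_total=17_472,
    num_inputs=50_000, num_iterations=10,
)
```

As noted in the text, this kind of estimate ignores the more intricate data dependencies of training, so it should be read as a lower-level sanity check rather than a performance prediction.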
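A roofline model bounds attainable performance by the minimum of peak compute and memory bandwidth multiplied by arithmetic intensity (operations per byte moved). The sketch below shows how the theoretical totals could feed such a model. It assumes that every weight and tensor element is transferred exactly once from external memory, and the platform numbers are invented for illustration.

```python
def arithmetic_intensity(ops, weight_elems, tensor_elems, bytes_per_elem):
    """Operations per byte, assuming each element is moved exactly once."""
    bytes_moved = (weight_elems + tensor_elems) * bytes_per_elem
    return ops / bytes_moved


def roofline(peak_ops_per_s, bandwidth_bytes_per_s, intensity):
    """Attainable performance: compute-bound or memory-bound, whichever is lower."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * intensity)


# Hypothetical platform: 1 TOP/s peak compute, 10 GB/s memory bandwidth.
PEAK = 1e12
BANDWIDTH = 1e10

# Same network, two numerical representations (4 bytes vs 1 byte per element).
ops, w_elems, t_elems = 8_002_944, 5_040, 17_472
for label, bytes_per_elem in (("FP32", 4), ("INT8", 1)):
    ai = arithmetic_intensity(ops, w_elems, t_elems, bytes_per_elem)
    bound = roofline(PEAK, BANDWIDTH, ai)
    print(f"{label}: intensity = {ai:.1f} ops/byte, attainable = {bound:.3e} ops/s")
```

Reducing the datatype size raises the arithmetic intensity for the same operation count, which moves a memory-bound workload closer to the compute roof. A real platform may also offer a different peak for each precision, which this sketch does not model.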
HARDWARE ARCHITECTURES FOR DEEP LEARNING
We discuss target hardware systems, their architectures, and implementation alternatives. While we present details on cloud platforms, the focus of this article is on embedded systems. There is a huge range in the types of hardware architectures used for machine-learning applications, including CPUs, GPUs, FPGAs, and specialized architectures. The field has spawned significant new research in computer architecture and created so-called deep learning processing units (DPUs), which are specialized for this application domain and can be implemented either as ASICs or in FPGAs. Architectures can broadly be classified by the basic type of compute operation, memory bandwidth, level of parallelism, degree of specialization, and inherent precision support. CPUs are widely used for ML applications and are viewed as serial compute engines that are optimized for single-thread performance, have implicitly managed memory hierarchies (including three levels of caches), and support floating point operations. GPUs are vector processors that natively support smaller floating point formats (FP16) and, most recently, fixed-point 8-bit integer formats, and have a mix of implicitly and explicitly managed memory. DPUs, such as Google's Tensor Processing Unit (TPU), work with tensors, have explicitly managed and specialized memory hierarchies, and support integer operations. With newer generations, the boundaries between different hardware architectures are blurring. CPUs are usually multicore to support parallel processing and incorporate vector processing units, GPUs are adding tensor processing units, and the TPU now supports floating point operations. FPGAs can support any of the above configurations with explicitly managed memory. FPGAs are the most flexible of all target hardware and can be configured to support any numeric representation, even bit-serial hardware architectures that provide runtime-configurable precision.
Custom ASIC implementations, which minimize hardware cost and maximize performance, have emerged to exploit specific precision arithmetic and customized memory systems. Tables 4 and 5 list many of these hardware targets along with published performance numbers. One of the goals of QuTiBench is to provide a more systematic way to compare performance and accuracy between these systems rather than relying on vendor-reported metrics.
NVIDIA GPUs are some of the most popular hardware targets for machine learning, and newer families of chips have been introduced to specifically accelerate this task. For example, the Volta architecture, introduced in 2017, was designed specifically to accelerate AI and incorporates tensor cores as a new feature, as well as improved FP32 and FP64 support for training in a data center setting [22]. AMD announced the Vega GPU [24] with new deep learning instruction set operations, with the goal of obtaining parity with NVIDIA's high-end Tesla V100 datacenter GPUs. Both companies have low-power GPUs: the AMD Vega mobile GPU [34] and NVIDIA Jetson TX2 [26].
Google introduced its TPU in 2016 [71], which was designed to accelerate Google's TensorFlow framework. The first generation supported integer arithmetic with a massively parallel 8-bit matrix multiply engine. The second-generation TPU was announced in May 2017 [41] and the third generation in May 2018 [80]. These newer chips boast improved memory performance as well as support for floating point specifically aimed at training.
There are a number of startups introducing custom hardware in this space. Within the cloud space, there are Graphcore, Cerebras, Groq, and Wave Computing. Within the embedded space, where the design constraints are even more stringent, we find even more, as are listed in Table 5 .
Most are secretive about the details of their designs, and this landscape is rapidly changing. Intel is investigating several custom accelerators, including Nervana and Movidius. Fathom [7] is Movidius' ultra-low-power Neural Compute Stick, which operates at about 1W. At the extreme, binarized neural networks, which are very high throughput at extremely low power, are exploited in the following ASICs: BinarEye [58], BNN Custom Fabric [5], Stripes Bitserial ASIC [42], and IBM AI Accelerator [39]. Others exploit sparse computing engines, such as EIE and its successor ESE [31], SCNN [66], Cnvlutin [2], and Cambricon-S and Cambricon-X [92].
FPGAs are an extremely popular platform for machine learning. As they are highly flexible, can be used in a variety of different configurations, and support any arithmetic format, they can be fully customized toward specific neural network topologies, thereby achieving high performance and efficiency. However, for the same reason, they are extremely difficult to characterize in general. FPGAs are available in the cloud, such as the Xilinx Ultrascale+ VU9P available as part of the public Amazon Web Services (AWS) cloud infrastructure. Within the embedded space, we have pioneered the first binarized neural network accelerators [27, 84] and provided many proof points for customized reduced-precision implementations [8]. Umuroglu et al. [86] demonstrate that runtime-programmable precision can be achieved with a bitserial approach, providing highly attractive performance on FPGAs with little overhead. Intel FPGAs have also been successfully applied to machine-learning applications using a range of different numerical representations [63]. The Microsoft Brainwave project [14] aims at applying FPGAs at datacenter scale using their own custom floating point representation. Focusing on the IoT market, Lattice has announced binarized neural network libraries targeting low-power FPGAs and achieving 1TOPS/W [46].
CHARACTERISTICS AND CHALLENGES IN BENCHMARKING
Key Components of a Benchmark
A benchmark can be defined as a set of standards used for evaluating performance or level of quality. A more practical definition implies that the "set of standards" is supplied in the form of a well-defined set of executable tests and measured regarding a specific set of figures of merit. Sometimes additional items are included, such as performance analysis or profiling tools, which can help shed light on system bottlenecks. Test infrastructure or a testbed can be provided to ensure reproducibility. This makes particular sense when specialized and not easily available hardware systems are involved. Data management can be handled together with the benchmark suite and stored in an accessible location, as is done, for example, with DAWNBench [16], MIT's Eyeriss project [25], and the Request tournament's online scorecard [68]. In this article, we differentiate profiling tools, test infrastructure, and measurements from the actual benchmark test suite (see Figure 6). Somewhat related to benchmarking are model zoos, such as OpenAI Gym [9] and rllab [21], which are selections of sample code. They do not necessarily aim to be representative and typically include simplified implementations to teach concepts. QuTiBench focuses initially on the benchmark suite and measurements.
Characteristics
Benchmarking can bring many insights. For end-users and system designers, it helps to estimate expected system-level performance and provides an understanding of what algorithms work best on which hardware platform. For hardware designers, benchmarks provide design perspectives and clear-cut guidelines regarding what figures of merit matter and what workloads look like. Neural networks are pushing the limits of what is possible; therefore, careful system-level co-design of hardware and algorithms, together with realistic expectations (obtained through benchmarking) of what is achievable given the design choices, is crucial. To bring maximum benefit, the following characteristics are essential, which are discussed in greater detail below:
• representative of common workloads
• supportive of algorithmic modifications
• objective and reproducible
• portable to heterogeneous hardware systems
• balanced in the tradeoff between complexity, speed, and accuracy
• adaptive "living" benchmark supported by industry and academia
Representative. Benchmarks need to be representative of real-world workloads. In machine learning, this requires breadth across a spectrum of applications, algorithms, and computational patterns. Computational patterns are important to maximize insights into different hardware architectures. Application coverage is essential, as it provides more holistic insights into system-level performance, which can be hard to predict given the emerging complexity of increasingly heterogeneous hardware systems.
Support for algorithmic modification. Algorithmic modifications are inevitable to extract the best possible performance from diverse hardware systems, for example to take advantage of caching and parallel hardware resources. Within machine learning, software and hardware codesign is compulsory [29] for energy-constrained compute environments. To support this algorithmic freedom within the benchmark suite, application coverage is essential, as we correlate hardware performance independent of the algorithm back to application performance, which is equivalent to accuracy in this context. However, optimized performance alone is not sufficient, as not every system designer may be able to achieve it. We also need to reflect the out-of-the-box, naive performance. Both optimized and naive are representative of a specific hardware platform, and the difference gives a good indication of the development effort involved. We believe both should be part of the benchmarks and be captured together with development time or lines of code. Specifically for neural networks, quantization, compression, topological changes, and pruning techniques are important optimization techniques that need to be considered.
Objective and Reproducible. To provide clear differentiation between marketing and scientific efforts, reproducible and objective results that do not favour any particular system configuration or hardware architecture are needed. Reproducible results are a key ingredient in the move toward Open Science; however, what does reproducibility actually entail? In the context of the plethora of esoteric AI accelerators, is it sufficient that an objective third party has validated the results? Or does it imply that everyone on the planet should be in a position to reproduce the results if they had access to the system at a reasonable cost? Some hardware systems are too expensive; for example, an NVIDIA V100 may be beyond someone's budget. Other hardware choices are only available for rent, such as Google's TPU versions as part of Google cloud.
Portability. Portability is a challenging subject, as specialized hardware architectures come with their own design entry languages and compiler tool stacks. The community is fragmented by a huge choice of frameworks including Caffe, TensorFlow, MXNet, Theano, PyTorch, and Darknet. Moreover, the prediction accuracy of a network depends on the choice of framework, since training data are passed through different preprocessing stages, and numerical inaccuracies accumulate and manifest themselves as discrepancies. These inaccuracies are exacerbated by the characteristics of floating point arithmetic [28]. As a result, models and frameworks are inherently tied together. There are three basic choices: The first is to constrain ourselves to exactly one framework, as was done with Fathom [1]. Second, we could support all frameworks. However, given that we are dealing with different hardware backends, this causes an explosion in test infrastructure, as the number of tests multiplies with the number of frameworks. The final choice, and probably the cleanest, is to support one of the intermediate neural network representations such as ONNX [65], NNEF [62], or TVM [83], which provide translation between all popular frameworks. However, this requires hardware vendor support, which is currently limited.
Complexity vs. Speed vs. Accuracy. Speed of result is essential, as the key purpose of a benchmark is to provide faster insights than developing the full end-system. There is a tradeoff between speed, benchmark complexity, and the accuracy of the results. Benchmarks that provide application and algorithmic breadth may require a large number of tests, thus making the benchmark suite inherently complex and limiting the usefulness of the benchmark. Sometimes it is important to have less accurate predictions at a faster rate, and, for different users, different tradeoffs are acceptable.
Adaptive. As machine learning is a highly active research field where algorithms change fast, the benchmark suite should be adaptive and able to incorporate emerging popular algorithms, compute patterns, and end applications.
RELATED WORK: EXISTING BENCHMARKING
In this section, we take a look at existing benchmarks and compare them regarding algorithmic scope and figures of merit. QuTiBench differs from these efforts in a number of ways:
• Existing benchmarks do not address the fact that heterogeneous hardware platforms typically require co-designed algorithms and offer flexibility in precision for datatypes specifically, although MLPerf has open models for training. We introduce correlation of application and architecture figures of merit to compare different combinations of algorithms and architectures at the application level.
• We offer full visualization of the design space rather than comparing performance for fixed levels of accuracy. Thus, interesting tradeoffs can be highlighted.
• None of the existing benchmarks offer the same level of tiering, including a theoretical level and stacks of microbenchmarks that can help isolate problematic data movement patterns and tensor dimensionalities.
• Finally, there is a difference in scope. Most benchmarks currently focus foremost on training.
In the following, we expand and elaborate on the differences in greater detail. For this, we differentiate among ML benchmarks, performance benchmarks, and NN system benchmarks.
Machine-learning benchmarks exclusively focus on application performance, which is accuracy.
There is no consideration of compute effort required or resulting execution time. Performance benchmarks record hardware performance only, specifically throughput (measured in processed inputs per time or TOP/s), latency or response time in milliseconds (ms), and power consumption in watts. Performance benchmarks only look at hardware performance and are agnostic of the application. NN system benchmarks, as shown in Figure 7 , lie at the intersection and are at the heart of what we are striving for. They combine all figures of merit; both system performance and accuracy are correlated. In addition, functional correctness even during performance testing needs to be ensured.
NN System Benchmarks
QuTiBench falls into this family of benchmarking suites that are unique in that they combine representative machine-learning workloads with figures of merit from hardware performance benchmarks. A comprehensive comparison of all benchmarks can be found in Table 7. BenchIP [79] is a benchmarking suite that has a broad set of machine-learning tasks. Similarly to QuTiBench, BenchIP adopts a multi-tiered approach with micro- and macro-benchmarks. However, BenchIP does not support the theoretical layer, which we use to cover compute efficiency and track benchmarking results. BenchIP also does not cover level 2, namely stacks of layers, which we believe bring great merit in isolating bottlenecks in data movement and highlighting problematic dimensionality in tensors. Finally, BenchIP does not offer the concept of comparison via Pareto curves, which is essential to (a) visualize the full scope of potential solutions within the design spectrum and (b) provide the necessary scope for algorithm optimizations matching the specifics of various accelerators. Fathom [1] is probably the first attempt to provide a representative workload for benchmarking that has algorithmic breadth beyond convolutional neural network inference and includes example training and unsupervised learning such as reinforcement learning and recurrent models. However, Fathom does not address the spectrum of numerical representations. It also does not support heterogeneous hardware platforms. With regard to framework strategy, Fathom advocates a unified software package, relying on compatible software stacks to emerge, and therefore only supports one framework, TensorFlow. With a primary focus on benchmarking for training and achieving application coverage rather than algorithmic breadth, TBD [95] adopts some of the concepts introduced in Fathom. 
It supports more frameworks and datasets and covers a range of applications, including image classification, machine translation, object detection, speech recognition, and adversarial and deep reinforcement learning. MLPerf [57] is a promising approach to providing system-level benchmarks. Similarly to Fathom and TBD, it covers a representative range of applications, adding sentiment analysis and recommendation as target applications. It currently considers only training, but support for inference is in progress. MLPerf is created by a consortium of industry partners and universities, which should address objectivity criteria. Its key strengths are its explicitly defined figures of merit and its strong industrial support. It provides the concept of open models, which allow for algorithmic optimizations that facilitate performance improvements for specific architectures. However, it does not explicitly support quantization.
DAWNBench [16] exclusively looks at ImageNet classification for training and inference. The benchmark sets very clear figures of merit such as "Time taken to train an image classification model to a top-5 test accuracy of 93% or greater" and "Latency required to classify one ImageNet image using a model with a top-5 test accuracy of 93% or greater," and as such supports the concept of algorithmic optimizations by tying hardware performance to accuracy achieved at the application level but falls short of visualizing the full design space. Finally, DAWNBench does not provide further insights beyond the specified figures of merit and is limited in application scope.
The Collective Knowledge Framework [15] in conjunction with the ASPLOS Request Tournament [68] , while narrow in scope (limited to ImageNet Classification inference), opens up the design space for different hardware accelerators, facilitating architecture specific algorithmic transformations and correlation between accuracy and performance and power within a larger design space. This is essential to support heterogeneous hardware architectures. ASPLOS excels in reproducibility, leveraging ACM artifact evaluation technology, and providing insight into hardware performance and error rate tradeoffs, through an online scorecard.
ML Benchmarks
The machine-learning community has defined its own benchmarks that have an exclusive focus on achieved accuracy independent of the required compute, employing ensemble techniques and multi-crop that, in essence, linearly scale up the compute load per input. The most popular of these is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [69]. The associated compute requirements are unrealistic, particularly when deployed in energy-constrained environments. CortexSuite [81] and BenchNN [11] are limited to measuring accuracy, where CortexSuite is constrained to perception and cognition while BenchNN shows the value of machine learning for approximate computing, based on 5 of the 12 recognition, mining, and synthesis applications from the PARSEC benchmark suite. DjiNN and Tonic [35] focus on deep learning tasks for warehouse scale computers, including image processing, speech processing, and natural language processing. While Kaggle (www.kaggle.com) is not specifically a benchmark, it hosts a portfolio of data science challenges where the machine-learning community competes with the latest topologies and algorithms for highest accuracy. MLBench [48] compares human-derived learning algorithms against machine-learning services from Amazon and Microsoft Azure.
Performance Benchmarks
DeepBench [20] is probably the most successful suite of microbenchmarks for neural network performance that measures and compares basic compute operations. It individually benchmarks direct convolutions, matrix multiply, and a specific LSTM layer for single-precision and half-precision floating point and, for some operations, 8b fixed-point integer datatypes on hardware architectures. It currently features cloud deployment and some embedded data points on Raspberry Pi and iPhone. It captures the most popular compute patterns; however, it lacks support for lower-precision datatypes and exclusively investigates performance. As such, it does not provide the mechanisms to tie algorithmic modifications back to the application level and does not provide insights into compute performance for reduced-precision representations. DeepBench also does not cover data movement bottlenecks between layers, or potential bottlenecks around buffering state, as required for LSTMs, for example, where capacity and access latency crucially impact overall speed.
There are more general, machine-learning-agnostic, hardware benchmarks such as TPC [82] for the data processing community, SHOC [19], SPEC [72], and STREAM [52]. SHOC looks specifically at how to benchmark heterogeneous hardware systems using OpenCL as design entry. Similarly to QuTiBench, SHOC deploys microbenchmarks combined with application benchmarks and is multi-tiered. SPEC includes a broad range of applications, including graphics, MPI, mail servers, virtualization, and storage, and STREAM exclusively focuses on memory bandwidth. None are specifically designed for machine learning or address the challenges of this application domain. gemmlowp [23], while not a benchmark, is specifically designed for matrix multiply operations; it includes low-precision operations that may be suitable as a basis for implementing part of our benchmark suite.
Summary. Overall, support for algorithmic optimization is limited across the whole spectrum of benchmarks, in particular in regards to quantization and pruning. None of the benchmarks above provide a multi-tiered approach in the same way we do, with tiers that provide understanding of compute and data movement bottlenecks within the system and a theoretical level with efficiency tracking. None of the benchmarks offer a fair comparison for co-designed algorithms and full design space visualization. In Tables 6 and 7, we summarize the application scope of existing benchmarks and our proposed benchmark, as well as the key differentiators between existing benchmarks and our proposal, and discuss in Section 7 how we address these characteristics.
THE BENCHMARK PROPOSAL
The targeted design space is vast and composed of a multidimensional spectrum of algorithmic and architectural co-designed end solutions. The aim of the benchmark is to expose the spectrum of possibilities and accurately reflect the capabilities of the different hardware platforms. QuTiBench has the following key characteristics. First, we take a multi-tiered approach, which is one of our key contributions (Figure 8). We tier the benchmark suite with respect to abstraction levels as well as numerical representations for both training and inference tasks. This not only provides attractive compromises in regards to speed versus minimal discrepancy with target workloads but also brings advantages such as additional system-level insights.
The second key differentiator of our approach is the support for algorithmic optimization by coupling hardware performance with accuracy at the application level. In particular, this allows for objective comparison between floating point implementations and reduced-precision models that can achieve much higher performance at a significantly reduced energy cost, among many other possible optimization strategies. Results are visualized via pareto graphs (accuracy versus latency, throughput, and throughput/power) and optimal solutions can be found along the pareto frontier. Third, we include a theoretical level as a baseline for benchmarking and performance estimation.
The unique characteristics of QuTiBench include test suites at various abstraction levels, algorithmic optimizations and quantization, and in particular considerations in regards to datasets, hyperparameters, and framework challenges such as reproducibility and adaptability (see Reference [53]).
Multiple Tiers-Abstraction Levels. We defined four levels of abstraction ( Figure 8 ) discussed below.
Table 6 (fragment). Application scope of existing benchmarks and QuTiBench; entries are dataset - model (column assignment to individual benchmarks not shown).
NLP, Machine Translation: IWSLT15 - Transformer
NLP, Speech Recognition: Librispeech - DeepSpeech2; TIMIT - DeepSpeech; Librispeech - DeepSpeech2; RNN - WSJ
NLP, Sentiment Analysis: IMDB - Seq-CNN
NLP, Language Modeling: bAbI - Memory Networks
Recommendation, Movies: MovieLens-20M - NCF
Unsupervised Learning
Vision, Feature Extraction: MNIST - Autoencoder
Vision, Adversarial Learning: Downsampled ImageNet - WGAN
Recommendation: no entries
Deep Reinforcement Learning
Game, Go: Go - Mini-Go
Learning, Atari ALE: Atari ALE - Deep Q; Atari2000 - A3C
Level 0-Theoretical. Level 0 records, for all target hardware backends, the theoretically possible peak performance (TOps or GOps), external memory bandwidth (GBps), thermal design power (watts), and cost ($) and, for all models, their compute and memory requirements; data points are shown in Sections 3 and 4. Combining application requirements with hardware platform characteristics can be leveraged for performance predictions using roofline models [88]. Level 0 is a base layer, with results that are available instantly, that provide a target point of reference and guidance for optimization efforts, and that allow us to compute metrics such as achievable compute efficiency. At level 0, we already introduce the notion of performance per datatype operation, which is essential to support quantization as an algorithmic optimization. Two tables are presented in the appendix, one for hardware characteristics and one for neural networks. The hardware table has one row per hardware platform and supported native datatype; a minimum of Half Precision (FP16), Single Precision (FP32), and INT8 are recorded. 4 In the second table, for each CNN, we record four values: total number of compute operations for a single input, the model size, the size of the state, and the total amount of tensors in between layers that require buffering. These values can be used as a basis to derive memory requirements and compute requirements for both inference and training; examples are shown in Figure 5.
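The level 0 prediction described above can be illustrated with a minimal roofline sketch. It assumes the standard roofline formulation, in which attainable performance is the smaller of peak compute and arithmetic intensity times memory bandwidth. The `Platform` record, the function names, and all numeric values are hypothetical placeholders and are not taken from the appendix tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """One row of a level 0 hardware table (all values here are hypothetical)."""
    name: str
    datatype: str
    peak_gops: float     # theoretical peak compute in GOp/s for this datatype
    mem_bw_gbps: float   # external memory bandwidth in GB/s


def roofline_gops(platform: Platform, arithmetic_intensity: float) -> float:
    """Attainable GOp/s: min(peak, arithmetic intensity * bandwidth).

    arithmetic_intensity is given in operations per byte of external
    memory traffic.
    """
    return min(platform.peak_gops, arithmetic_intensity * platform.mem_bw_gbps)


def is_compute_bound(platform: Platform, arithmetic_intensity: float) -> bool:
    """True when the memory roof lies at or above the compute roof."""
    return arithmetic_intensity * platform.mem_bw_gbps >= platform.peak_gops


def efficiency(measured_gops: float, predicted_gops: float) -> float:
    """Fraction of the level 0 prediction that a measurement achieves."""
    return measured_gops / predicted_gops


if __name__ == "__main__":
    # Hypothetical accelerator: 1000 GOp/s INT8 peak, 20 GB/s external memory.
    example = Platform("example-accelerator", "INT8", 1000.0, 20.0)
    for ai in (10.0, 100.0):
        bound = "compute" if is_compute_bound(example, ai) else "memory"
        print(ai, roofline_gops(example, ai), bound)
```

Keeping one `Platform` record per platform and datatype mirrors the one-row-per-datatype layout of the hardware table, so the same function yields separate predictions for FP32, FP16, and INT8.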
Level 0-Roofline Analysis. Using assumptions for where weights, tensors, gradients, weight updates, and state of a neural network are stored, combined with the size of the datatypes used, allows us to derive the arithmetic intensity of a neural network during training and inference. Combined with the roofline for a given hardware platform, we can provide insight as to whether a neural network will be memory or compute bound and guidance for what is theoretically possible (Figure 9).
Fig. 8. A multi-layered approach with precision support.
Level 1-Compute Patterns. Level 1 exposes achievable compute performance for typical compute patterns encountered within neural networks, which equates to popular layers, including convolutions, fully connected layers, recurrent layers, residual layers, and squeeze layers, over a range of dimensions and with different numerical representations (Section 2). These tests are comparable to DeepBench [20], with the significant difference that we provide much broader support for specialized numerical representations. For each of these compute patterns, and for both inference and training, we record the following figures of merit: measured performance (TOps or GOps), latency (ms), power consumption (watts) of the full platform in the embedded space and of the board excluding the host system in the cloud. 5 While level 1 does not capture application-level accuracy, the tests will include verification of functional correctness. The results should reflect achievable compute performance, excluding potential bottlenecks for moving data, which are addressed in level 2. While requiring execution, the tests at level 1 are relatively rapid. We include a sweep over batch and thread sizes.
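As an illustration of how a level 1 figure of merit could be derived, the sketch below converts a measured latency for a convolutional layer into achieved GOp/s. It assumes the common convention of counting one multiply-accumulate as two operations and ignores bias and activation costs. The helper names and the layer dimensions are hypothetical, and `run_layer` stands in for whatever backend call executes the layer.

```python
import statistics
import time


def conv2d_ops(c_in, c_out, kernel, h_out, w_out):
    """Operations for one square-kernel convolution on a single input.

    Assumes 1 multiply-accumulate = 2 operations; h_out and w_out are the
    output feature map dimensions, so stride and padding are already
    reflected in them.
    """
    macs = c_in * c_out * kernel * kernel * h_out * w_out
    return 2 * macs


def median_latency_ms(run_layer, repeats=10):
    """Median wall-clock latency in ms of a callable over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_layer()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def achieved_gops(ops_per_input, batch, latency_ms):
    """Achieved GOp/s for a batch that completed in latency_ms."""
    return ops_per_input * batch / (latency_ms * 1e-3) / 1e9


if __name__ == "__main__":
    # Hypothetical 3x3 convolution, 64 -> 64 channels, 56x56 output.
    ops = conv2d_ops(c_in=64, c_out=64, kernel=3, h_out=56, w_out=56)
    # Sweep over batch sizes with made-up latencies in place of measurements.
    for batch, latency_ms in ((1, 1.0), (8, 5.0)):
        print(batch, achieved_gops(ops, batch, latency_ms))
```

In a real sweep the latency would come from `median_latency_ms` wrapped around the platform-specific layer invocation for each batch or thread count.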
Level 2-Compute and Data Movement. Level 2 is composed of simple combinations of level 1 tests and can thereby effectively capture potential bottlenecks such as tensor movement between layers, as well as storage requirements. It considers stacks of level 1 layers and only includes a subset of all possible combinations to keep test time to a minimum. We include mixed precision between layers in these small template stacks for both inference and training. Figures of merit are identical to level 1. In particular, the latency variation between level 1 with single fused layers and level 2 with layer stacks will bring insight into data movement and buffering bottlenecks.
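The comparison between level 1 and level 2 latencies can be sketched as follows. This is only one possible way to quantify the discrepancy, assuming that the layers of a stack execute sequentially so that the level 1 latencies can be summed; the function names and the latency values are hypothetical.

```python
def stack_overhead_ms(level2_latency_ms, level1_latencies_ms):
    """Latency of a layer stack beyond the sum of its isolated layers.

    A positive value hints at data movement or buffering cost between
    layers; a negative value can occur when the backend fuses layers
    across the stack.
    """
    return level2_latency_ms - sum(level1_latencies_ms)


def overhead_fraction(level2_latency_ms, level1_latencies_ms):
    """Overhead expressed as a fraction of the measured stack latency."""
    overhead = stack_overhead_ms(level2_latency_ms, level1_latencies_ms)
    return overhead / level2_latency_ms


if __name__ == "__main__":
    # Made-up latencies (ms) for three layers measured alone and as a stack.
    isolated = [0.8, 1.1, 0.9]
    stacked = 3.5
    print(stack_overhead_ms(stacked, isolated))
    print(overhead_fraction(stacked, isolated))
```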
Level 3-Applications. Application coverage is essential to offer space for algorithmic innovation, which can achieve superior system-level performance and can only be validated when combined with application results. As such, achieved accuracy, independent of the neural network, becomes the bar for normalizing results. We include the initially planned datasets and models (Table 6), taken from existing benchmarks, and complement these with models that have been found to work well with pruning and quantization optimizations. Furthermore, contributors are welcome to provide different models for given machine-learning tasks. See the appendix for the complete list.
For inference, we include performance measurements for a single image. The error rate is the reported test error over the whole test dataset. For training, we report throughput, training time (latency), and power for a single image as well (including correctness tests). We also provide measurements over longer training sequences with specific accuracy targets, for example, the complete training time to reach 90% top-5 accuracy for ImageNet classification with a ResNet50. Finally, we offer the option to optimize the training algorithm and network and record all possible data points in a multi-dimensional graph; for those it is essential to include development time. Similar concepts are being applied in MLPerf and Request [15, 57]. There is no single criterion that decides whether one solution is optimal, as different figures of merit apply to different use cases. All combinations yield different tradeoffs within the multidimensional design space. As such, we present all solutions and measurements within multi-dimensional figures, whereby the pareto frontier represents the best possible compromises (Figure 1).
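The notion of a pareto frontier over recorded results can be made concrete with a small sketch. It assumes two figures of merit where higher is better (accuracy and throughput); a lower-is-better metric such as latency could be handled by negating it first. The `Result` record and all data points are hypothetical and do not correspond to measured values.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    """One recorded solution: platform, network, and precision combined."""
    label: str
    accuracy: float     # higher is better, e.g. top-5 accuracy in percent
    throughput: float   # higher is better, e.g. images per second


def dominates(a: Result, b: Result) -> bool:
    """a dominates b if it is no worse in both metrics and better in one."""
    no_worse = a.accuracy >= b.accuracy and a.throughput >= b.throughput
    better = a.accuracy > b.accuracy or a.throughput > b.throughput
    return no_worse and better


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep every result that no other result dominates."""
    return [r for r in results
            if not any(dominates(other, r) for other in results)]


if __name__ == "__main__":
    # Made-up data points for illustration only.
    points = [
        Result("A", 90.0, 100.0),
        Result("B", 85.0, 200.0),
        Result("C", 84.0, 150.0),  # dominated by B
        Result("D", 92.0, 50.0),
    ]
    print([r.label for r in pareto_frontier(points)])
```

The quadratic scan is adequate for the modest number of data points a benchmark run produces; it also extends directly to more than two metrics by widening the comparison in `dominates`.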
Algorithmic Optimizations Including Quantization. This benchmarking proposal opens up the opportunity for algorithmic innovations. We include in this pruning and topological changes, while initially focusing on quantization and numerical representations. For this, we include, on every level of the benchmark, several numerical representations, including FP32, FP16, INT8, BIN, and TERN, and allow for arbitrary choices to be included, for example, Microsoft's custom floating point [40] . Training each neural network with different quantization approaches and different and potentially esoteric numerical representations is highly time intensive. Therefore, careful logging of trained quantized models is a high priority for level 3.
Frameworks and Datasets. Datasets are a key input to the benchmark and impact accuracy results. We rely on open source datasets exclusively. Framework support is expected to be one of the biggest challenges since each framework is directly connected with a neural network and datasets within an application context and models are not necessarily portable. Therefore, we need operational hardware backends for a diverse set of AI accelerators, which may or may not be available. Furthermore, quantization is not necessarily mainstream in frameworks. It is not yet clear to what extent cross compilation tools such as TVM [83] can help, while exchange formats such as ONNX [65] are still immature, lack adoption, and, very importantly, lack full quantization support. Training scripts exposing all hyperparameters, training initializations, and so on, must be fully logged, as they can have significant impact on accuracy.
Power and Energy. To represent power and energy cost, we only report platform power measured at the socket. While this is not necessarily accurate, there are strong reasons behind this choice. First, the measurement needs to be fair; therefore, we believe subsystems, including memory, specifically need to be taken into account. Second, more detailed current sampling may be available on some platforms, but each platform comes with different interfaces and may or may not provide access to all power rails. While the accuracy of typical socket power meters is around 10%, we found that these results remain representative of the systems. Furthermore, we average the results over 10 measurements.
Another consideration is whether to consider power or energy per frame. We settled on using absolute power consumption since, when multithreading or batching is applied, it is hard to derive a representative number for energy, and the number would differ depending on whether the end application is latency or throughput driven. Finally, idle power with these platforms can represent a significant percentage of the overall power budget and would therefore cloud the observation. In particular, one FPGA platform is an evaluation board with many peripherals, which is reflected in high idle power (19.9W) compared to the GPU (between 3.4 and 5.0W depending on operating mode), while the additional dynamic power consumption is minimal and yields the FPGA overall as the more efficient platform despite the initial load.
Testbeds, Reproducibility, and Recorded Measurements. To provide useful scientific results, all experiments and measurements must be validated and reproducible. Specifically:
• All input data to the test suites must be openly accessible.
• Many platforms can be made available through virtualized compute environments, which is adequate if the cost is not prohibitive. However, some platforms may not be available. Therefore, an open testbed may be advisable and considered as an extension to this benchmark.
• As the higher levels of benchmarks may require a long time to run and hardware may not be available, we advocate recording of results, whereby each entry will be validated by a third party such that results are guaranteed to be (a) reproducible and (b) correct.
Our colleagues in the Request Tournament effort [68] leverage ACM's rigorous artifact evaluation technology and the Collective Knowledge Workflow Framework [15] and do an outstanding job addressing this. We aim to adopt the same principles.
Adaptability. Machine learning is currently a highly dynamic field, and specific algorithms may become very quickly outdated and new models may emerge and take over rapidly. We plan to adapt fast and add/retire models as machine-learning science matures.
EXPERIMENTAL RESULTS AND EVALUATION
We present measured results aimed at evaluating the defined benchmarking tests and figures of merit to ensure that they accurately reflect a system's capabilities. For test platforms, we used the Nvidia TX2 GPU and the Xilinx ZCU104 FPGA. For both platforms, we carried out all levels of tests on one specific machine-learning task, ImageNet classification, for two different neural networks, GoogleNetV1 and ResNet50. We use FP32, FP16 (supported by the GPU), and INT8 (supported by the FPGA) as numerical representations, a form of algorithmic optimization. We run GPU platforms with a spectrum of batch sizes and different operating modes (MaxN, MaxQ, MaxP), which are optimized for different performance and power consumption targets. 6 For FPGAs, there is a spectrum of implementations available. We exercise the Deephi DPU overlay, which uses threads instead of batch sizes to achieve high system utilization, and therefore exercise a spectrum of thread counts. For FPGAs we show the theoretical limits of the current implementation (which is clocked at 666MHz), as well as the datasheet peak performance at 750MHz. For GPUs, we use the theoretical peak as dictated by the clock frequencies defined by the operating mode. Full experimental results are provided in the appendix. So far, we have only exercised inference to validate the benchmark methodology. In the following, we evaluate each benchmarking level individually and then provide a first critical review of these early results.
Level 0. Using values for hardware platforms and arithmetic intensity (AI) we created rooflines for the target platforms and performance predictions for both networks. 7 Figure 9 shows that both NNs will be compute bound for INT8, FP16, and FP32. The arithmetic intensity should be higher for larger batch sizes (batch size of 1 is shown), but the performance prediction for larger batch sizes will be identical. The theoretical performance prediction can be derived from this and is summarized in Table 8 . These numbers are used to compute efficiency for levels 1, 2, and 3.
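The observation that arithmetic intensity grows with batch size while the prediction stays the same can be reproduced with a simple model. The sketch assumes that weights are fetched from external memory once per batch while activations are moved once per input; this is only one of several possible storage assumptions. All names and numbers are hypothetical and are not the values behind Figure 9 or Table 8.

```python
def arithmetic_intensity(ops_per_input, weight_bytes,
                         activation_bytes_per_input, batch):
    """Operations per byte of external memory traffic for one batch.

    Assumes weights are read once per batch and activations once per input.
    """
    total_ops = ops_per_input * batch
    total_bytes = weight_bytes + activation_bytes_per_input * batch
    return total_ops / total_bytes


def predicted_gops(peak_gops, mem_bw_gbps, ai):
    """Roofline prediction: min(peak, arithmetic intensity * bandwidth)."""
    return min(peak_gops, ai * mem_bw_gbps)


if __name__ == "__main__":
    # Hypothetical network: 8 GOp per input, 50 MB weights, 10 MB activations.
    ops, weights, acts = 8e9, 50e6, 10e6
    # Hypothetical platform: 1000 GOp/s peak, 20 GB/s external bandwidth.
    for batch in (1, 8, 64):
        ai = arithmetic_intensity(ops, weights, acts, batch)
        print(batch, round(ai, 1), predicted_gops(1000.0, 20.0, ai))
```

Once the network is compute bound at batch size 1, any further increase in arithmetic intensity moves the operating point further along the flat part of the roofline, so the predicted performance does not change.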
Level 1 and Level 2. We restrict the evaluation of level 1 and level 2 to ResNet50, as this is sufficient to make the key observations. The ResNet50 topology is relatively regular in structure, consisting of a top convolutional layer with pooling combination, 16 residual blocks, and a fully connected layer. Each residual block is composed of thresholding layers, convolutions, and elementwise additions. As the convolutions account for the majority of the compute, we focus mainly on the convolutional layers of the network. Since the platform-specific frameworks perform layer fusion as network optimization, level 1 represents the smallest possible fused layer structure. Table 16 shows level 1 and level 2 latency results for one TX2 hardware configuration (MaxN, FP16) with different batch sizes as well as level 1 results for ZCU104 with different thread numbers. We restrict level 1 to convolutions of different sizes and select the residual layers res2a, res3a, res4a, and res5a to get an overview over the whole network. Level 2 results are provided for all residual layers of the network. Due to limited support by the hardware-specific framework, it is not possible to benchmark level 2 on FPGA platforms. We observe a large discrepancy in execution time for different residual stacks, even though the compute requirements within each are similar. It is likely that data movement varies significantly depending on the incoming and outgoing tensor dimensions. Therefore, it is important to include as many layer types as possible in level 1 and 2 testing. We would expect this to be even more pronounced for other topologies, as they may be less balanced than ResNet50. We also observe a large discrepancy between the performance of different convolutional layers (Table 16, level 1). Unlike the residual blocks, this is anticipated, as they come with very different compute requirements. Furthermore, the differences are more pronounced with larger batch size. It is therefore our plan to include the full spectrum of convolutional layers within level 1.
Multi-Tiered Concept. Figure 10 depicts the performance measurements of the various levels. We restricted the visualized experiments to the MaxN, FP16 configuration on TX2, and a subset of microbenchmarks on level 1 and level 2 for a spectrum of batch sizes. Note that the theoretical peak performance is significantly higher than measured performance and is only within reach of individual layers that fit the hardware architecture well. The system (level 3) achieves from 41.1 to 60.7% efficiency, where larger batch sizes achieve higher performance. Level 2 results are on average lower than achieved performance (level 3); they are a fairly good approximation, within 16%, of the achievable level 3 system performance but are far off level 3 compute performance. Level 1 results usually show better performance than the level 2 results. This intuitively makes sense, as a limited number of bottlenecks are exposed during execution of the benchmark. In particular, less weight storage is required, which is most likely contained on-chip, thereby alleviating any potential memory bottlenecks. The averaged level 1 results also provide a good estimation of possible compute performance on level 3. As already mentioned, for level 1 and 2 results, we observe large variations in performance ranges for different dimensions of convolutions. The insight is that to provide a good projection from level 1 or level 2 to level 3, we need to provide full coverage of convolutional layers. Another challenge is that many backend tools perform automated layer fusion such as merging batch normalization with convolutions, which makes testing in isolation inaccurate.
Level 3-Full system-level performance evaluation. The aim of level 3 is to explore optimal solutions within the design space regarding application performance independent of model topology and algorithmic optimizations. We include results for both platforms (TX2, ZCU104), for INT8, FP16, and FP32, across the spectrum of batch sizes and thread numbers for both GoogleNetV1 and ResNet50. See plots of pareto points (Figure 11) and results in the appendix. We made the following key observations: First, the ZCU104 FPGA provides the highest system-level (948GOPs) and compute-level performance (1067GOPs) compared to the GPU platform (809GOPs and 1011GOPs, respectively) for both GoogleNetV1 and ResNet50 (Figure 11, top left). For GoogleNetV1, the FPGA provides better performance and accuracy. For ResNet50, the FPGA provides better performance but lower accuracy compared to the GPU platform. Further, the GoogleNetV1 topology provides more than 2× the performance compared to ResNet50, due to the significantly lower compute per frame required as part of the neural network topology, while ResNet50 provides the best accuracy across the platforms. The accuracy difference is 1.59% for the FPGA and 4.27% for the GPU (Figure 11, top left). Additionally, the ZCU104 outperforms the TX2 in regards to latency by orders of magnitude and across topologies unless GPUs operate with small batch size, where the performance efficiency drops. GPU latency varies from a minimum of 8ms to a maximum of 1838.5ms for batch=128. FPGA latency varies from 9.65 to 65ms. Finally, the GPU platform is more power efficient, which can be attributed to the GPU platform being more optimized, whereas the FPGA platform is more general purpose. This is apparent when considering idle power (Section 7, 5W for TX2 and 19.9W for ZCU104).
In this evaluation, we consider full system-level performance (Figure 11), including initial data movement as well as compute-only performance. Depending on the end application, it may be important to factor out the initial data movement from the overall time, as the inference engine might be included in a larger compute data path, where the inputs are streamed directly from on-chip resources. However, when analyzing the experimental data points for both GPU and FPGA platforms, it appears that the difference is very regular in nature, and it is not obvious that a distinction within the benchmark is necessary (see the appendix) as long as it is clearly indicated what is measured. The pareto curves are an effective means to compare different topologies and different platforms, leaving space for algorithmic optimizations. We plan to leverage three- or four-dimensional graphs to additionally explore relationships between latency and system-level performance.
CONCLUSION AND FUTURE WORK
Neural networks are fast gaining popularity across an increasing number of applications. However, they are accompanied by challenging compute and memory requirements, as shown in Section 3, which seriously challenge a semiconductor industry that is already facing performance scalability issues. This is of particular importance for embedded computing environments, where real estate, power, and available compute and memory resources are at a premium. As such, the industry is turning to both algorithmic innovation in the form of new topologies, quantization, and pruning strategies, as well as architectural innovation with more and more heterogeneous devices and the emergence of specialized DPUs. To facilitate better insights into the increasingly complex space of end solutions that involve hardware-software codesign and to evaluate new concepts in computer architecture, novel NN system benchmarks are needed.
QuTiBench is a proposed novel benchmarking methodology to help drive hardware innovation and provide insights for system-level designers in understanding possible performance accuracy tradeoffs for newly devised and fine-tuned algorithms combined with highly customized accelerators. Key contributions are that we provide concepts that allow benchmarking of highly optimized algorithms by tying hardware characteristics back to the end application, thereby providing the needed algorithmic freedom. Another key differentiator in this benchmarking concept is the introduction of the multi-tiered approach, including a theoretical level and consideration of a spectrum of numerical representations at all levels. As such, the benchmark can provide insights at various abstraction levels. This brings two key advantages: (a) It provides a spectrum of insights, and users can choose from instant but perhaps crude results to elaborate results that require longer evaluation, and (b) the multi-tiered approach provides insights into system bottlenecks. For example, are the recurrent or the fully connected layers the challenge? Or is the bottleneck the data movement in between? We present initial experimental results on two types of neural network topologies aimed at image classification tasks and exercise them on two different types of hardware platforms for all levels of the proposed benchmarks. We present some of the lessons learned while exercising the benchmarks and challenges encountered and analyze the quality of the results in regards to real system performance at the various levels.
This effort is just beginning. Future work will focus on refining details and running broader experimentation. We plan to expand on level 0 results first and build out test suites targeting FPGAs, GPUs, CPUs, and DPUs within the embedded space. Many concepts regarding reproducibility need to be refined, as well as automated software testing infrastructure as proposed by deep500.org. Also collaboration with larger efforts such as MLPerf will be beneficial to gain traction. We invite the research community to contribute to QuTiBench.
A APPENDIX: TABLE OF RESULTS
In this section, we provide additional detailed data points: Table 9 provides an overview of planned applications, datasets, and models. Table 10 shows level 0 data for QuTiBench. The difference between system-level and compute performance is visualized in Figure 12. Tables 11 and 12 summarize level 1 performance measurements, and Tables 13, 14, and 15 show level 2 and 3 measurements, respectively. Finally, Table 16 lists the latency discrepancy between different convolutional and residual layers for level 1 and 2.
