Neural Networks have become one of the most successful universal machine learning algorithms. They play a key role in enabling machine vision and speech recognition, and are increasingly adopted in other application domains. Their computational complexity is enormous and comes along with equally challenging memory requirements, both in regards to capacity and access bandwidth, which limits deployment in particular within energy constrained, embedded environments. In order to address these implementation challenges, a broad spectrum of new customized and heterogeneous hardware architectures has emerged, often accompanied by co-designed algorithms to extract maximum benefit out of the hardware. Furthermore, numerous optimization techniques are being explored for neural networks to reduce compute and memory requirements while maintaining accuracy. This results in an abundance of algorithmic and architectural choices, some of which fit specific use cases better than others.
INTRODUCTION
Over the last several years, neural networks (NNs) have become incredibly successful. A huge variety of neural networks is increasingly deployed in conjunction with robotics, advanced driver assistance systems (ADAS), security monitors and many other applications. Furthermore, as they have the theoretical property of being a universal approximator which requires zero domain expertise, they are increasingly applied to previously unsolved problems, and sometimes replace existing algorithms, unless of course the original algorithm is of much lower complexity. Note that the applications listed above are all embedded applications, and there is an increasing interest in training as well as inference in such environments. The challenge of deploying these networks lies in their compute and memory intensity, which poses the largest barrier to adoption, particularly within the embedded space where compute resources, power and memory are at a premium. Inference often requires billions of operations, and training of modern networks, which have tens of millions of parameters, involves tens of single-precision exaflops to converge [4]. The interest in applying these techniques in energy constrained environments has spawned a rise in algorithmic and architectural innovation. Algorithmic optimizations include topological transformations with pruning and compression schemes. In addition, the general trend towards transprecision computing [51, 78] can be nicely exploited within this particular application context. Extreme reduced precision neural networks, for example, which take datatypes down to ternary or even binary representations, can bring significant hardware cost savings with minimal accuracy impact, as visualized in Fig. 1 [8]. Architectural innovation is showcased by Google's TPU [41], numerous start-up companies such as Nervana, Graphcore, Groq, and Cerebras, as well as a spectrum of reconfigurable accelerators leveraging FPGAs.
Each of these architectures brings its own inherent benefits. Overall, it is becoming increasingly difficult to predict which architecture will deliver what performance for which particular neural network.
This poses the key challenge that we address with our benchmark suite.
Benchmarks at their core encompass a suite of tests for evaluating performance or level of quality. When done well, benchmarking creates clarity by establishing fair baselines and providing representative comparisons between different platforms and compute fabrics. They act as the antidote to product marketing and provide system designers with a toolbox to avoid making poor choices where end systems fail to meet requirements such as throughput, power or cost, and delay product launch. The benefits of a good benchmarking suite go beyond this and provide insights from all perspectives. Benchmarks can be of high benefit to hardware designers as well as end users. Benchmarks drive optimizations for semiconductor companies who are customizing compute fabrics for deep learning applications, and for end users standardized tests help drive optimal purchasing choices. Finally, for newcomers to the domain, benchmarking suites can offer objective summaries that introduce key figures of merit and basic choices as well as setting expectations of the state of the art. This is an extremely complex design space to visualize, as shown in Fig. 2. There are numerous machine learning applications, and each of these can be trained with different datasets and different neural network models and variations, and depending on these factors (as well as numerical representations, learning techniques and hyperparameter selection) can produce different results, the key figure of merit being test error rate or conversely, accuracy. There are numerous choices with different hardware platforms within the cloud and IoT spaces and everywhere in between. All of the implementation alternatives will deliver different performance in tera or giga operations per second (TOP/s or GOP/s), response time, power consumption, cost and required development effort.
Within this space there are two main types of benchmarks: Machine Learning (ML) benchmarks and performance benchmarks. ML benchmarks are typically aimed at achieving low test error, independent of the hardware implications, therefore being of limited efficiency. Examples are the ILSVRC ImageNet competition, as well as more sophisticated efforts such as MLBench [48]. Performance benchmarks are agnostic of the target application, measuring performance characteristics such as throughput and power for characteristic compute patterns. Even when tailored towards characteristic ML workloads, they do not capture the fact that for different hardware architectures, different compute patterns should be used. Most importantly, they do not correlate their results regarding algorithmic optimization back to the application level target, which is accuracy, and therefore do not provide the necessary freedom and scope for algorithmic modifications, an essential ingredient to extracting performance out of heterogeneous computing systems.
In this paper we present QuTiBench, a benchmarking suite that lies at the intersection of the machine learning and hardware communities and spans the full design space.
QuTiBench couples neural network performance with hardware performance and as such can provide insights as to what is the best possible combination within this design space for specific use cases. Although there are a number of efforts emerging in this space, such as DeepBench and MLPerf, there is currently no comprehensive benchmarking suite in existence that addresses the scope of what is needed, and in particular targets embedded systems.
QuTiBench is unique in the way we support quantization (Qu), which is an important optimization technique for neural networks and is leveraged by many specialized hardware architectures. Furthermore, QuTiBench provides multiple tiers of tests (Ti), which can provide deep insights for the composition of complex systems and provide tradeoffs between speed and accuracy across a broad range of systems. The main contribution of this paper is the definition of QuTiBench, which has the following unique features:
• It is a multi-tiered approach that supports a range of compromises for benchmarking in regards to quality of prediction and effort. In particular, QuTiBench supports theoretical results as a measuring stick, different computational patterns for different neural networks, and combinations of microbenchmarks and full applications for addressing the end user design space.
• It supports algorithmic optimizations and levels of development effort including naive and optimized implementations, by correlating everything at the application level's figures of merit.
• In particular, QuTiBench supports different approaches to quantization at all levels, which is essential for efficient, low power architectures.
• It supports a broad range of applications, both inference and training, and available systems from cloud to IoT.
QuTiBench is still in its early stages. We hope the community will help make this a valuable contribution to the Machine Learning field. In this paper we provide the first analysis of theoretical compute and memory requirements for both applications and candidate hardware platforms, which forms level 0 of our benchmark suite. We present initial experimental results to validate the benchmarking methodology, as well as outline plans for the remaining levels. The remainder of this article is structured as follows: We start with background on neural networks. Sec. 3 analyses the compute and memory requirements of a broad selection of networks. Sec. 4 provides details on different hardware architectures and how inference and training workloads can be mapped to them. This provides insights into the spectrum of implementation choices and how they are represented within the benchmark suite. In Sec. 5 we take a closer look at the key components, characteristics and challenges of a benchmarking suite in Machine Learning. Sec. 6 describes existing efforts in this space and Sec. 7 introduces the key concepts of QuTiBench. We evaluate our approach with experimental results in Sec. 8. Sec. 9 concludes the article and presents future directions. Full experimental results can be found in the appendix.
BACKGROUND ON NEURAL NETWORKS
This effort focuses on neural networks (NNs), a class of machine learning algorithms that forms a subclass of artificial intelligence. With their property of being universal approximators [18], NNs increasingly outperform and replace existing algorithms. NNs can also provide automation for previously unsolved applications, where no algorithms exist. No domain expertise is required, just sufficiently large datasets together with a sufficiently large topology for the network to train for a given accuracy target. These factors contribute to the popularity of NNs. The design space (see Fig. 3) is complex. For every application there are many different types of NNs, and new algorithms continue to evolve. Furthermore, different types of datasets can be used. The resulting combinations can achieve different accuracy targets, and are accompanied by different compute requirements. Also, a neural network model is always paired with the particular framework in which it was trained, which can have an impact on the accuracy. There is a large application space for neural networks (see Table 1) with domains ranging from vision to natural language processing (NLP) to gaming and recommendation systems. In each domain, there are numerous tasks which are amenable to neural networks; for example, within the vision processing context: image classification, object detection, and semantic segmentation. Furthermore, these models can be trained using different training techniques. Note that it is not easy to define clear categories as terms overlap. For example, deep reinforcement learning techniques can be applied to any network. Seq2Seq networks are a full family of networks, while ResNet50, VGG, and InceptionV3 refer to specific topologies. Table 1 shows the pool of candidate neural networks that we plan to use as part of our benchmark, including both inference and training.
While there is a large breadth of neural networks, there are many common layer types being used, which are ideal to form levels 1 and 2 of QuTiBench. These layer types equate to the basic computational patterns and are based on previous analysis [1]. The most popular compute layers are fully connected, convolutional, pooling, normalization and recurrent layers. These come with very different compute and memory requirements and are briefly discussed here. A more detailed description can be found in [75]. Fully connected layers compute the full cross product between input tensors (for example) and a vector of weights; the latter are determined during training. Summed with a bias, this is then fed into an activation function. Popular activation functions include the hyperbolic tangent function and the rectified linear unit (ReLU). In convolutional layers, the output receives inputs from a small receptive field of the previous layer.
This approach greatly reduces the number of parameters (or weights) involved and allows local features (e.g., edges, corners) to be found [47]. A basic 2D convolutional layer is similar to a fully connected layer except that: a) each neuron receives an image as input and produces an image as its output (instead of a scalar); b) each synapse learns a small array of weights which is the size of the convolutional window; and c) each pixel in the output image is created by the sum of the convolutions between all synapse weights and the corresponding images. Recurrent layers are characterized by the fact that they contain state over a sequence of input data. There are many different options for the implementation of the recurrence within the layer, starting from simple recurrent layers, to GRUs or LSTM layers, which can be uni- or bidirectional, feature different numbers of feedback gates, and may include numerous specializations such as peepholes and CRCs. Beyond these basic layer types, there are many layer combinations emerging, such as inception layers in GoogLeNet [76, 77], residual layers in ResNet models [36], and so-called fire modules [38]. During training using backpropagation with stochastic gradient descent, we need to compute the derivative with respect to all inputs for these layers. This works out to be similar in compute patterns to inference with transposed versions of the inputs, whereby a significantly larger amount of compute and memory is required [75]. However, additional compute such as batch normalization needs to be addressed.
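To make the two basic layer types concrete, the following is a minimal pure-Python sketch of a fully connected layer and of a basic 2D convolutional layer as characterized in points a) to c) above. The function names, the nested-list tensor layout, the square kernel, the stride of 1 and the absence of padding are all assumptions made for illustration; like most frameworks, the sketch computes a cross-correlation rather than a flipped-kernel convolution.

```python
def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)

def fully_connected(inputs, weights, biases, activation=relu):
    """weights[j][i] connects input i to output neuron j (assumed layout)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def conv2d(images, kernels):
    """images[c][y][x] are the input images; kernels[o][c][ky][kx] holds one
    small weight array per (output neuron, input image) pair.
    Assumes square kernels, stride 1 and no padding."""
    c_in = len(images)
    h, w = len(images[0]), len(images[0][0])
    k = len(kernels[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    outputs = []
    for kernel in kernels:                      # a) one output image per neuron
        fmap = [[0.0] * out_w for _ in range(out_h)]
        for y in range(out_h):
            for x in range(out_w):
                acc = 0.0
                for c in range(c_in):           # c) sum over all input images
                    for ky in range(k):         # b) window-sized weight array
                        for kx in range(k):
                            acc += kernel[c][ky][kx] * images[c][y + ky][x + kx]
                fmap[y][x] = acc
        outputs.append(fmap)
    return outputs
```

The convolution reuses the same small weight array at every output pixel, which is why its parameter count is far lower than that of a fully connected layer operating on the same image.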
Optimization Techniques
As mentioned in the introduction, the challenge lies within the compute and memory requirements which can often preclude inference deployment within the IoT context. To alleviate the computational burden and maximize performance, many optimization techniques have been introduced.
Particularly successful techniques include pruning, compression, low rank approximations and quantization [33]. We discuss quantization, a specific focus of this work, and pruning in more detail below. All of these techniques fall under the category of algorithmic optimizations. A representative benchmark supports and measures these, as they are essential for viable deployment solutions.
Quantization & Numerical Representations. Transprecision computing is making strides in many application domains [51, 78], and is highly effective for neural network inference, in particular quantization to reduced precision datatypes, including 8-bit fixed point integers and below, as well as custom floating point formats. For example, quantized neural networks (QNNs) have been shown to work extremely well. On smaller image classification benchmarks such as MNIST, SVHN and CIFAR-10, QNNs achieve state of the art accuracy despite reduction in precision [17, 93], even for partial or full binarization of fully connected and convolutional layers, while for the more challenging ImageNet benchmark, there is a small but noticeable accuracy drop. XNOR-Net [67] applies convolutional BNNs to the ImageNet dataset with topologies inspired by AlexNet, ResNet and GoogLeNet, and reports top-1 accuracies of up to 51.2% for full binarization and 65.5% for partial binarization. The resulting solution can run significantly faster in hardware and might still pose an attractive design trade-off. Furthermore, there is significant evidence that increasing network layer size can recuperate this drop in accuracy [27, 44, 56, 74, 91]. New quantization schemes show promising results, using for example Half-wave Gaussian Quantization (HWGQ) [10] to take advantage of the Gaussian-like distribution of batch normalized activations. Furthermore, new training and optimization techniques [55, 96] work effectively. The current lowest error rates for ImageNet classification have been achieved using ternarization [3, 94] as shown in Table 2.
Quantization has been successfully applied to other tasks including 3D object recognition, facial expression recognition [50, 73], optical character recognition as well as speech [31, 49, 70]. Even in training, research shows that 32 bits are not really needed given the typical value ranges for weight and activation gradients and weight updates involved. Fixed point integers, half precision floating point (FP16), bfloat16, flexpoint or block floating point representations show state-of-the-art performance [30, 45, 54, 89]. All of these need to be accurately reflected within the tests.
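As an illustration of what these reduced precision representations mean at the level of individual values, the sketch below shows three simple quantizers: symmetric uniform quantization to signed integers, binarization and ternarization. It is not the scheme used by any of the cited works; the function names, the per-tensor scale factor and the fixed ternarization threshold are assumptions chosen for brevity.

```python
def quantize_uniform(values, bits=8):
    """Symmetric uniform quantization to signed `bits`-bit integers.
    Returns the integer codes and the scale needed to map them back."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

def binarize(values):
    """Binary representation: keep only the sign (+1 or -1)."""
    return [1 if v >= 0 else -1 for v in values]

def ternarize(values, threshold):
    """Ternary representation: values below the threshold become 0."""
    return [0 if abs(v) < threshold else (1 if v > 0 else -1) for v in values]
```

With 8 bits the reconstruction error per value is bounded by half the scale, whereas binarization and ternarization discard magnitude almost entirely, which is why they typically rely on retraining and, as noted above, sometimes on larger layers to recover accuracy.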
Pruning. This is another popular optimization which has been shown to dramatically reduce memory requirements, through either synaptic pruning or filter pruning. When synaptic pruning is leveraged, irregular compute patterns result which impact memory access efficiency; thus hardware architectures require support for sparse matrix representations to benefit from this [31]. Filter pruning yields regular compute patterns and thereby benefits a broader selection of platforms [33].
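The difference between the two pruning styles can be sketched in a few lines. The example below uses magnitude as the pruning criterion, which is a common choice but an assumption here, as are the function names and the flat-list representation of a filter. Synaptic pruning leaves zeros scattered through the weight matrix, while filter pruning removes whole filters and leaves a smaller dense structure.

```python
def prune_synapses(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of individual
    weights. Ties at the threshold are pruned as well, so the resulting
    sparsity can be slightly higher than requested."""
    flat = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return [list(row) for row in weights]
    threshold = flat[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

def prune_filters(filters, keep):
    """Keep the `keep` filters with the largest L1 norm, preserving their
    original order. Each filter is given as a flat list of weights."""
    ranked = sorted(range(len(filters)),
                    key=lambda i: sum(abs(w) for w in filters[i]),
                    reverse=True)
    kept = sorted(ranked[:keep])
    return [filters[i] for i in kept]
```

The output of `prune_synapses` has the same shape as its input and only pays off on hardware that can skip zeros, whereas the output of `prune_filters` is simply a smaller dense layer that any platform can execute.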
NEURAL NETWORKS AND THEIR COMPUTE AND MEMORY REQUIREMENTS
We analyze neural networks with regards to their arithmetic compute, intermediate storage requirement and memory footprint. While actual hardware requirements depend on numerous attributes, at this point we are characterizing the theoretical requirements in an architecturally independent way. For example, actual on-chip requirements and external memory requirements depend on implementation choices, but can be derived directly, so this analysis remains useful for categorization.
Inference. Each NN layer (L0, L1, etc.) requires a specific number of arithmetic operations $O_{L0}$, $O_{L1}$, $O_{L2}$ in the form of multiplies, additions etc. We measure these in giga or tera operations respectively (GOPs, TOPs). The overall compute of a network with n layers, $O_{total}$, is the sum of the compute in each individual layer (see eq. 1). We define the total model size $W_{total}$ as the sum of the weight requirements per layer measured in millions of elements (ME); this is independent of any choice in numerical representation. The real memory footprint can be derived by multiplying with the size of the given datatype (for example 32b for single precision floating point). We quantify the intermediate buffer requirement $T_{total}$ in an implementation neutral fashion. For this we calculate the sum of the required amount of tensors $T_i$ that precede each layer. These are derived as the product of the feature map dimensions ($w_i$, $h_i$) and the number of channels. Note that all of this applies to non-linear topologies such as DenseNet [37]; however, our models currently do not reflect graph connectivity. We plan to address this in the future.
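Read literally, the definitions above amount to three sums over the layers, $O_{total} = \sum_i O_{Li}$, $W_{total} = \sum_i W_{Li}$ and $T_{total} = \sum_i T_i$; this is a reconstruction from the prose, not a copy of the original equation. The sketch below evaluates them for a toy two-layer network. The per-layer formulas are assumptions: a multiply-accumulate is counted as two operations, convolutions use stride 1 with "same" padding, and biases are ignored. The class and helper names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    ops: int        # O_Li: multiplies plus additions
    weights: int    # weight elements, independent of datatype
    tensor_in: int  # T_i: elements of the tensor preceding the layer

def conv_layer(name, w, h, c_in, c_out, k):
    """k x k convolution; assumes stride 1, 'same' padding, no bias."""
    macs = w * h * c_out * c_in * k * k
    return Layer(name, ops=2 * macs, weights=c_out * c_in * k * k,
                 tensor_in=w * h * c_in)

def fc_layer(name, n_in, n_out):
    """Fully connected layer; assumes no bias."""
    return Layer(name, ops=2 * n_in * n_out, weights=n_in * n_out,
                 tensor_in=n_in)

def totals(layers, bits_per_weight=32):
    """Return (O_total, W_total, T_total, weight footprint in bytes)."""
    o_total = sum(layer.ops for layer in layers)
    w_total = sum(layer.weights for layer in layers)
    t_total = sum(layer.tensor_in for layer in layers)
    footprint_bytes = w_total * bits_per_weight // 8
    return o_total, w_total, t_total, footprint_bytes

# Toy network: one 3x3 convolution on a 32x32 RGB image, then a classifier.
net = [conv_layer("L0", 32, 32, 3, 16, 3), fc_layer("L1", 32 * 32 * 16, 10)]
```

Calling `totals(net, bits_per_weight=8)` instead of the 32-bit default changes only the footprint, which mirrors the point that the element counts are independent of the numerical representation.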
Training. While training is currently the focus in the cloud, we expect that it will become essential in embedded as well, as on-line learning takes off. In regards to requirements, we need to consider backpropagation in addition to inference. As depicted in Figure 4, training requires additional data structures. First of all, symmetrically to the tensors $T_i$, we need to buffer their gradients $TG_i$. Furthermore, so-called weight gradients $WG_i$ need to be stored, which are the derivative (in relation to the input weights) of the gradient $TG_{i+1}$. Depending on the given optimization strategies, weight updates need to be buffered as well. This results in roughly 3 times the buffer requirements for weights, and double the amount for tensors. Regarding compute, backpropagation requires roughly 3 times the inference compute for a single image of the training data set (plus 1 update operation per weight parameter). Overall compute needs to be multiplied with the number of iterations and the number of inputs in the training data set. Note that data dependencies are significantly more intricate and challenging for training. This is currently not reflected within the theoretical analysis.
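The rules of thumb in this paragraph can be turned into a small estimator. The sketch below is one possible reading, under stated assumptions: the factor of 3 is applied to the inference compute of one input and taken to cover the whole training step, one update operation per weight is charged for every input (that is, a batch size of 1), and "iterations" is interpreted as the number of passes over the data set. The function and parameter names are hypothetical.

```python
def training_estimate(o_total, w_total, t_total, n_inputs, n_iterations):
    """Rough training requirements derived from inference figures.

    o_total, w_total, t_total: inference compute (operations), weight
    elements and tensor elements for a single input.
    Returns (total operations, weight buffer elements, tensor buffer elements).
    """
    # ~3x inference compute per input plus one update per weight parameter.
    ops_per_input = 3 * o_total + w_total
    total_ops = ops_per_input * n_inputs * n_iterations
    # Weights, weight gradients and weight updates: roughly 3x the weights.
    weight_buffers = 3 * w_total
    # Tensors and their gradients: double the tensor storage.
    tensor_buffers = 2 * t_total
    return total_ops, weight_buffers, tensor_buffers
```

Such an estimate ignores the data dependencies mentioned above, so it is only a first-order bound on compute and storage rather than a prediction of training time.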
Summary of Requirements. Figure 5 visualizes initial results, where for Seq2Seq models, we assume a sequence length of 3000 (based on the LSTM test case in DeepBench [20]).
The key observations are as follows: First, the compute and memory requirements are on average very high. Mean model size is too big to fit into most on-chip low latency memory (with 71.14 MBytes), and compute is in the GOPs range for every single input datum. Second, there is a significant variation in all requirements for both training and inference, as summarized in Table 3. No simple generalizations can be made, even within subcategories such as image recognition, as models vary greatly depending on size and complexity of images, number of objects to be recognized, etc.
Assuming 8b datatypes for inference and 32b for training.
The defined parameters $O_{total}$, $W_{total}$, $T_{total}$, $OT_{total}$, $WU_{total}$, and $TG_{total}$ help describe the compute requirement for inference and training of each individual network and can be used for baseline computations, taking architectural constraints into consideration, and cross-correlated with roofline models to provide rough performance guidance.
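As a rough illustration of how these parameters could be cross-correlated with a roofline model, the sketch below computes an arithmetic intensity from the operation count and the bytes moved, and caps attainable performance at the lower of peak compute and memory bandwidth times intensity. The assumption that every weight and tensor element is transferred exactly once, the function names and the example platform figures in the usage note are all hypothetical.

```python
def inference_bytes(w_total, t_total, bits=8):
    """Bytes moved for one inference, assuming each weight and tensor
    element is read once at the given datatype width."""
    return (w_total + t_total) * bits // 8

def roofline(peak_ops_per_s, mem_bytes_per_s, ops, bytes_moved):
    """Return (arithmetic intensity in ops/byte, attainable ops/s)."""
    intensity = ops / bytes_moved
    attainable = min(peak_ops_per_s, mem_bytes_per_s * intensity)
    return intensity, attainable
```

For example, `roofline(1e12, 1e10, 2e9, 1e8)` describes a made-up platform with 1 TOP/s peak compute and 10 GB/s memory bandwidth running a workload of 2 GOPs over 100 MB; the intensity of 20 ops/byte makes it memory bound at 0.2 TOP/s. Reducing the datatype width lowers the bytes moved and therefore raises the intensity.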
HARDWARE ARCHITECTURES FOR DEEP LEARNING
We discuss target hardware systems, their architectures and implementation alternatives. While we present details on cloud platforms, the focus of this article is on embedded systems. There is a huge range in the types of hardware architectures used for machine learning applications, including CPUs, GPUs, FPGAs and specialized architectures. The field has spawned significant new research in computer architecture and created so-called deep learning processing units (DPUs), which are specialized for this application domain and can be implemented either with ASICs or in FPGAs. Architectures can broadly be classified by the basic type of compute operation, memory bandwidth, level of parallelism, degree of specialization and inherent precision support. CPUs are widely used for ML applications, and are viewed as serial compute engines, optimized for single thread performance, with implicitly managed memory hierarchies (including three levels of caches), and support floating point operations. GPUs are vector processors that support smaller floating point formats (FP16) natively, most recently fixed point 8-bit integer formats, and have a mix of implicitly and explicitly managed memory. DPUs, such as Google's Tensor Processing Unit (TPU), work with tensors, have explicitly managed and specialized memory hierarchies and support integer operations. With newer generations, the boundaries between different hardware architectures are blurring. CPUs are usually multicore to support parallel processing and incorporate vector processing units, GPUs are adding tensor processing units, and the TPU now supports floating point operations. FPGAs can support any of the above configurations with explicitly managed memory. FPGAs are the most flexible of all target hardware, and can be configured to support any numeric representation, even bit-serial hardware architectures which provide run-time configurable precision.
Custom ASIC implementations, which minimize hardware cost and maximize performance, have emerged to exploit specific precision arithmetic and customized memory systems. Tables 4 and 5 list many of these hardware targets along with published performance numbers. One of the goals of QuTiBench is to provide a more systematic way to compare performance and accuracy between these systems, rather than relying on vendor reported metrics.
NVIDIA GPUs are some of the most popular hardware targets for machine learning, and newer families of chips have been introduced to specifically accelerate this task. For example, the Volta architecture, introduced in 2018, was particularly designed to accelerate AI and incorporates tensor cores as a new feature, as well as improved FP32 and FP64 support for training in a data center setting [22]. AMD announced the Vega GPU [24] with new deep learning instruction set operations, with the goal of obtaining parity with NVIDIA's high-end Tesla V100 datacenter GPUs. Both companies have low power GPUs: the AMD Vega mobile GPU [34] and NVIDIA Jetson TX2 [26].
Google introduced its TPU in 2016 [71], which was designed to accelerate Google's TensorFlow framework. The first generation supported integer arithmetic with a massively parallel 8-bit matrix multiply engine. The second generation TPU was announced in May 2017 [41], and the third generation in May 2018 [80]. These newer chips boast improved memory performance as well as support for floating point specifically aimed at training.
There are a number of startups introducing custom hardware in this space. Within the cloud space, there are Graphcore, Cerebras, Groq, and Wave Computing. Within the embedded space, where the design constraints are even more stringent, we find even more, as listed in Table 5.
Most are secretive about the details of their designs, and this landscape is rapidly changing. Intel is investigating several custom accelerators including Nervana and Movidius. Fathom [7] is Movidius' ultra low power Neural Compute Stick which operates at about 1 Watt. At the extreme, binarized neural networks, which offer very high throughput at extremely low power, are exploited in the following ASICs: BinarEye [58], BNN Custom Fabric [5], Stripes Bitserial ASIC [42], and IBM AI Accelerator [39]. Others exploit sparse computing engines, such as EIE and its successor ESE [31], SCNN [66], Cnvlutin [2], Cambricon-S and Cambricon-X [92].
FPGAs are an extremely popular platform for machine learning. As they are highly flexible, can be used in a variety of different configurations and support any arithmetic format, they can be fully customized towards specific neural network topologies, thereby achieving high performance and efficiency. However, for the same reason, they are extremely difficult to characterize in general. FPGAs are available in the cloud, such as the Xilinx Ultrascale+ VU9P available as part of the public Amazon Web Services (AWS) cloud infrastructure. Within the embedded space, we have pioneered the first binarized neural network accelerators [27, 84] and provided many proof points for customized reduced precision implementations [8]. Umuroglu et al. [86] demonstrate that run-time programmable precision can be achieved with a bitserial approach, providing highly attractive performance on FPGAs, with little overhead. Intel FPGAs have also been successfully applied to machine learning applications using a range of different numerical representations [63].
The Microsoft Brainwave project [14] aims at applying FPGAs at datacenter scale using their own custom floating point representation. Focusing on the IoT market, Lattice has announced binarized neural network libraries targeting low power FPGAs and achieving 1 TOPS/Watt [46].
CHARACTERISTICS & CHALLENGES IN BENCHMARKING

Key Components of a Benchmark
A benchmark can be defined as a set of standards used for evaluating performance or level of quality. A more practical definition implies that the "set of standards" is supplied in the form of a well-defined set of executable tests and measured regarding a specific set of figures of merit. Sometimes additional items are included such as performance analysis or profiling tools which can help shed light on system bottlenecks. Test infrastructure or a testbed can be provided to ensure reproducibility.
This makes particular sense when specialized and not easily available hardware systems are involved. Data management can be handled together with the benchmark suite and stored in an accessible location as for example with DAWNbench [16], MIT's Eyeriss project [25] and the Request tournaments online score card [68]. In this article we differentiate profiling tools, test infrastructure, and measurements from the actual benchmark test suite (see Fig. 6). Somewhat related to benchmarking are modelzoos, such as OpenAI Gym [9] and rllab [21], which are selections of sample code. They are not necessarily aiming to be representative, and typically include simplified implementations to teach concepts.
QuTiBench focuses initially on the benchmark suite and measurements. Benchmarking can bring many insights. For end-users and system designers, it helps to estimate expected system-level performance and provides an understanding of what algorithms work best on which hardware platform. For hardware designers, benchmarks provide design perspectives and clear cut guidelines regarding what figures of merit matter and what workloads look like. Neural networks are pushing the limits of what is possible; therefore careful system level co-design of hardware and algorithms, and realistic expectations of what is achievable given the design choices using benchmarking, are crucial. To bring maximum benefit, the following characteristics are essential, which are discussed in greater detail below:
Characteristics
• representative of common workloads
• supportive of algorithmic modifications
• objective and reproducible
• portable to heterogeneous hardware systems
• complexity vs accuracy tradeoff
• adaptive "living" benchmark supported by industry and academia
Representative. Benchmarks need to be representative of real world workloads. In machine learning, this requires breadth across a spectrum of applications, algorithms and computational patterns. Computational patterns are important to maximize insights into different hardware architectures. Application coverage is essential as it provides more holistic insights into system level performance which can be hard to predict given the emerging complexity of increasingly heterogeneous hardware systems.
Support for algorithmic modification. Algorithmic modifications are inevitable to extract the best possible performance out of diverse hardware systems, for example to take advantage of caching and parallel hardware resources. Within machine learning, software and hardware co-design is compulsory [29] for energy constrained compute environments. To support this algorithmic freedom within the benchmark suite, application coverage is essential, as we correlate hardware performance independent of the algorithm back to application performance, which is equivalent to accuracy in this context. However, optimized performance alone is not sufficient, as not every system designer may be able to achieve it. We also need to reflect the out-of-the-box, naive performance. Both optimized and naive are representative of a specific hardware platform, and the difference gives a good indication of the development effort involved. We believe both should be part of the benchmarks and be captured together with development time or lines of code. Specifically for neural networks, quantization, compression, topological changes and pruning techniques are important optimization techniques that need to be considered.
Objective & Reproducible. To provide clear differentiation between marketing and scientific efforts, reproducible and objective results that do not favour any particular system configuration or hardware architecture are needed. Reproducible results are a key ingredient in the move towards Open Science; however, what does reproducibility actually entail? In the context of the plethora of esoteric AI accelerators, is it sufficient that an objective third party has validated the results? Or does it imply that everyone on the planet should be in a position to reproduce the results if they had access to the system at a reasonable cost? Some hardware systems are too expensive; for example, an NVIDIA V100 may be beyond someone's budget. Other hardware choices are only available for rent, such as Google's TPU versions as part of Google cloud.
Portability is a challenging subject as specialized hardware architectures come with their own design entry languages and compiler tool stacks. e community is fragmented by a huge choice of frameworks including Ca e, Tensor ow, Mxnet, eano, pytorch and Darknet. What is more, the prediction accuracy of a network depends on the choice of framework, since training data is passed through di erent preprocessing stages and numerical inaccuracies accumulate and manifest themselves as discrepancies. ese inaccuracies are exacerbated by the characteristics of oating point arithmetic [28] . As a result, models and frameworks are inherently tied together. ere are three basic choices: e rst is to constrain ourselves to exactly one framework as was done with Fathom [1] . Second, we could support all frameworks. However, given that we are dealing with di erent hardware backends, this causes an explosion in test infrastructure, as the number of tests multiplies with the number of frameworks. e nal choice and probably the cleanest, is to support one of the intermediate neural network representations such as ONNX [65] , NNEF [62] or TVM [83] , which provide translation between all popular frameworks. However, this requires hardware vendor support, which is currently limited.
Complexity vs Speed vs Accuracy Speed of result is essential, as the key purpose of a benchmark is to provide faster insights than developing the full end-system. There is a trade-off between speed, benchmark complexity and the accuracy of the results. Benchmarks which provide application and algorithmic breadth may require a large number of tests, which makes the benchmark suite inherently complex and limits its usefulness. Sometimes it is important to have less accurate predictions at a faster rate, and, for different users, different trade-offs are acceptable.
Adaptive As machine learning is a highly active research field where algorithms change fast, the benchmark suite should be adaptive and able to incorporate emerging popular algorithms, compute patterns and end applications.
RELATED WORK: EXISTING BENCHMARKING
In this section we take a look at existing benchmarks, and compare them regarding algorithmic scope and figures of merit.
TiBench differs from these efforts in a number of ways:
• Existing benchmarks do not address the fact that heterogeneous hardware platforms typically require co-designed algorithms and offer flexibility in datatype precision specifically, although MLPerf has open models for training. We introduce correlation of application and architecture figures of merit to compare different combinations of algorithms and architectures at the application level.
• We offer full visualization of the design space, rather than comparing performance for fixed levels of accuracy. Thus, interesting trade-offs can be highlighted.
• None of the existing benchmarks offer the same level of tiering, including theoretical level, and stacks of microbenchmarks that can
In the following, we expand and elaborate on the differences in greater detail. For this, we differentiate between ML benchmarks, performance benchmarks and NN system benchmarks. ML benchmarks exclusively focus on application performance, which is accuracy.
There is no consideration of the compute effort required or the resulting execution time. Performance benchmarks record hardware performance only, specifically throughput (measured in processed inputs per time or TOP/s), latency or response time in milliseconds (ms), and power consumption in Watts. They are agnostic of the application. NN system benchmarks, as shown in Figure 7, lie at the intersection and are at the heart of what we are striving for. They combine all figures of merit; both system performance and accuracy are correlated. In addition, functional correctness needs to be ensured even during performance testing.
NN System Benchmarks
TiBench falls into this family of benchmarking suites, which are unique in that they combine representative machine learning workloads with figures of merit from hardware performance benchmarks. BenchIP [79] is a benchmarking suite which has a broad set of machine learning tasks. Similar to TiBench, BenchIP adopts a multi-tiered approach with micro- and macro-benchmarks. However, BenchIP does not support the theoretical layer, which we use to cover compute efficiency and track benchmarking results. BenchIP also does not cover level 2, namely stacks of layers, which we believe bring great merit in isolating bottlenecks in data movement and highlighting problematic dimensionality in tensors. Finally, BenchIP does not offer the concept of comparison via pareto curves, which is essential to a) visualize the full scope of potential solutions within the design spectrum, and b) provide the necessary scope for algorithm optimizations matching the specifics of various accelerators. Fathom [1] is probably the first attempt to provide a representative workload for benchmarking that has algorithmic breadth beyond convolutional neural network inference and includes example training and unsupervised learning such as reinforcement learning and recurrent models. However, Fathom does not address the spectrum of numerical representations. It also does not support heterogeneous hardware platforms. In regards to framework strategy, Fathom advocates a unified software package, relying on compatible software stacks to emerge, and therefore only supports one framework, TensorFlow. With a primary focus on benchmarking for training and achieving application coverage rather than algorithmic breadth, TBD [95] adopts some of the concepts introduced in Fathom. It supports more frameworks and datasets and covers a range of applications, including image classification, machine translation, object detection, speech recognition, adversarial and deep reinforcement learning.
MLPerf [57] is a promising approach to providing system level benchmarks. Similarly to Fathom and TBD, it covers a representative range of applications, adding sentiment analysis and recommendation as target applications. It currently considers only training, but inference is in progress. MLPerf is created by a consortium of industry partners and universities, which should address objectivity criteria. Its key strengths are explicitly defined figures of merit and its strong industrial support. It provides the concept of open models, which allow for algorithmic optimizations that facilitate performance improvements for specific architectures. However, it does not explicitly support quantization.
DAWNBench [16] exclusively looks at ImageNet classification for training and inference. The benchmark sets very clear figures of merit such as "Time taken to train an image classification model to a top-5 test accuracy of 93% or greater" and "Latency required to classify one ImageNet image using a model with a top-5 test accuracy of 93% or greater". As such, it supports the concept of algorithmic optimizations by tying hardware performance to accuracy achieved at the application level, but falls short of visualizing the full design space. Finally, DAWNBench does not provide further insights beyond the specified figures of merit, and is limited in application scope.
The Collective Knowledge Framework [15] in conjunction with the ASPLOS Request Tournament [68], while narrow in scope (limited to ImageNet classification inference), opens up the design space for different hardware accelerators, facilitating architecture specific algorithmic transformations and correlation between accuracy, performance and power within a larger design space.
This is essential to support heterogeneous hardware architectures. ASPLOS excels in reproducibility, leveraging ACM artifact evaluation technology, and providing insight into hardware performance and error rate trade-offs through an online scorecard.
ML Benchmarks
The Machine Learning community has defined its own benchmarks which have an exclusive focus on achieved accuracy independent of the required compute, employing ensemble techniques and multi-crop which, in essence, linearly scale up the compute load per input data. The most popular of these is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [69]. The associated compute requirements are unrealistic, particularly when deployed in energy-constrained environments. CortexSuite [81] and BenchNN [11] are limited to measuring accuracy, where CortexSuite is constrained to perception and cognition while BenchNN shows the value of machine learning for approximate computing, based on 5 out of the 12 recognition, mining and synthesis applications from the PARSEC benchmark suite. DjiNN and Tonic [35] focus on deep learning tasks for warehouse scale computers, including image processing, speech processing and natural language processing. While Kaggle (www.kaggle.com) is not specifically a benchmark, it hosts a portfolio of data science challenges where the machine learning community competes with the latest topologies and algorithms for highest accuracy. MLBench [48] compares human derived learning algorithms against machine learning services from Amazon and Microsoft Azure.
Performance Benchmarks
DeepBench [20] is probably the most successful suite of microbenchmarks for neural network performance that measures and compares basic compute operations. It individually benchmarks direct convolutions, matrix multiply, and a specific LSTM layer for single precision and half precision floating point and, for some operations, 8-bit fixed point integer datatypes on hardware architectures. It currently features cloud deployment and some embedded data points on Raspberry Pi and iPhone. It captures the most popular compute patterns; however, it lacks support for lower precision datatypes and exclusively investigates performance. As such, it does not provide the mechanisms to tie algorithmic modifications back to the application level, nor does it provide insights into compute performance for reduced precision representations. DeepBench also does not cover data movement bottlenecks between layers, or potential bottlenecks around buffering state, as required for LSTMs for example, where capacity and access latency crucially impact overall speed. There are more general, machine learning agnostic, hardware benchmarks such as TPC [82] for the data processing community, SHOC [19], SPEC [72] and STREAM [52]. SHOC looks specifically at how to benchmark heterogeneous hardware systems using OpenCL as design entry.
Similar to TiBench, SHOC deploys microbenchmarks combined with application benchmarks and is multi-tiered. SPEC includes a broad range of applications including graphics, MPI, mail servers, virtualization, and storage, and STREAM exclusively focuses on memory bandwidth. None are specifically designed for machine learning or address the challenges of this application domain. gemmlowp [23], while not a benchmark, is specifically designed for matrix multiply operations; it includes low precision operations which may be suitable as a basis for implementing part of our benchmark suite.
Summary Overall, support for algorithmic optimization is limited across the whole spectrum of benchmarks, in particular in regards to quantization and pruning. None of the benchmarks above provide a multi-tiered approach in the same way we do, one that provides understanding of compute and data movement bottlenecks within the system and offers a theoretical level with efficiency tracking. None of the benchmarks offer a fair comparison for co-designed algorithms and full design space visualization. In Tables 6 and 7, we summarize the application scope of existing benchmarks and our proposed benchmark, as well as the key differentiators between existing benchmarks and our proposal, and discuss in Sec. 7 how we address these characteristics.
THE BENCHMARK PROPOSAL
The targeted design space is vast and comprised of a multidimensional spectrum of algorithmic and architectural co-designed end solutions. The aim of the benchmark is to expose the spectrum of possibilities and accurately reflect the capabilities of the different hardware platforms.
TiBench has the following key characteristics: We take a multi-tiered approach, which is one of our key contributions (Fig. 8). We tier the benchmark suite with respect to abstraction levels as well as numerical representations for both training and inference tasks. This provides not only attractive compromises in regards to speed versus minimal discrepancy with target workloads, but also brings advantages such as additional system level insights. The second key differentiator of our approach is the support for algorithmic optimization by coupling hardware performance with accuracy at the application level. In particular, this allows for objective comparison between floating point implementations and reduced precision models that can achieve much higher performance at a significantly reduced energy cost, among many other possible optimization strategies. Results are visualized via pareto graphs (accuracy versus latency, throughput and throughput/power) and optimal solutions can be found along the pareto frontier.
Third, we include a theoretical level as a baseline for benchmarking and performance estimation. The unique characteristics of TiBench include test suites at various abstraction levels, algorithmic optimizations and quantization, and in particular considerations in regards to datasets, hyperparameters and framework challenges, such as reproducibility and adaptability (see [53]).
Multiple Tiers - Abstraction Levels We defined 4 levels of abstraction (Fig. 8), discussed below. Level 0 - Theoretical Records, for all target hardware backends, the theoretically possible peak performance (TOps or GOps), external memory bandwidth (GBps), thermal design power (Watts) and cost ($), and for all models their compute and memory requirements; datapoints are shown in Sec. 3 and 4. Combining application requirements with hardware platform characteristics can be leveraged for performance predictions using roofline models [88]. Level 0 is a base layer, with results that are available instantly. It provides a target point of reference and guidance for optimization efforts, and allows metrics such as achievable compute efficiency to be computed. At level 0, we already introduce the notion of performance per datatype operation, which is essential to support quantization as an algorithmic optimization.
Two tables are presented in the appendix, one for hardware characteristics and one for neural networks. The hardware table has one row per hardware platform and supported native datatype; a minimum of Half Precision (FP16), Single Precision (FP32) and INT8 are recorded 4. In the second table, for each CNN, we record four values: total number of compute operations for a single input, the model size, the size of the state and the total amount of tensors in between layers that require buffering. These values can be used as a basis to derive memory requirements and compute requirements for both inference and training; examples are shown in Figure 5.
Level 0 - Roofline Analysis Using assumptions for where weights, tensors, gradients, weight updates and state of a neural network are stored, combined with the size of the datatypes used, allows us to derive the arithmetic intensity of a neural network during training and inference. Combined with the roofline for a given hardware platform, we can provide insight as to whether a neural network will be memory or compute bound and guidance for what is theoretically possible (Fig. 9).
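The roofline reasoning above can be illustrated with a minimal Python sketch. It assumes the standard roofline formulation, in which attainable performance is the lower of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The function names and all numeric values below are hypothetical and chosen only for illustration; they are not measurements from this work.

```python
def arithmetic_intensity(ops_per_input: float, bytes_moved_per_input: float) -> float:
    """Operations performed per byte moved to or from external memory."""
    return ops_per_input / bytes_moved_per_input


def attainable_performance(peak_ops_per_s: float,
                           mem_bw_bytes_per_s: float,
                           intensity: float) -> float:
    """Roofline bound: limited either by peak compute or by memory traffic."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * intensity)


def is_compute_bound(peak_ops_per_s: float,
                     mem_bw_bytes_per_s: float,
                     intensity: float) -> bool:
    """True if the intensity lies at or beyond the ridge point of the roofline."""
    ridge_point = peak_ops_per_s / mem_bw_bytes_per_s
    return intensity >= ridge_point


if __name__ == "__main__":
    # Hypothetical platform: 1 TOp/s peak, 20 GB/s external memory bandwidth.
    peak, bandwidth = 1e12, 20e9
    # Hypothetical network: 8 GOps and 100 MB of memory traffic per input.
    ai = arithmetic_intensity(8e9, 100e6)
    print("arithmetic intensity (ops/byte):", ai)
    print("compute bound:", is_compute_bound(peak, bandwidth, ai))
    print("attainable ops/s:", attainable_performance(peak, bandwidth, ai))
```

Changing the assumed datatype width changes the bytes moved per input, and therefore the arithmetic intensity, which is how quantization enters this level 0 estimate.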
Level 1 - Compute Patterns Level 1 exposes achievable compute performance for typical compute patterns encountered within neural networks, which equates to popular layers including convolutions, fully connected layers, recurrent layers, residual layers, and squeeze layers, over a range of dimensions and with different numerical representations (Sec. 2). These tests are comparable to DeepBench [20], with the significant difference that we provide much broader support for specialized numerical representations. For each of these compute patterns, and for both inference and training, we record the following figures of merit: measured performance (TOps or GOps), latency (ms), and power consumption (Watts) of the full platform in the embedded space, and of the board excluding the host system in the cloud. 5 While level 1 does not capture application level accuracy, the tests will include verification of functional correctness. The results should reflect achievable compute performance, excluding potential bottlenecks for moving data, which are addressed in level 2. While requiring execution, the tests at level 1 are relatively rapid. We include a sweep over batch and thread sizes.
Level 2 - Compute & Data Movement Level 2 is comprised of simple combinations of level 1 tests, and can thereby effectively capture potential bottlenecks such as tensor movement between layers, as well as storage requirements. It considers stacks of level 1 layers and only includes a subset of all possible combinations to keep test time to a minimum. We include mixed precision between layers in these small template stacks for both inference and training. Figures of merit are identical to level 1. In particular, the latency variation between level 1 with single fused layers and level 2 with layer stacks will bring insight into data movement and buffering bottlenecks.
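One way to read the latency variation between the two levels is sketched below in Python. This is an illustrative assumption rather than the benchmark's defined procedure: it treats the layers of a stack as executing sequentially, so that the gap between the measured stack latency and the sum of the individually measured layer latencies hints at data movement and buffering cost. The function names and the example latencies are hypothetical.

```python
from typing import Sequence


def data_movement_overhead_ms(stack_latency_ms: float,
                              layer_latencies_ms: Sequence[float]) -> float:
    """Latency of a level 2 stack not explained by its level 1 layers.

    Assumes sequential execution, so level 1 latencies are additive.
    A negative value suggests the backend overlaps or fuses work across layers.
    """
    return stack_latency_ms - sum(layer_latencies_ms)


def overhead_fraction(stack_latency_ms: float,
                      layer_latencies_ms: Sequence[float]) -> float:
    """Overhead expressed as a fraction of the measured stack latency."""
    if stack_latency_ms <= 0:
        raise ValueError("stack latency must be positive")
    return data_movement_overhead_ms(stack_latency_ms, layer_latencies_ms) / stack_latency_ms


if __name__ == "__main__":
    # Hypothetical latencies (ms) for one stack and its three constituent layers.
    stack = 5.0
    layers = [1.5, 2.0, 1.0]
    print("overhead (ms):", data_movement_overhead_ms(stack, layers))
    print("overhead fraction:", overhead_fraction(stack, layers))
```

Because many backends fuse layers automatically, the additive assumption may not hold on every platform, and the result is best treated as a rough indicator.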
Level 3 - Applications Application coverage is essential to offer space for algorithmic innovation, which can achieve superior system-level performance and can only be validated when combined with application results. As such, achieved accuracy becomes the bar for normalizing results, independent of the neural network. We include the initially planned datasets and models (Table 6), taken from existing benchmarks, and complement these with models that have been shown to work well with pruning and quantization optimizations. Furthermore, contributors are welcome to provide different models for given machine learning tasks. See the Appendix for a complete list.
For inference, we include performance measurements for a single image. The error rate is the reported test error over the whole test dataset. For training, we report throughput, training time (latency), and power for a single image as well (including correctness tests). We also provide measurements over longer training sequences with specific accuracy targets, for example, measuring the complete training time to 90% top-5 accuracy for ImageNet classification with ResNet50. Finally, we offer the option to optimize the training algorithm and network and record all possible data points in a multi-dimensional graph; for those it is essential to include development time. Similar concepts are being applied in MLPerf and Request [15, 57]. There is no single criterion that decides whether one solution is optimal, as for different use cases, different figures of merit apply. All combinations yield different trade-offs within the multidimensional design space. As such, we present all solutions and measurements within multi-dimensional figures, whereby the pareto frontier represents the best possible compromises (Fig. 1).
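As an illustration of how a pareto frontier can be extracted from recorded data points, the following Python sketch keeps every solution that is not dominated by another one. It assumes two figures of merit that are both to be maximized (accuracy and throughput); the labels and numbers are hypothetical placeholders, not results from this work.

```python
from typing import List, Tuple

# (label, accuracy in %, throughput in inputs per second); higher is better for both.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return all points that no other point dominates.

    A point is dominated if another point is at least as good in both
    figures of merit and strictly better in at least one of them.
    """
    frontier = []
    for label, acc, perf in points:
        dominated = any(
            (a >= acc and p >= perf) and (a > acc or p > perf)
            for _, a, p in points
        )
        if not dominated:
            frontier.append((label, acc, perf))
    # Sort by throughput so the frontier can be drawn as a curve.
    return sorted(frontier, key=lambda point: point[2])


if __name__ == "__main__":
    # Hypothetical solutions combining a network, a datatype and a platform.
    candidates = [
        ("A", 70.0, 100.0),
        ("B", 75.0, 60.0),
        ("C", 68.0, 90.0),   # dominated by A
        ("D", 75.0, 50.0),   # dominated by B
    ]
    for label, acc, perf in pareto_frontier(candidates):
        print(label, acc, perf)
```

For figures of merit that should be minimized, such as latency or power, the same function can be reused by negating the value before passing it in.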
Algorithmic Optimizations including Quantization This benchmarking proposal opens up the opportunity for algorithmic innovations. We include in this pruning and topological changes, while initially focusing on quantization and numerical representations. For this, we include, on every level of the benchmark, several numerical representations, including FP32, FP16, INT8, BIN and TERN, and allow for arbitrary choices to be included, for example Microsoft's custom floating point [40]. Training each neural network with different quantization approaches and different and potentially esoteric numerical representations is highly time-intensive. Therefore, careful logging of trained quantized models is a high priority for level 3.
Frameworks & Datasets Datasets are a key input to the benchmark and impact accuracy results. We rely on open source datasets exclusively. Framework support is expected to be one of the biggest challenges, since each framework is directly connected with a neural network and datasets within an application context, and models are not necessarily portable. Therefore, we need operational hardware backends for a diverse set of AI accelerators, which may or may not be available. Furthermore, quantization is not necessarily mainstream in frameworks. It is not yet clear to what extent cross compilation tools such as TVM [83] can help, while exchange formats such as ONNX [65] are still immature and lack adoption and, very importantly, full quantization support. Training scripts exposing all hyperparameters, training initializations and so on must be fully logged, as they can have a significant impact on accuracy.
Power and Energy To represent power and energy cost, we only report platform power measured at the socket. While this is not necessarily accurate, there are strong reasons behind this choice. First, the measurement needs to be fair; therefore, we believe subsystems, including memory specifically, need to be taken into account. Second, more detailed current sampling may be available on some platforms, but each platform comes with different interfaces, and may or may not provide access to all power rails. While the accuracy of typical socket power meters is around 10%, we found that these results remain representative of the systems. Furthermore, we average the results over 10 measurements.
Another consideration is whether to report power or energy per frame. We settled on using absolute power consumption, since when multithreading or batching is applied it is hard to derive a representative number for energy, and it would differ depending on whether the end application is latency or throughput driven. Finally, idle power on these platforms can represent a significant percentage of the overall power budget and would therefore cloud the observation. In particular, the FPGA platform is an evaluation board with many peripherals, which is reflected in high idle power (19.9 Watts) compared to the GPU (between 3.4 and 5.0 Watts depending on operating mode), while the additional dynamic power consumption is minimal and yields the FPGA overall as the more efficient platform despite the initial load.
Testbeds, Reproducibility, & Recorded Measurements In order to provide useful scientific results, all experiments and measurements must be validated and reproducible. Specifically:
• All input data to the test suites must be openly accessible.
• Many platforms can be made available through virtualized compute environments, which is adequate if the cost is not prohibitive. However, some platforms may not be available. Therefore, an open testbed may be advisable and considered as an extension to this benchmark.
• As the higher levels of benchmarks may require a long time to run and hardware may not be available, we advocate recording of results, whereby each entry will be validated by a third party such that results are guaranteed to be a) reproducible and b) correct.
Our colleagues in the Request Tournament effort [68] leverage ACM's rigorous artifact evaluation technology and the Collective Knowledge Workflow Framework [15] and do an outstanding job addressing this. We aim to adopt the same principles.
Adaptability Machine Learning is currently a highly dynamic field; specific algorithms may become outdated very quickly, and new models may emerge and take over rapidly. We plan to adapt fast and add or retire models as machine learning science matures.
EXPERIMENTAL RESULTS & EVALUATION
We present measured results aimed at evaluating the defined benchmarking tests and figures of merit to ensure that they accurately reflect a system's capabilities. For test platforms, we used the Nvidia TX2 GPU and the Xilinx ZCU104 FPGA. For both platforms, we carried out all levels of tests on one specific Machine Learning task, ImageNet classification, for two different neural networks, GoogleNetV1 and ResNet50. We use FP32, FP16 (supported by GPU), and INT8 (supported by FPGA) as numerical representations, a form of algorithmic optimization. We run GPU platforms with a spectrum of batch sizes and different operating modes (MaxN, MaxQ, MaxP), which are optimized for different performance and power consumption targets 6. For FPGAs, there is a spectrum of implementations available. We exercise the Deephi DPU overlay, which uses threads instead of batch sizes to achieve high system utilization, and therefore exercise a spectrum of thread counts. For FPGAs we show the theoretical limits of the current implementation (which is clocked at 666MHz), as well as the datasheet peak performance at 750MHz. For GPUs, we use the theoretical peak as dictated by the clock frequencies defined by the operating mode. Full experimental results are provided in the appendix. We have currently only exercised inference to validate the benchmark methodology. In the following, we evaluate each benchmarking level individually and then provide a first critical review of these early results.
Level 0 Using values for hardware platforms and arithmetic intensity (AI), we created rooflines for the target platforms and performance predictions for both networks 7. Fig. 9 shows that both NNs will be compute bound for INT8, FP16 and FP32. The arithmetic intensity should be higher for larger batch sizes (batch size of 1 is shown), but the performance prediction for larger batch sizes will be identical. The theoretical performance prediction can be derived from this and is summarized in Table 8. These numbers are used to compute efficiency for levels 1, 2 and 3.
Level 1 and Level 2 We restrict the evaluation of level 1 and level 2 to ResNet50, as this is sufficient to make the key observations. The ResNet50 topology is relatively regular in structure, consisting of a top convolutional layer with pooling combination, 16 residual blocks, and a fully connected layer. Each residual block is comprised of thresholding layers, convolutions, and elementwise additions. As the convolutions account for the majority of the compute, we focus mainly on the convolutional layers of the network. Since the platform-specific frameworks perform layer fusion as network optimization, level 1 represents the smallest possible fused layer structure. Table 16 shows level 1 and level 2 latency results for one TX2 hardware configuration (MaxN, FP16) with different batch sizes, as well as level 1 results for ZCU104 with different thread numbers. We restrict level 1 to convolutions of different sizes and select the residual layers res2a, res3a, res4a and res5a to get an overview over the whole network. Level 2 results are provided for all residual layers of the network. Due to limited support by the hardware-specific framework, it is not possible to benchmark level 2 on FPGA platforms. We observe a large discrepancy in execution time for different residual stacks, even though the compute requirements within each are similar. It is likely that data movement varies significantly depending on the incoming and outgoing tensor dimensions. Therefore, it is important to include as many layer types as possible in level 1 and 2 testing. We would expect this to be even more pronounced for other topologies, as they may be less balanced than ResNet50. We also observe a large discrepancy between the performance of different convolutional layers (Table 16, level 1). Unlike the residual blocks, this is anticipated, as they come with very different compute requirements. Furthermore, the differences are more pronounced with larger batch size.
It is therefore our plan to include the full spectrum of convolutional layers within level 1.
Fig. 10. Performance comparison layer0, layer1, layer2 and layer3 for TX2 (MaxN, FP16 configuration)
Multi-Tiered Concept Fig. 10 depicts the performance measurements of the various levels. We restricted the visualized experiments to the MaxN, FP16 configuration on TX2, and a subset of microbenchmarks on level 1 and level 2, for a spectrum of batch sizes. Note that the theoretical peak performance is significantly higher than measured performance, and only within reach of individual layers that fit the hardware architecture well. The system (level 3) achieves from 41.1 to 60.7% efficiency, where larger batch sizes achieve higher performance. Level 2 results are on average lower than achieved system performance (level 3) and a fairly good approximation, within 16% of the achievable level 3 system performance, but far off level 3 compute performance. Level 1 results usually show better performance than the level 2 results. This makes intuitive sense, as a limited number of bottlenecks are exposed during execution of the benchmark. In particular, less weight storage is required, which is most likely contained on-chip, thereby alleviating any potential memory bottlenecks. It can also be said that the averaged level 1 results provide a good estimation of possible compute performance on level 3. As already mentioned, for level 1 and 2 results, we observe large variations in performance for different dimensions of convolutions. The insight is that to provide a good projection from level 1 or level 2 to level 3, we need to provide full coverage of convolutional layers. Another challenge is that many backend tools perform automated layer fusion, such as merging batch normalization with convolutions, which makes testing in isolation inaccurate.
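The efficiency figures and the level-to-level comparison discussed above can be expressed with a short Python sketch. It assumes that efficiency is defined as measured performance divided by the level 0 theoretical peak, and that the gap between two levels is taken relative to the level 3 value; both definitions and all numbers in the example are illustrative assumptions, not values taken from the tables of this work.

```python
def compute_efficiency(measured_gops: float, peak_gops: float) -> float:
    """Fraction of the theoretical (level 0) peak that was actually achieved."""
    if peak_gops <= 0:
        raise ValueError("theoretical peak must be positive")
    return measured_gops / peak_gops


def relative_gap(estimate_gops: float, reference_gops: float) -> float:
    """Relative deviation of a lower-level estimate from the level 3 reference."""
    if reference_gops <= 0:
        raise ValueError("reference performance must be positive")
    return abs(estimate_gops - reference_gops) / reference_gops


if __name__ == "__main__":
    # Hypothetical values in GOps for one platform configuration.
    level0_peak = 1000.0
    level2_measured = 450.0
    level3_measured = 500.0
    print("level 3 efficiency:", compute_efficiency(level3_measured, level0_peak))
    print("level 2 vs level 3 gap:", relative_gap(level2_measured, level3_measured))
```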
Level 3 - Full system level performance evaluation The aim of level 3 is to explore optimal solutions within the design space regarding application performance, independent of model topology and algorithmic optimizations. We include results for both platforms (TX2, ZCU104), for INT8, FP16 and FP32, across the spectrum of batch sizes and thread numbers for both GoogleNetV1 and ResNet50. See plots of pareto points (Fig. 11) and results in the Appendix. We made the following key observations. Firstly, the ZCU104 FPGA provides higher system level (948 GOPs) and compute level performance (1067 GOPs) than the GPU platform (809 GOPs and 1011 GOPs respectively) for both GoogleNetV1 and ResNet50 (Fig. 11, top left). For GoogleNetV1, the FPGA provides better performance and accuracy. For ResNet50, the FPGA provides better performance but lower accuracy compared to the GPU platform. Further, the GoogleNetV1 topology provides more than 2x the performance of ResNet50, due to the significantly lower compute per frame required as part of the neural network topology, while ResNet50 provides the best accuracy across the platforms. The accuracy difference is 1.59% for the FPGA and 4.27% for the GPU (Fig. 11, top left). Additionally, the ZCU104 outperforms the TX2 in regards to latency by orders of magnitude and across topologies, unless GPUs operate with small batch size, where the performance efficiency drops. GPU latency varies from a minimum of 8ms to a maximum of 1838.5ms for batch=128. FPGA latency varies from 9.65ms to 65ms. Finally, the GPU platform is more power efficient, which can be attributed to the GPU platform being more optimized, whereas the FPGA platform is more general purpose. This is apparent when considering idle power (Sec. 7: 5 Watts for TX2 and 19.9 Watts for ZCU104).
Fig. 11. Level 3: System Performance Evaluation
In the evaluation above we consider full system-level performance (Fig. 11), including initial data movement, as well as compute-only performance. Depending on the end application, it may be important to factor out the initial data movement from the overall time, as the inference engine might be included in a larger compute data path, where the inputs are streamed directly from on-chip resources. However, when analyzing the experimental data points for both GPU and FPGA platforms, it appears that the difference is very regular in nature, and it is not obvious that a distinction within the benchmark is necessary (see Appendix) as long as it is clearly indicated what is measured. The pareto curves are an effective means to compare different topologies and different platforms, leaving space for algorithmic optimizations. We plan to leverage 3- or 4-dimensional graphs to additionally explore relationships between latency and system-level performance.
CONCLUSION & FUTURE WORK
Neural networks are fast gaining popularity across an increasing number of applications. However, they are accompanied by challenging compute and memory requirements, as shown in Section 3, which seriously challenge a semiconductor industry that is already facing performance scalability issues. This is of particular importance for embedded computing environments, where real estate, power and available compute and memory resources are at a premium. As such, the industry is turning to both algorithmic innovation in the form of new topologies, quantization, and pruning strategies, as well as architectural innovation with more and more heterogeneous devices and the emergence of specialized DPUs. To facilitate better insights into the increasingly complex space of end solutions which involve hardware-software codesign, and to evaluate new concepts in computer architecture, novel NN system benchmarks are needed.
TiBench is a proposed novel benchmarking methodology to help drive hardware innovation and provide insights for system level designers in understanding possible performance accuracy trade-offs for newly devised and fine-tuned algorithms combined with highly customized accelerators. Key contributions are that we provide concepts that allow benchmarking of highly optimized algorithms by tying hardware characteristics back to the end application, thereby providing the needed algorithmic freedom. Another key differentiator in this benchmarking concept is the introduction of the multi-tiered approach, including a theoretical level and consideration of a spectrum of numerical representations at all levels. As such, the benchmark can provide insights at various abstraction levels. This brings two key advantages: a) it provides a spectrum of insights, and users can choose from instant but perhaps crude results to elaborate results which require longer evaluation; and b) the multi-tiered approach provides insights into system bottlenecks. For example, are the recurrent or the fully connected layers the challenge? Or is the bottleneck the data movement in between? We present initial experimental results on two types of neural network topologies aimed at image classification tasks, and exercise them on two different types of hardware platforms for all levels of the proposed benchmarks. We present some of the lessons learned while exercising the benchmarks and the challenges encountered, and analyze the quality of the results in regards to real system performance at the various levels.
This effort is just beginning. Future work will focus on refining details and running broader experimentation. We plan to expand on level 0 results first and build out test suites targeting FPGAs, GPUs, CPUs and DPUs within the embedded space. Many concepts regarding reproducibility need to be refined, as well as automated software testing infrastructure as proposed by deep500.org. Also, collaboration with larger efforts such as MLPerf will be beneficial to gain traction. We invite the research community to contribute to TiBench.
A APPENDIX: TABLE OF RESULTS 
