Deep neural networks have become the readiest answer to a range of application challenges, including image recognition, stock analysis, natural language processing, and biomedical applications such as seizure detection, all while outperforming prior leading solutions that relied heavily on hand-engineered techniques. However, deploying these neural networks often requires computationally and memory-intensive solutions. These requirements make it challenging to deploy Deep Neural Networks (DNNs) in embedded, real-time, low-power applications, where classic architectures such as GPUs and CPUs still impose a significant power burden. Systems-on-Chip (SoC) with Field-programmable Gate Arrays (FPGAs) can be used to improve performance and allow finer-grained control of resources than CPUs or GPUs, but it is difficult to find the optimal balance between hardware and software to improve DNN efficiency. The current research literature offers few solutions for optimizing hardware and software deployments of DNNs in embedded low-power systems. To address the computational resource restrictions and low-power needs of deploying these networks, we describe and implement a domain-specific metric model for optimizing task deployment on differing platforms, both hardware and software. Next, we propose a DNN hardware accelerator called Scalable Low-power Accelerator for real-time deep neural Networks (SCALENet) that includes multithreaded software workers. Finally, we propose a heterogeneous aware scheduler that uses the DNN-specific metric models and the SCALENet accelerator to allocate a task to a resource by solving a numerical cost for a series of domain objectives. To demonstrate the applicability of our contribution, we deploy nine modern deep network architectures, each containing a different number of parameters, within the context of two different neural network applications: image processing and biomedical seizure detection.
Utilizing the metric modeling techniques integrated into the heterogeneous aware scheduler and the SCALENet accelerator, we demonstrate the ability to meet computational requirements, adapt to multiple architectures, and lower power by providing an optimized task-to-resource allocation. Our heterogeneous aware scheduler reduces total system power consumption by 10%, does not affect the accuracy of the networks, and still meets the real-time deadlines. We demonstrate the ability to match or exceed the energy efficiency of NVIDIA GPUs when evaluated against the Jetson TK1 embedded GPU SoC, with a 4× power savings in a power envelope of 2.0 W. When compared to existing FPGA-based accelerators, SCALENet's accelerator and heterogeneous aware scheduler achieve a 4.8× improvement in energy efficiency.
INTRODUCTION
Deep neural networks (DNNs), including Convolutional Neural Networks (CNNs), are the baseline implementation for computer vision tasks and obtain an order-of-magnitude improvement over conventional techniques. As deep learning continues to permeate other fields, the challenge becomes deploying such networks in embedded low-power systems. Deployment in these situations requires an efficient network implementation that meets the accuracy, power, and latency requirements of a targeted application [8, 26, 30, 35, 39, 46]. Graphics processing units (GPUs), predominantly from NVIDIA, have shown [32, 34, 39] the ability to significantly boost efficiency during training by exploiting the underlying data-level and task-level parallelism that exists within neural networks (NNs). While massively outperforming their CPU counterparts, GPUs, including mobile GPUs, have a power envelope that is still too high for a large variety of low-power embedded settings [31, 34, 39]. When deploying data-rich applications, embedded systems have strict power and latency budgets that require local processing to perform feature extraction and classification [33, 39]. Embedded designs require understanding the limitations and trade-offs of deploying a task to a particular platform. An example is biomedical applications, where a solution must be provided within a fixed window of time. Providing guidance on where a design needs a hardware accelerator saves a large amount of development and testing effort. The focus of this work is on deploying a highly flexible and efficient solution that accounts for the complexity of heterogeneous implementations of DNNs in low-power designs. We accomplish this by addressing two problem statements:
• Understanding the relationship between DNN kernels and resource utilization in hardware and software.
• Allocation of DNN tasks to the right resources for optimizing the desired design objectives.
The first contribution reduces the complexity of design deployment by exploring many DNN tasks in hardware and software and creating a series of performance metric models that include memory access and data exchange. Our second contribution is the proposed SCALENet hardware accelerator. Finally, we propose a heterogeneous aware scheduler, shown in Figure 1, that uses the metric models and SCALENet to find the optimal task-to-resource allocation based on the desired design objectives, such as latency, resources, or power. Our contributions create a metric model based on empirical data from simulation and benchmarking of the DNN kernels from nine different reference networks on the chosen platform. This method creates hardware and software models that can be extrapolated to larger designs with differing kernel acceleration trade-offs at a level of fidelity that does not exist in current research. This provides a realistic cost analysis of the processing, communication, and resource consumption that each decision incurs. Current literature also lacks the ability to deploy a hardware-only or co-designed solution, which SCALENet provides based on design parameters that include latency of first solution. Finally, unlike other solutions, the design goal of SCALENet is to provide a balanced solution in terms of resources and power to meet requirements, not to maximize throughput at the cost of increased power consumption. To demonstrate the applicability of these contributions, we explore nine NN architectures using the design guidance provided by the heterogeneous aware scheduler and the SCALENet accelerator. Our evaluation includes both computer vision datasets, CIFAR and ImageNet, and the biomedical application of seizure detection. The image processing tasks contain enough variance in their architecture to illustrate different deployment configurations for SCALENet.
Additionally, the biomedical application of seizure detection demonstrates our proposed solution in a real-time low-power setting. The main contributions of the article include:
• A domain-specific approach for metric modeling and finding the relationship between resources and DNN tasks.
• SCALENet's hardware accelerator for DNNs.
• Task allocation to a platform formulated as a constrained cost problem and the solution using empirical-based heuristics for metric models.
• SCALENet's heterogeneous aware scheduler.
• An evaluation of SCALENet's heterogeneous scheduler and implementations along with comparison of the hardware accelerator on nine different DNNs, two image processing data sets, and a biomedical seizure detection application.
The remainder of the article is organized as follows. In Section 2, we begin with a brief background discussing previous methods for deployment of DNNs on an embedded system, review complexity insights gained from modern state-of-the-art network architectures, and conclude with an overview of the scheduling of heterogeneous platforms. Section 2 continues with a brief overview of DNN tasks, including convolution and fully connected (FC) layers. Section 3 reviews relevant work in the areas of FPGA-based accelerators and heterogeneous tasking. Section 4 outlines the core principles of creating a metric model for DNN-specific heterogeneous tasking. The key contributions of the article are provided in Sections 5, 6, and 7. Section 5 discusses the proposed SCALENet accelerator and architecture, including the hardware, software, and design flow. Section 6 focuses on the heterogeneous aware scheduler and how it utilizes the earlier methods for domain characterization, allocation, and metric model generation for SCALENet and the associated tasking. Section 7 discusses the proposed DNN heterogeneous scheduler. Section 8 provides the implementation and evaluation of the proposed SCALENet and scheduler on reference networks. Finally, Section 9 compares the proposed hardware accelerator and scheduler to existing DNN accelerators and implementations on commercial off-the-shelf (COTS) hardware solutions.
BACKGROUND
In recent years there has been a focus on the deployment and application of DNNs for a variety of problem sets, including natural language processing, computer vision, and biomedical applications. With each dataset exploration, there has been the development of NN accelerators using differing network architectures. These accelerators focus on accelerating one or two kernel functions or a single fixed network. The solutions target either an FPGA [9, 13, 32], for ease of development, or an ASIC [8, 11, 40], for lower power and higher performance. The performance of smaller and less complex networks has contributed greatly to understanding the requirements for achieving high network accuracy. The most pertinent of these network changes is the research in [14, 42], which shows that improving upon the state of the art does not require increasing the computational and memory requirements of networks but can come through architectural innovation. The network changes include the reduction in the number of fully connected layers and replacement of those layers with smaller filters and convolutional layers [43]. Another contribution from Reference [43] was the increase in performance of the training stage with little effect on the network accuracy. Addressing the complexity of computation in CNNs has been a recent focus, with alternative solutions to the direct convolution method. One such method is to use an FFT [45] to reduce the number of calculations and speed up training. The FFT convolution [2], see Section 2.1, is efficient with large filter sizes and requires only multiplication between data and filter, because the operation occurs in the frequency domain. Each of these network advances has been evaluated with a focus on maximizing performance. These advancements localize improvement such that clear performance gains outweigh consideration of the power or resource consumption of the final design.
For the allocation of resources to DNN task assignment, we treat the problem of hardware and software as a domain-specific heterogeneous computing (DSHC) problem. DSHC is a method of analyzing performance and associating tasking to different processor architectures. In this work, we apply the definition of heterogeneous processing to an application's co-execution in a combination of hardware and software. When restricted to DNNs, the use of domain-specific abstraction enables performance gains in the heterogeneous model per [6, 18]. Applying the generalization of DSHC makes it clear that a small set of algorithmic operations within an application are executed disproportionately often and will benefit the most from targeted task assignment [2, 44]. Assessing the resource utilization of the DNN kernels, and thus dividing the computational tasking beneficially across this heterogeneous system, requires a tractable evaluation method. The most common method is to evaluate in four categories: task characterization, the allocation problem, allocation approaches, and analysis.
Common DNN Tasks
In DNN architectures there exist two phases, a training phase and an inference phase. The training phase works on a known set of annotated data samples to create a model. Training a model implements the back-propagation algorithm, which iteratively updates parameters and model weights to improve the predictive ability of the model. Depending on the application, the training phase can be used to fine-tune a trained model, where the weights of a previously trained network are used to initialize the parameters of a new training phase. The process then modifies the old weights for a new constraint such as a different dataset. Typically, the training phase only occurs once per model. The second phase is the inference phase, which uses the trained model to classify new data samples on inputs not previously seen. Running data forward through a model implements the inference phase, each time classifying a new sample. This work focuses on the acceleration of the inference phase and the computations of those forward-propagating layers in DNNs. There exists in the DNN domain a large variety of layer types, including fully connected, convolutional, pooling, batch normalization, and other special functions such as activation functions. Out of all the computational layers, convolutional layers and fully connected layers are often the most highly utilized in networks and contain the majority of the complexity in the form of computation and memory. In the following subsections, we briefly review the details of fully connected layers, convolution layers, and activation layers.
Fig. 2. A side-by-side comparison between a fully connected and a 1D convolutional layer for input X and output Y. The edges designate multiplication between input and corresponding weight. For the convolutional layer, edge color designates tied weights. In fully connected layers there exists dense connectivity between inputs and neurons.
Convolution.
Convolutional layers can be seen as a form of structured sparsification that significantly reduces complexity while also being able to improve training by reducing the parameter space. Convolution can also be represented as an FC layer with two constraints: the first is the neurons are connected to only a limited subset of inputs that are in a local neighborhood. For 1D convolution layers, this corresponds to the temporal closeness of inputs, and for 2D convolution layers, this corresponds to the spatial closeness of inputs. The second constraint is it enforces extensive weight sharing between neurons. These two constraints mathematically correspond to performing a series of convolution operations between the input and a set of filters. A convolution layer comprises multiple filter banks, which we refer to as feature maps. Each feature map is fed all the input feature channels that contain temporal/spatial data and produces a corresponding output feature channel. This is achieved by convolving each input channel with a unique filter and summing across the convolved outputs to produce the output feature channel. Figure 2 presents a side-by-side comparison of a fully connected layer and a 1D convolutional layer. The figure highlights the sparse connectivity obtained using a convolutional layer and the potential use of weight sharing. In the example, the fully connected layer requires performing 2 × 5 × 7 = 70 operations and storing 7 × 5 = 35 weights, while the convolutional layer requires performing 2 × 5 × 3 = 30 operations and storing 3 weights.
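The operation and weight counts quoted for Figure 2 can be reproduced with a short sketch. It assumes, since the figure itself is not reproduced here, that the example has 7 inputs, 5 outputs, and a 3-tap filter, and that each multiply-add pair counts as two operations; the function names are illustrative only.

```python
def fc_cost(n_in: int, n_out: int) -> tuple[int, int]:
    """Return (operations, weights) for a fully connected layer.

    Each output neuron performs n_in multiplies and n_in adds,
    counted here as 2 * n_in operations, and owns n_in unique weights.
    """
    ops = 2 * n_in * n_out
    weights = n_in * n_out
    return ops, weights


def conv1d_cost(n_out: int, k: int) -> tuple[int, int]:
    """Return (operations, weights) for a single-channel 1D convolution.

    Each of the n_out outputs touches only k neighbouring inputs, and
    the same k filter taps are shared by every output.
    """
    ops = 2 * n_out * k
    weights = k
    return ops, weights


# Assumed reading of the Figure 2 example: 7 inputs, 5 outputs, 3-tap filter.
print(fc_cost(n_in=7, n_out=5))    # (70, 35)
print(conv1d_cost(n_out=5, k=3))   # (30, 3)
```

Under these assumptions the convolutional layer needs fewer than half the operations and less than a tenth of the weights, which is the sparsification and weight-sharing effect described above.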
Direct Convolution (Direct-Conv).
The direct method is the classic approach to convolution and uses a sliding window technique. The filter strides over the data and at each stride position performs a Multiply-Accumulate (MAC) operation between the data patch and the corresponding filter. Equation (1) describes this operation as the sum of the subset of one-dimensional data d multiplied by the one-dimensional filter f:
For example, with a two-dimensional data set that has an N × N dimension and a filter of size K × K (N > K), the number of patches is (N − K + 1)² when using a single stride. For the two-dimensional data, the operation sums the MAC results between patches of data channels and filter channels to obtain an output. Figure 3 shows the convolution between a single-channel 4 × 4 data and a 2 × 2 filter. The data is divided into nine patches that undergo the MAC operation to get nine pixels of the 3 × 3 output.
36:6 C. Shea and T. Mohsenin
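The sliding-window procedure can be sketched for the single-channel 4 × 4 by 2 × 2 case. The data and filter values below are hypothetical, since the contents of Figure 3 are not available here, and the function name is illustrative. Like most DNN frameworks, the sketch does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def direct_conv2d(data, filt):
    """Slide a K x K filter over N x N data with unit stride and no padding."""
    n, k = len(data), len(filt)
    out_dim = n - k + 1                      # patches per row and per column
    out = [[0] * out_dim for _ in range(out_dim)]
    for r in range(out_dim):
        for c in range(out_dim):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += data[r + i][c + j] * filt[i][j]   # one MAC
            out[r][c] = acc
    return out


# Hypothetical 4 x 4 input and 2 x 2 filter.
data = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
filt = [[1, 0],
        [0, 1]]

result = direct_conv2d(data, filt)
print(len(result) * len(result[0]))   # 9 patches, i.e. (4 - 2 + 1) ** 2
print(result)                         # [[7, 9, 11], [15, 17, 19], [23, 25, 27]]
```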
FFT Convolution (FFT-Conv).
Convolution in the time domain is equivalent to multiplication in the frequency domain. References [1, 2] have shown that, depending on the size of the filter, the reduction in math operations can make FFT convolution faster than the standard direct convolution. The first step of the FFT method is to pad the filter and data to the FFT size of (N + K − 1), where the data is an N × N matrix and the filter is a K × K matrix. Next, the filter and data are independently transformed into the frequency domain by an FFT of length (N + K − 1). After the transformation, the two matrices are multiplied element-wise, and an inverse FFT is applied to obtain the output, as detailed in Equation (2). A drawback of this method is that the filter must be padded to size (N + K − 1), which increases the intermediate storage for the coefficients because of the zero padding. Figure 4 shows an example of a 2 × 2 filter being zero padded to 5 × 5 to match the FFT size of the input data:
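The pad, transform, multiply, and inverse-transform steps can be checked with a small one-dimensional sketch. Two simplifications are assumed: the example is 1D rather than 2D, and a naive O(n²) discrete Fourier transform stands in for a real FFT so that only the standard library is needed. The function names and input values are illustrative. Note that this computes a true convolution, in which the filter is flipped relative to the sliding-window MAC.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, standing in for an FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def fft_conv1d(data, filt):
    """Linear convolution computed in the frequency domain.

    Both operands are zero padded to N + K - 1 so that the circular
    convolution implied by the DFT equals the linear convolution.
    """
    size = len(data) + len(filt) - 1
    d = list(data) + [0] * (size - len(data))
    f = list(filt) + [0] * (size - len(filt))
    product = [a * b for a, b in zip(dft(d), dft(f))]   # element-wise multiply
    return [v.real for v in dft(product, inverse=True)]


def direct_conv1d(data, filt):
    """Reference time-domain linear convolution for comparison."""
    out = [0] * (len(data) + len(filt) - 1)
    for i, dv in enumerate(data):
        for j, fv in enumerate(filt):
            out[i + j] += dv * fv
    return out


data, filt = [1, 2, 3, 4], [1, -1]          # N = 4, K = 2, padded size 5
print([round(v, 6) for v in fft_conv1d(data, filt)])
print(direct_conv1d(data, filt))
```

Up to floating-point rounding, the two printed results agree, and the padded length of 5 matches the N + K − 1 size used in the text's 4 × 4 and 2 × 2 example.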
2.1.4 Fully Connected.
In fully connected (FC) layers there exists a unique edge between each of the inputs, or previous layer's outputs, and each of the neurons. Each neuron therefore performs a dot product of its inputs with a unique weight vector. For high-dimensional inputs, the dense connectivity and large parameter set of fully connected layers make it challenging to learn a meaningful transformation:
Activation Layer
Activation layer functions, also known as transfer layers, are shown in Figure 5 and are used to decide whether a neuron should be fired or activated. The functions come in two types: linear activation and non-linear activation. Linear functions are useful for simple datasets that lack complexity or varied parameters, such as binary neural networks. Non-linear activation layers are currently the most popular and are used to generalize across a variety of data and to differentiate between outputs. The most commonly used DNN activation is the rectified linear unit (ReLU), which has the following Equation (4):
To further understand the activation process, we can describe the activation of a neuron in a layer l that is stored in an activation column-vector a^l, where the superscript index denotes the layer. The connections from the neurons in layer l − 1 to the layer l are stored in a weight matrix W^l, and the biases for each neuron are stored in a bias column-vector b^l, as seen in Equation (5):
Equation (5) is a more complete mathematical representation of fully connected layers that includes the activation function present in each neuron.
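As an illustration of the fully connected forward pass with activation, the sketch below assumes the standard forms: ReLU as max(0, x), and a layer computed as the activation applied element-wise to W^l a^(l−1) + b^l. The function names, weights, biases, and inputs are hypothetical.

```python
def relu(x: float) -> float:
    """Rectified linear unit: passes positive values, zeroes the rest."""
    return max(0.0, x)


def fc_forward(weights, bias, a_prev, activation=relu):
    """Compute a^l = activation(W^l a^(l-1) + b^l) for one layer.

    weights: list of rows, one row of input weights per neuron in layer l
    bias:    one bias per neuron in layer l
    a_prev:  activations of layer l-1
    """
    a = []
    for w_row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(w_row, a_prev)) + b   # dot product plus bias
        a.append(activation(z))
    return a


# Hypothetical layer with 3 inputs and 2 neurons.
W = [[1.0, -2.0, 0.5],
     [0.0, 1.0, 1.0]]
b = [0.5, -4.0]
a_prev = [1.0, 2.0, 3.0]

print(fc_forward(W, b, a_prev))   # [0.0, 1.0]
```

The first neuron's pre-activation is −1.0 and is clipped to zero by ReLU, while the second neuron's value of 1.0 passes through unchanged.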
RELATED WORK

DNN Partitioning on FPGAs
A number of recent works [20, 36, 37] have investigated novel solutions to efficiently accelerate DNNs on FPGAs as well as edge devices. Of the more recent design-space solutions for DNN accelerators, a small set has focused on the partitioning of functionality in heterogeneous systems. In Reference [32], we described a fixed-function CNN accelerator that operated on 16-bit floating point (FP) data and had a fused-function accelerator in the FPGA that combined convolution, ReLU, and a simplified batch normalization, while non-CNN layers were executed by a naïve implementation in software. The CNN accelerator outperformed, in terms of GOPS/W/s, similar architectures and even NVIDIA TK1 and TX1 embedded GPUs. We also provided direct integration with Torch, a scientific computing framework with wide support for machine-learning algorithms using the Lua language. However, we provided little discussion on how the partitioning was chosen or on how the data transfers were taken into account in the final solution.
In Reference [9], the effort focuses on extending the Caffe [21] deep-learning framework to enable OpenCL-based designs for Xilinx FPGAs, specifically a Xilinx Virtex 7 XC7VX690T, using Xilinx's SDAccel software. The design targets only the FPGA and uses the host processor only for programming, passing data, and launching kernels. The authors fix the data precision of the accelerator at 32-bit FP and provide only a fixed-function acceleration method. They propose a modified version of the Winograd convolution algorithm with a 3 × 3 convolution layer and unit stride. Their algorithm includes a method for reducing digital signal processing (DSP) requirements by pre-computing a partially transformed filter and a run-time transformed input data, which are combined to create the output. Their work concludes with a modification to the Caffe framework that allows pipelined layers, eliminating the need to synchronize memory with the host between OpenCL kernels and enabling pipeline parallelism across multiple layers and targets.
Reference [13] proposes a custom framework for CNN deployment and claims high power efficiency in FPGAs with a design mapping flow. Both include support for parametrised and run-time configurations. Reference [13] contributes an implementation of quantization with a strategy that uses a dynamic radix point position, 8-bit or 16-bit, across different layers and a fixed 3 × 3 filter size. This work does not discuss the design trade-offs of function acceleration in FPGA fabric versus software execution on the Zynq's ARM processor.
In Reference [27], the authors stipulate that hardware/software co-design for hybrid platforms is becoming a processing distribution challenge. This challenge is mainly due to significant variation in the performance of different applications and the large number of possible platforms that could be targeted. To address this, the authors propose a neural-network-based cross-platform performance estimation tool called XPPE. XPPE uses the resource utilization of an application on a specific FPGA to estimate the performance on other FPGAs. The proposed advantage of XPPE is that it enables developers to explore the design space without requiring them to fully implement and map the application to multiple platforms.
Reference [28] asserts that current DNNs have reached a point that surpasses human vision. Due to their increased depth, the number of computations, and thus the power necessary per classification, has grown considerably. To address this, the authors propose an iterative CNN process that transforms a single feed-forward network into a series of sequentially executed smaller networks, where each smaller network processes a sub-sample of the input and features extracted from the previous network are used to enhance the classification accuracy. This process creates a dynamically approximated solution, which allows for the possibility of early termination and thus performs the classification with far fewer operations.
The work in Reference [25] proposed a method for applying a sparse Winograd algorithm to CNNs on FPGAs with a batching method for transforming inputs. The design continues with an architecture that accelerates the Winograd-based convolution in the FPGA. This design uses a unique algorithmic solution that breaks the problem into three parts: pre-processing, post-processing, and communication processing elements, each with its own functionality. The article does not discuss why the work was divided this way and provides no insight into the design trade-offs.
Reference [10] explores the use of block-circulant matrices and Fast Fourier Transform (FFT)-based fast multiplication to reduce computational complexity. The authors present a universal DNN inference engine built around a configurable-size FFT building block and take advantage of block-circulant matrices to reduce storage complexity, as the circulant reformatting of non-square matrices enables data to be represented as a set of vectors. This work concludes with a series of benchmarks of the design in FPGA, ASIC, and embedded software. Similarly, the authors in [15] propose cyclic sparsely connected (CSC) layers as an overlay for FC layers to compress DNNs in which FC layers dominate the model size. Both methods in References [10] and [15] reduce the computational complexity of dense layers from O(n²) down to O(n log n).
The authors of Reference [29] propose an approach for energy-efficient CNNs through decomposition of sequentially executed multiple stage CNNs (MS-CNN). They assert that MS-CNNs form a contextual awareness of input data in the initial stage. This contextual awareness can be used to dynamically change the structure and connectivity of such networks to reduce their computational complexity. They propose three run-time optimization policies and illustrate how these policies construct a dynamic architecture for a wide range of applications with various accuracy requirements, resources, and time budgets, without the need for network re-training.
Heterogeneous Tasking
The authors in Reference [38] investigate the scheduling challenges of multithreaded applications on composite core architectures. They developed a systematic approach to predict optimal configuration for executing multithreaded workloads on the composite cores architectures. This is accomplished using an offline machine-learning approach to predict core type, voltage and frequency to maximize energy-efficiency.
Reference [18] outlines a problem set and solution for task mapping in a heterogeneous compute stack. Their contributions include an approach for modeling run-time characteristics, a metric for modeling applications in a specific domain (finance), and an account of how task allocation can be solved. Their solution set includes CPUs, GPUs, and FPGAs and provides an in-depth explanation of why tasks are assigned to each processing platform. However, because their focus is a time-insensitive problem and they look to do a gross estimation of tasking, their final solution does not map to our problem set.
In Reference [49], the authors define a framework they call Synergy that allows for heterogeneous processing. The architecture enables cross-device acceleration across CPU and FPGA using a unified abstraction for heterogeneous accelerators. By using the abstraction and a work-stealing scheduler, the architecture keeps both the hardware and software processing elements actively working. The authors propose a software toolchain to abstract deployment of work to either ARM or FPGA resources. Like SCALENet, Synergy focuses on enabling work scheduling on either the ARM or the FPGA. However, SCALENet is designed to interface with existing DNN software, not define its own. Also, Reference [49] does not address time deadlines, nor does it explain how, or if, latency and time to solution are considered.
In Reference [6], the solution for heterogeneous deployment on CPUs and GPUs is the development of a domain-specific language (DSL), OptiML for machine learning, to map tasking to different devices. The authors then describe Delite, a framework for creating the product of the DSL and a run-time layer. They further differentiate their work by describing three fundamental goals for Delite and OptiML to meet: productivity, performance, and portability and forward scalability. The end product of the work is auto-generated code, from Delite and OptiML, for both the CPU and GPU. While this work focuses on a heterogeneous CPU-GPU platform, it fails to provide FPGA compute solutions and requires the re-development of applications in a new higher-level language.
Similar to our previous work [32], but unlike Reference [49], SCALENet is designed to interface directly with existing scientific framework software for application deployment and network development by ingesting a developed and trained network and producing an embedded, tuned solution. Unlike all other works, SCALENet's hardware accelerator is fully customizable and can deploy any layer or multiple concurrent layers in hardware. It is not designed for just CNN deployments. Since SCALENet supports a more natural method for creating and adding new network definitions, it can support multiple different architectures. SCALENet deploys a mixture of hardware and/or software DNN kernels based on the constraints of the designer, which may include having various kernel functions in hardware, such as CNN, FC, and max pooling. SCALENet's hardware can be deployed with one of four different data precisions, enables clock gating for lowering power, and supports an all-hardware deployment option. Similar to Reference [49], the software processor layer implementations are fully parallelized using native ARM Neon intrinsics for optimal throughput and performance. Unlike other designs, SCALENet supports multithreaded kernel execution in software. Compared to Reference [18], the metric models are explicitly created for machine-learning application kernels and used directly by the heterogeneous aware scheduler. Finally, unlike previous solutions, the metric models are built before deployment from offline benchmarks and simulations that create empirical data, not just theoretical models.
HETEROGENEOUS MODELING
In this section, we describe and quantify our approach to a domain-specific application for heterogeneously deployed DNN applications. We assert that this provides the necessary characteristics of tasking in a heterogeneous execution and provides an effective and efficient methodology for task allocation. In the following description of our method, we will use a simple matrix multiply and accumulate operation as a generalization to represent task characterization, allocation of the problem, and analysis [18, 22] . For the following discussion, we represent software and hardware tasking as separate platforms.
Characterization of Heterogeneous Tasks
Characterizing the execution of a task on a heterogeneous system can be generalized as task profiling, platform benchmarking, and task-to-platform characterization [18]. Task profiling is used to identify all possible tasks necessary to execute a given application. These smaller tasks are listed and further analyzed as part of the analytic platform benchmark, where they are used to identify the capabilities of each system. Ideally, this benchmark utilizes multiple distinct task implementations to either represent the desired task or closely emulate the task as discovered in the profiling. Finally, the task-to-platform characterization fuses the data from task profiling and the platform benchmark into an execution model, which in turn produces a device performance model.
Domain Metrics for DNN.
In this application, the discovery of domain metrics is through the definition of a design space where we define the quantitative characteristics that a task must provide. For DNNs in this heterogeneous computational space the metrics are:
• Latency: defined as the time from start to finish.
• Resources: defined as the resources necessary to complete a given task.
• Throughput: defined as the rate at which a task is completed.
Regarding DNNs, latency could be the time required to execute and complete a MAC operation at various filter and input image dimensions. The resources in this example would be defined by the target type, FPGA or CPU. In an FPGA this would be the number of DSP units, the amount of block random access memory (Block RAM), the number of lookup tables (LUTs), and the number of flip-flops (FFs). For a CPU this would be the operation type, whether NEON/SIMD operations support it, and how many cores are needed.
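A minimal software-side sketch of collecting two of these metrics, latency and throughput, for a MAC task at several sizes is shown below. It is an illustration only and not the benchmarking harness used in this work: the function names, the vector sizes (chosen to mimic 3 × 3, 5 × 5, and 7 × 7 filters), and the repeat count are all assumptions, and the resource metric is not measured.

```python
import time


def mac(a, b):
    """Multiply-accumulate over two equal-length vectors."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def benchmark_mac(sizes, repeats=100):
    """Measure software latency and throughput of the MAC task per size.

    Returns {size: {"latency_s": ..., "throughput_ops_per_s": ...}}.
    """
    results = {}
    for n in sizes:
        a = [1.0] * n
        b = [0.5] * n
        start = time.perf_counter()
        for _ in range(repeats):
            mac(a, b)
        elapsed = time.perf_counter() - start
        latency = elapsed / repeats                     # seconds per task
        results[n] = {
            "latency_s": latency,
            "throughput_ops_per_s": n / latency if latency > 0 else float("inf"),
        }
    return results


# Hypothetical task sizes: flattened 3x3, 5x5, and 7x7 filter patches.
for size, metrics in benchmark_mac([9, 25, 49]).items():
    print(size, metrics)
```

The measured values depend on the machine, so no expected output is given; a table of such measurements per platform is the kind of empirical input a metric model can be fitted to.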
Metric Models for DNN.
To create a model for metric performance, the task inputs are mapped to the domain metrics for a target platform. The model maps p real inputs through domain functions to m real metric values, where f is the domain function model, P is the set of real inputs, and M is the set of real metric values:
DNN Variables and Parameters.
To further refine Equation (7), we define the space $P$ as the union of two disjoint subsets, valid and invalid. Here, $P_v$ represents the valid input vectors and $P_i$ represents the invalid input vectors, as in Equation (8):
$$P = P_v \cup P_i, \qquad P_v \cap P_i = \emptyset \qquad (8)$$
To illustrate this in the DNN domain, we choose a task that consumes a large percentage of the computational resources. As Figure 6, a computation-time breakdown of AlexNet, shows, the convolutional layer consumes a disproportionate share of resources and time. Therefore, the worked example uses a convolutional layer. The DNN model specifies a convolutional layer; a set of inputs with the correct filter size lies in the $P_v$ space, while a non-conforming filter, in $P_i$, has a filter size outside the layer definition from the DNN model. Once the input defines the correct filter, the application domain explicitly states which input elements can be varied without changing the correctness of the output. Non-explicit elements such as data precision or architecture can therefore be varied without affecting the correctness of the results. Input elements that can be varied are defined as domain variables, and those that cannot are domain parameters.
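The split between valid and invalid inputs can be illustrated with a short Python sketch. It assumes that validity is decided purely by comparing a candidate filter shape with the layer definition; the class name `ConvLayerSpec`, its fields, and the example shapes are hypothetical and are not taken from SCALENet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayerSpec:
    """Domain parameters fixed by the DNN model (hypothetical field names)."""
    in_channels: int
    kernel_h: int
    kernel_w: int


def is_valid_input(spec, filter_shape):
    """True when the filter shape matches the layer definition (a member of P_v)."""
    return filter_shape == (spec.in_channels, spec.kernel_h, spec.kernel_w)


def partition_inputs(spec, candidates):
    """Split candidate filter shapes into (valid, invalid) lists: P_v and P_i."""
    valid = [c for c in candidates if is_valid_input(spec, c)]
    invalid = [c for c in candidates if not is_valid_input(spec, c)]
    return valid, invalid


# Illustrative layer: 64 input channels with 3 x 3 filters.
spec = ConvLayerSpec(in_channels=64, kernel_h=3, kernel_w=3)
p_v, p_i = partition_inputs(spec, [(64, 3, 3), (64, 5, 5), (32, 3, 3)])
```

Data precision does not appear in the validity check, which mirrors its role as a domain variable rather than a domain parameter.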
Identifying and Populating the DNN Metric Models.
Section 4.1 defines the metric model mathematically. Because the domain function can be represented in many different ways, we look to a fundamental operation present in many DNNs for a simple representative model: the multiply-accumulate (MAC) function. In the DNN domain, a metric model for the convolution of a subsection of the data input with the filter can be expressed as follows:
where $A$ and $B$ are of dimension $(k, n)$, and $k$ and $n$ are non-zero real numbers. The latency metric for this operation is represented as a time per element in a given implementation $\alpha$:
The effective cost metric for applying the filter in a given model is the cost per second on a platform $\beta$:
Given Equations (10) and (11), we assert that the solution to $f$ is deterministic, and benchmarking is used to generate the task- and platform-specific metric model coefficients. The benchmarking procedure generates a set of domain variables and metric variables.
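As a rough illustration of how benchmarking can yield a per-element latency coefficient, the sketch below times a pure-Python multiply-accumulate and fits a single coefficient. It assumes that latency scales linearly with the element count $k \cdot n$ and fits through the origin; that linear form, and the helper names `benchmark_mac` and `fit_alpha`, are assumptions made for the example rather than the paper's procedure.

```python
import time


def mac(a, b):
    """Multiply-accumulate over two equally sized 2-D lists."""
    total = 0.0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += x * y
    return total


def benchmark_mac(k, n, runs=20):
    """Mean wall-clock time of one (k, n) MAC over `runs` repetitions."""
    a = [[1.0] * n for _ in range(k)]
    b = [[0.5] * n for _ in range(k)]
    start = time.perf_counter()
    for _ in range(runs):
        mac(a, b)
    return (time.perf_counter() - start) / runs


def fit_alpha(sizes, times):
    """Least-squares fit of times ~= alpha * (k * n), through the origin."""
    elems = [k * n for k, n in sizes]
    return sum(e * t for e, t in zip(elems, times)) / sum(e * e for e in elems)


# Benchmark a few filter sizes and derive the coefficient for this platform.
sizes = [(3, 3), (5, 5), (11, 11)]
times = [benchmark_mac(k, n) for k, n in sizes]
alpha = fit_alpha(sizes, times)
```

The fitted `alpha` is specific to the machine and implementation that ran the benchmark, which is why one coefficient set is needed per task and platform.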
Allocation of the Tasking
The characteristics in the previous section are necessary, but they represent the tasks only as isolated instances and not as a complete, interdependent solution. To address the use of isolated tasks in a cooperative solution, we now discuss how multiple domain metric model functions are combined to create a unified design space.
General Allocation.
The allocation of tasks to heterogeneous computing resources has previously been explored [4, 5, 18, 22] and formulated as the partitioning of a series of tasks. It is assumed that a task consumes the resources of its assigned platform for the duration of the task and that it is allocated before execution. We describe the general objective as minimizing the makespan, that is, the total latency between the start of the first task and the return of the final result, as described in Section 4.1.1. To represent makespan minimization as an integer problem to be solved for a given design goal, assume the following: $P$ is the set of processors, or platforms, in a heterogeneous system, and $T$ is the set of $m$ tasks to be assigned. $W(R)$ is a weighting function used to evaluate the resource consumption of a task implementation; it accounts for the trade-off between more resources with faster throughput and fewer resources with lower throughput:
In Equation (12), $u$ and $v$ are arrays of values. Array $u$ holds the amount of each resource consumed by the task, while array $v$ holds the total amount of the same resources available on the platform. The execution cost of task $i$ on processor $p$ is
Here, $x_{ip}$ is the execution cost of task $i$ on processor $p$ with a dataset of size $(m \times n)$, and it includes the function $W(R)$. Equation (14) is the function evaluation of a task and the communication between tasks:
We combine the execution cost of a task in Equation (13) with the communications cost in Equation (14) and rewrite them. A processor $p$ that executes a task in $T$ has an associated communication cost $c_{i,j}$, and the cost of consecutive tasks on different processors can be expressed as
when task $i$ and task $j$ are assigned to different processors. The final objective is to find an assignment $A : T \rightarrow P$ that minimizes the sum of execution and communication costs:
when the following conditions are met:
If task $i$ is assigned to processor $p$, then $a_{ip} = 1$, and $a_{ip} = 0$ otherwise. The constraint $\sum_{p=1}^{n} a_{ip} = 1$ ensures that task $i$ is assigned to exactly one processor. We expand the simple definition of cost allocation in Equation (14) to include the communications cost incurred when executing tasks on different heterogeneous processors, giving Equation (16). This equation provides a more complete representation of the cost of executing dependent tasks across heterogeneous processors.
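The objective described above, execution cost plus a communication cost whenever consecutive tasks land on different processors, can be sketched in Python for a small task chain. This is an exhaustive search written only for illustration; it assumes a linear chain of dependent tasks and made-up cost values, and it is not the solver used by SCALENet. Representing the assignment as one processor index per task enforces the one-processor-per-task constraint by construction.

```python
from itertools import product


def total_cost(assignment, exec_cost, comm_cost):
    """Execution cost plus communication between consecutive tasks
    that are placed on different processors."""
    cost = sum(exec_cost[i][p] for i, p in enumerate(assignment))
    for i in range(len(assignment) - 1):
        if assignment[i] != assignment[i + 1]:
            cost += comm_cost[i]
    return cost


def best_assignment(exec_cost, comm_cost):
    """Try every task-to-processor assignment (small problems only)."""
    n_tasks = len(exec_cost)
    n_procs = len(exec_cost[0])
    return min(product(range(n_procs), repeat=n_tasks),
               key=lambda a: total_cost(a, exec_cost, comm_cost))


# Illustrative costs: rows are tasks, columns are processors (0 = CPU, 1 = FPGA).
exec_cost = [[4.0, 1.0], [2.0, 3.0], [5.0, 1.0]]
comm_cost = [2.5, 0.5]  # cost of moving data between task i and task i + 1

best = best_assignment(exec_cost, comm_cost)
best_total = total_cost(best, exec_cost, comm_cost)
```

In this example the middle task is cheaper on the CPU in isolation, yet the minimum-cost assignment keeps all three tasks on the FPGA because the two transfers would cost more than the CPU saves.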
Multiple Metric Allocation Solution
The working assumption in this method is that all the valued metrics are known before execution. In addition to these, it is possible to provide additional constraints to the optimization. For this to be practical, it is necessary to create multiple metric optimization solutions that represent the heterogeneous platform's domain metric trade-offs. The trade-off is accomplished by changing the allocation of tasks to platforms. For the metric generation, a range of values is provided for all metrics that satisfy the domain-specific application. These ranges of metrics are generated for the multiple DNN architectures evaluated in Section 8. Together, the multiple metric allocation methods capture the domain's knowledge of the application as it applies to heterogeneous computing. The domain-specific approach to allocation for DNNs assigns tasks to platforms by balancing and optimizing domain metrics.
SCALENet's heterogeneous modeling, unlike others, is designed to include the limitations of processor-to-processor communications and is further refined in the next section to include the impact on the total platform. These models are created specifically to address the use of DNNs in low-power real-time systems.
PROPOSED SCALENET HARDWARE ACCELERATOR
5.1 SCALENet Overview
SCALENet, shown in Figure 7, is a hardware accelerator built with generated HDL and C/C++ that targets heterogeneous FPGA-based SoCs for real-time DNN inference applications. It can be deployed as an all-hardware accelerator or as a heterogeneous hardware-software implementation. SCALENet is a flexible architecture that targets both CNN- and FC-based networks and exploits the heterogeneity of the SoC architecture by using all of its computational elements (CPU and FPGA), known as processing engines (PEs), for task execution. SCALENet uses FPGA logic, software multithreading, and, when available, NEON SIMD instructions for parallelized CPU execution. The final inference design is created from three inputs: a Torch-exported weights and bias model, a trained Torch DNN model, and a series of design objectives. Once all three inputs are provided, SCALENet generates the necessary hardware bitfile, configures the communications channel, and provides the Torch mapping file needed for execution. The following sections provide details on the basic scheduler, software, and hardware.
Deploying DNNs requires a solution that achieves satisfactory throughput while being as efficient as possible in power and area. Doing so requires a design that addresses the critical requirements of deploying a network architecture, including network flexibility, resource consumption, and power minimization. In the proposed SCALENet accelerator, these design aspects map to two main architectural choices: the original basic scheduler and a customizable design feature set.
Scheduler Overview
The basic SCALENet scheduler is a two-part algorithm that balances latency, throughput, and power consumption through both coarse-grain and fine-grain attributes of the system. The goal is to optimize for latency based on available hardware and software models. The algorithm inputs are the desired throughput, the neural network model to use, the target Xilinx FPGA, and the data precision. The coarse portion of the scheduling algorithm examines the layers of the specified model and provides the configuration with the lowest latency for the FPGA and, if possible, a software layer execution. This stage outputs two configuration files for the second stage: a hardware configuration file and a software configuration file. The hardware configuration file contains the optimal number and types of PEs, data memory offsets for the different tasks, the size of internal memory, and the optimal number of active PEs per layer. The software configuration file identifies the Torch commands that set up the hardware calls for the appropriate accelerated software and hardware tasks, and it lists the tasks the multithreaded PEs will work on. The proposed heterogeneous scheduler is presented in Section 7.
Customizable Design
The SCALENet accelerator is written in C/C++ for both software task execution and hardware generation with the Xilinx Vivado High-Level Synthesis (HLS) tool. The flexibility of a high-level language enables customizing each implemented DNN task with minimal hardware-software experience. Thanks to advances in tool design, HDL generated from well-written C/C++ with Xilinx pragmas includes minimal bloat compared to human-generated HDL. In our experience, the unnecessary additional logic in HLS-generated HDL can be optimized to within 2% of the FFs and 3% of the LUTs of a traditionally written HDL implementation. The SCALENet iterative data paths accommodate datasets of different sizes without necessarily requiring recompilation. Figure 9 provides a layer-wise overview of SCALENet's deployment. The top layer executes Torch and the customized Torch calls for using SCALENet as an accelerator. The middle layer is the SCALENet software deployment, which contains the multithreaded software implementations and the communications with the SCALENet hardware. The final layer is the SCALENet hardware accelerator, which contains the processing engines and hardware controller (CTRL). Each of the bottom two layers has a set of customization parameters, some set at compile time and some at run-time, as determined by the scheduler.
Software Layer.
SCALENet's software architecture provides a multithreaded implementation for a fixed number of PEs. The implementation is optimized to use the processor's native SIMD extensions, NEON on the ARM processor, to accelerate task execution; see Figure 9. The NEON-optimized code comes from Compute Library, a library released by ARM that currently supports a wide variety of their processors and provides operations on up to 128 bits at a time. The library provides optimized NEON code for neural network tasks including convolution, fully connected layers, activation, and more. SCALENet uses only the library's functions for DNN tasks. Alongside the task threads, two communications threads handle data and task synchronization between hardware and software. These work as data marshals that initiate DMA transfers to, or receive them from, the hardware. They also send the hardware's CTRL the configurations the PEs need at run-time.
Hardware Layer.
The hardware layer contains three parts: interface, controller, and PEs. The interface consists of a DMA endpoint engine connected to either the general-purpose AXI slave (GP) port of the embedded ARM or the high-performance AXI slave (HP) port. Either port is currently limited to the 32-bit interface. The processor's GP master port is always used to provide configuration details from the ARM processor to the PE controller. The HP port, when enabled, allows for a faster and, depending on the configuration, wider path to the processor. When using either the GP or HP interface for data, the hardware uses the AXI protocol to move data in and out. Data received or sent is stored in the local memory buffer. This buffer acts as a write-through store when the next task runs in software, and as a temporary store when the next task runs in hardware. The PE controller is a processor used only to provide addresses and PE enable signaling. The hardware configuration, produced by the first stage of the scheduler, is built from the domain variables that are configurable only at compile time: the maximum number of PEs, the size of a PE's local memory, the data precision, and the PE types. The maximum number of PEs is limited by the available resources, which vary with the FPGA family. The hardware design template is configured to set the scratchpad memory per PE in hardware. The hardware memory size is increased or decreased depending on the number of PEs, the layer type (CNN or FC), the DNN architecture, and the available resources.
Processing Elements
This subsection describes two items that apply to both software and hardware PEs. The third, clock gating, applies only to the hardware PEs.
Flexible Parallelism.
In deep neural networks, FC and CNN layers dominate the complexity, typically accounting for more than 90% of the computation and memory requirements. Supporting and deploying these layers efficiently is therefore required. SCALENet supports PEs that accelerate both types of layers; see Figure 8. For convolution, SCALENet uses output channel tiling, a form of parallelism that performs convolution across multiple output channels concurrently for a given input channel [48]. SCALENet supports two types of convolutional PE: direct convolution and FFT convolution [1, 2]. To meet an application's latency requirement, the coarse-grain scheduler chooses either a direct or an FFT convolution solution. When deploying a design with convolution, the scheduler may insert at least one FC PE to accelerate any FC layers as necessary. The floating-point (FP) form has the advantage of supporting dynamic range, underflow/overflow conditions, and a non-uniform scale. Using FP is beneficial for neural networks because layers tend to differ in their dynamic range and filter weights usually follow a normal distribution in which smaller values require higher-precision data. The downside is that floating-point requires more logic than an equivalent fixed-point implementation. Furthermore, supporting a higher dynamic range requires dedicating a subset of bits to the exponent, which leaves fewer bits for the mantissa and hence less precision. Figure 10 provides a visual breakdown of a single PE's hardware resources as a function of the data precision.
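The precision trade-off can be made concrete with a small Python sketch that rounds a few weights to 16-bit floating point and to a 16-bit fixed-point grid. The Q8.8 format (8 fractional bits) is an assumption chosen for the example; the source does not state which fixed-point format, if any, SCALENet compares against.

```python
import struct


def to_fp16(x):
    """Round x to IEEE 754 half precision and back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fixed(x, frac_bits=8):
    """Round x to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


# Compare the absolute rounding error of both formats for a few weights.
for w in (0.5, 0.1, 0.001):
    fp_err = abs(to_fp16(w) - w)
    fx_err = abs(to_fixed(w) - w)
    print(f"{w}: fp16 error {fp_err:.2e}, fixed-point error {fx_err:.2e}")
```

Small weights such as 0.001 round to zero on the uniform fixed-point grid, whereas the non-uniform floating-point scale still represents them closely.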
Clock Gating.
The Xilinx Vivado HLS design tool does not expose user control of the clock-enable function for the generated HDL beyond a single disable for all processing resources. In SCALENet, we explicitly turn on the clock-enable option in HLS. This option drives the clock enables of all clocked resources from a single top-level enable. SCALENet remaps the generated top-level clock enable in the HDL to a configurable masking value set by the CTRL for each of the PEs. This remapping lets the fine-grain scheduler turn unused PEs on and off at run-time. Disabling the clock minimizes power consumption at the expense of latency. With Xilinx FPGAs, this power savings can range from 1% to 20% [39, 47].
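A per-layer masking value of the kind described above might be computed as in the following Python sketch. The encoding is hypothetical: it assumes one bit per PE and that the lowest-indexed PEs are the ones kept active, neither of which is specified in the source.

```python
def pe_enable_mask(active_pes, max_pes):
    """Bitmask with bit i set when PE i keeps its clock enabled."""
    if not 0 <= active_pes <= max_pes:
        raise ValueError("active_pes must be between 0 and max_pes")
    return (1 << active_pes) - 1


def layer_masks(active_per_layer, max_pes):
    """One clock-enable mask per layer, in execution order."""
    return [pe_enable_mask(n, max_pes) for n in active_per_layer]


# Illustrative schedule: 8 PEs built, with 8, 4, and 1 active in three layers.
masks = layer_masks([8, 4, 1], max_pes=8)
```

The list of active PEs per layer corresponds to the "optimal number of active PEs per layer" entry of the hardware configuration file.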
Hardware Generation
Hardware generation for the SCALENet accelerator is a combination of an XML file, custom scripting and Xilinx's HLS tool. As presented in Section 5.2 one of the products of the scheduler is a configuration file for the generation of the SCALENet FPGA hardware. The XML file contains a series of key-value pairs that represent the hardware configuration the SCALENet scheduler has planned out. A subset of the example-generated XML file based on the network in Figure 7 is below:
Once the XML file is generated, it is processed first by the custom SCALENet scripts and then the hardware generation stage of SCALENet, depicted in Figure 11 . There are two SCALENet custom task stages to the hardware generation: pre-HDL generation and post-HDL generation.
5.5.1 Pre-HDL Generation.
The two SCALENet custom tasks consist of four parts: XML parsing, extraction of key-value pairs, modification of the HLS TCL file, and clock enable. The first part opens the generated XML and extracts all key-value pairs that describe the configuration of the scheduled hardware. Next, the extracted table is parsed for valid predefined key-values that correspond to the C++ macros defined by #define directives in the source code. Internal to the hardware generation is a reference table of the possible #define macros in C++ and their mappings to the XML's key-value pairs. If a key-value pair is not found in the XML, then the scheduler did not populate the pair; it takes on the default value and is not specified in the next step. Once the mapping for the currently defined macros is complete, the hardware generation software opens a Xilinx-generated TCL file that contains all the settings for Xilinx's HLS tool. The TCL file is parsed for the discovered macros, and any that match are updated. If a macro is not present, then its definition is added to the TCL file to ensure it is present when the HLS design generates the HDL.
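The parse, map, and update sequence can be sketched in Python as follows. Everything specific here is hypothetical: the `<param key="..." value="..."/>` schema, the keys and macro names in `KEY_TO_MACRO`, and the idea that the macros live in a string of `-DNAME=VALUE` compiler flags are assumptions for illustration, since the source does not show the XML or TCL contents.

```python
import xml.etree.ElementTree as ET

# Hypothetical reference table: XML key -> C++ macro name.
KEY_TO_MACRO = {
    "max_pe": "MAX_PE",
    "pe_mem_size": "PE_MEM_SIZE",
    "data_precision": "DATA_PRECISION",
}


def extract_pairs(xml_text):
    """Collect key-value pairs from <param key="..." value="..."/> elements."""
    root = ET.fromstring(xml_text)
    return {p.get("key"): p.get("value") for p in root.iter("param")}


def to_macros(pairs):
    """Keep only keys in the reference table; unlisted macros keep defaults."""
    return {KEY_TO_MACRO[k]: v for k, v in pairs.items() if k in KEY_TO_MACRO}


def merge_defines(flags, macros):
    """Update matching -DNAME=VALUE tokens and append any that are missing."""
    tokens = flags.split()
    seen = set()
    for i, tok in enumerate(tokens):
        if tok.startswith("-D") and "=" in tok:
            name = tok[2:].split("=", 1)[0]
            if name in macros:
                tokens[i] = f"-D{name}={macros[name]}"
                seen.add(name)
    tokens += [f"-D{n}={v}" for n, v in macros.items() if n not in seen]
    return " ".join(tokens)


xml_text = (
    '<config>'
    '<param key="max_pe" value="8"/>'
    '<param key="data_precision" value="16"/>'
    '<param key="unknown" value="1"/>'
    '</config>'
)
flags = merge_defines("-DMAX_PE=4 -DDEBUG=0", to_macros(extract_pairs(xml_text)))
```

Here `MAX_PE` is updated in place, `DATA_PRECISION` is appended because it was absent, and the unrecognized key is ignored so that its macro keeps the default from the source code.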
Post-HDL Generation.
After the Xilinx script.tcl file is modified and saved, the software executes the HLS tool with the script.tcl specified. The output of the HLS tool is a hierarchical netlist design specified in either Verilog or VHDL; SCALENet depends on the Verilog implementation. Next, the generated Verilog is analyzed and modified, per Section 5.4.3, to remap the single high-level clock enable to discrete PE resource groups. The remapping stage analyzes the Verilog and the Xilinx build logs, looking for the Xilinx primitives, including DSPs and BRAM, that make up each PE. Once found, they are grouped, and the stage modifies the Verilog by deleting wires and creating new ones that link the PE resources to the multiplexed control output of the CTRL. Next, the custom scripts run the Xilinx HLS tool again to create a packaged IP core for further integration. Finally, the software executes the Xilinx Vivado tool, using the Vivado script.tcl, to ingest the IP core and generate an FPGA bitfile.
Comparison to Other Hardware Accelerators.
The SCALENet hardware accelerator provides flexibility in several ways that exceed those in the current literature. To begin, its ability to be deployed either as a standalone hardware-only solution or as an integrated accelerator is unique. SCALENet's hardware accelerator has an in-hardware scheduler that allows designs with mixed PE functions to use clock gating to save power when those PEs are not in use. The accelerator scheduler looks for optimum deployment configurations to provide the best solutions for the design constraints and the desired platform. SCALENet is also built using standard C++ with Xilinx HLS pragmas, making it easy to expand and customize for new and different DNN kernels.
METRIC MODEL GENERATION
To generate each metric model for the heterogeneous scheduler, we follow the methods outlined in Section 4. The generation begins offline by analyzing multiple DNN architectures and dissecting each layer into individual tasks. From there, we create a set of platform implementations, in hardware and software, for each extracted task, such as CNN, FC, ReLU, or max-pooling. The tasks are then combined with the model metrics found in Section 4.1.2: latency, resources, throughput, and quality. That data is then fused to create a task implementation that is referenced by the proposed heterogeneous aware scheduler (Section 7). The scheduler is designed to solve for the design objective of the more refined solution set in the equations of Section 4.2.1. In the following subsections, we provide the design methodology for generating each platform's metric models based on the tasks described in Section 2.1. Each metric model is evaluated with the following standard criteria:
• task-what the task is
• task_type-what the implementation is
• platform-what it is deployed on, hardware or software
• latency-time to first solution
• data_precision-what data precision is being used
• throughput-the number of solutions in one second using the specified data precision
These criteria are used by each specific platform metric model, hardware and software, so that the scheduler can evaluate its use in the solution to Equation (16).
Hardware Metric Models
The hardware metric models are created using the post-HLS generation HDL code and combine, by averaging, the pre-synthesis estimates and ten iterations of the post-synthesis results for the given task. The hardware metric model contains these hardware-specific implementation details in addition to the general case:
• resources-BRAM, number of DSP, number of FFs, number of LUTs
To create the task's domain metric model, we generalize Equations (13) and (12) without the communications cost, Equation (15), which the scheduler adds to the final minimization solution when evaluating tasks on a platform. A metric model example of a direct convolution task, identified in Section 2.1.2, is implemented with the following configuration in hardware:
• task-convolution
• task_type-direct convolution
• platform-Zynq 7020
• latency-2.5e-5
• resources-[12; 5; 4,500; 5,800] 10
• data_precision-fp16
• throughput-4,000
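One way to hold such a record in software is sketched below in Python. The `MetricModel` class and the `find_models` helper are hypothetical names, and the mapping of the four resource values to BRAM, DSP, FF, and LUT assumes they follow the order in which the resources are listed above.

```python
from dataclasses import dataclass, field


@dataclass
class MetricModel:
    """One task implementation's metrics; fields follow the listed criteria."""
    task: str
    task_type: str
    platform: str
    latency: float          # time to first solution, as reported
    data_precision: str
    throughput: float       # solutions per second at data_precision
    resources: dict = field(default_factory=dict)  # hardware-specific detail


def find_models(models, task, data_precision):
    """Return the stored implementations of `task` at the requested precision."""
    return [m for m in models
            if m.task == task and m.data_precision == data_precision]


# Values from the hardware example; the BRAM/DSP/FF/LUT ordering is assumed.
hw_conv = MetricModel(
    task="convolution",
    task_type="direct convolution",
    platform="Zynq 7020",
    latency=2.5e-5,
    data_precision="fp16",
    throughput=4000,
    resources={"bram": 12, "dsp": 5, "ff": 4500, "lut": 5800},
)
matches = find_models([hw_conv], "convolution", "fp16")
```

A scheduler could call a lookup like `find_models` once per extracted task to gather the candidate implementations it must choose between.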
Solving Equations (13) and (12) 
Software Metric Models
Software task metric models are similar to the hardware metric models but differ in their software-specific resource tracking. To generate the latency and throughput numbers for these models, we compute the mean completion time over 100 runs, with execution time measured using the ARM's hardware timer. The additional metrics are:
• neon_resources-Is this accelerated by NEON, if so how many are necessary?
• cores-How many cores are necessary?
• resources_size-How many of a data precision can be processed at once, and what is the max?
We have added a new resource tracker named resources_size to account for the number of samples that fit in the ARM's 128-bit NEON vectors and for the 64/64-bit split limitation if it is present in the processor (ARMv7-A architecture only). A metric model example of a direct convolution task, identified in Section 2.1.2, is implemented with the following configuration in software:
• data_precision-fp16 (converted and executed as 32bit float, stored as 16bit)
• throughput-1,000
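The resources_size value can be illustrated with a short Python sketch that counts how many elements fit in one NEON operation. The function name `neon_lanes` is hypothetical, and the handling of the 64/64-bit split is one possible reading (a 128-bit operation processed as two 64-bit halves), not a detail confirmed by the source.

```python
def neon_lanes(element_bits, vector_bits=128, split_64=False):
    """Number of elements processed per NEON operation.

    split_64=True models the assumed 64/64-bit split, in which only
    half of the 128-bit vector is processed at a time.
    """
    width = vector_bits // 2 if split_64 else vector_bits
    if element_bits <= 0 or width % element_bits:
        raise ValueError("element width must evenly divide the vector width")
    return width // element_bits


# fp16 data is executed as 32-bit floats in the software example above.
lanes_full = neon_lanes(32)                   # full 128-bit vector
lanes_split = neon_lanes(32, split_64=True)   # with the assumed split
```

Under these assumptions, fp16 data stored as 16-bit values but executed as 32-bit floats gains no extra lanes over plain 32-bit floats.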
Solving Equations (13) and (12) for the software task metric model yields
Communications Considerations
Because SCALENet uses cross-platform tasking, successive tasks can run on different platforms, so the total processing time for a task must include the cost of communications. As discussed in Section 4.1.4, Equation (11) includes the factor $f_{pp}$, which represents the cost of moving data between two platforms in terms of added latency. Depending on the destination of the data, the configuration of the SCALENet implementation, and the heterogeneous platform, this can represent one of three possible transport methods: GP port, HP port, or external memory.
The following sections discuss the method for generating the cost for each of the communications paths on a Zynq 7000 series SoC.
Access Latency for GP and HP Ports.
Of the two port types, the HP port has a maximum throughput of (4 × 8 bytes × 150 MHz × 2)/(1,024 × 1,024 × 1,024) ≈ 8.9 GByte/s, while the GP port has a maximum throughput of only (2 × 4 bytes × 150 MHz × 2)/(1,024 × 1,024 × 1,024) ≈ 2.2 GByte/s. We evaluate burst sequential reads and burst writes in which external memory is used for heterogeneous task synchronization: PL to MEM, MEM to PL, PS to MEM, and MEM to PS. Due to limitations of the SCALENet architecture, only a single HP port is active and only 32 bits are populated, so the latency difference between the GP and HP ports is negligible. The latencies used in the scheduler are found by benchmarking 1,000 transactions on each combination. Therefore, the latency of the ports can be estimated as
where, in Equation (20), $PS_{2MEM}$ is the number of clock cycles for PS to write to DDR3, $MEM_{2PL}$ is the number of clock cycles for PL to read from DDR3, $PL_{2MEM}$ is the number of clock cycles for PL to write to DDR3, and $MEM_{2PS}$ is the number of clock cycles for PS to read from DDR3.
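The peak-throughput expressions for the two port types are easy to check numerically. The Python sketch below evaluates them as written, reading the trailing factor of 2 as separate read and write directions, which is an assumption. It reports the result with 1,024^3 bytes per GByte, the divisor used in the expressions; with 10^9 bytes per GByte the same byte rates come to 9.6 and 2.4.

```python
GIB = 1024 ** 3  # divisor used in the throughput expressions


def port_throughput(ports, bytes_per_beat, clock_hz, directions=2):
    """Aggregate peak bytes per second across a group of AXI ports."""
    return ports * bytes_per_beat * clock_hz * directions


# HP: four 8-byte ports; GP: two 4-byte ports; both clocked at 150 MHz.
hp_bytes = port_throughput(4, 8, 150_000_000)
gp_bytes = port_throughput(2, 4, 150_000_000)

print(f"HP: {hp_bytes / GIB:.2f} GByte/s (1,024^3), {hp_bytes / 1e9:.1f} (10^9)")
print(f"GP: {gp_bytes / GIB:.2f} GByte/s (1,024^3), {gp_bytes / 1e9:.1f} (10^9)")
```

These are aggregate peak figures across all ports of each type; with a single active HP port limited to 32 bits, as described above, the usable rate is far lower.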
SCALENet Heterogeneous Metric Models
The hardware and software metric models are built and applied using empirical data collected through simulation and benchmarking. Unlike existing DNN metric models, these empirical-data-based models are used in the next section to create heuristic representations for scaling problem and data sizes, and thus for allocating DNN kernel tasks to hardware or software.
PROPOSED HETEROGENEOUS AWARE SCHEDULER
Section 5.1 gives a general discussion of the design goals of the original SCALENet scheduler; in the following subsections, we clarify the functionality of the heterogeneous aware scheduler that incorporates the original. Figure 12 is a conceptual representation of the heterogeneous scheduler, which has been expanded from the original concept [39] to include a much more thorough analysis of scheduling hardware-software tasks using metric models. In terms of functional flow, the proposed heterogeneous scheduler has four inputs: a design objective file (whose contents, in terms of the metric models, are domain variables), the DNN model, a series of metric models for software, and a series of metric models for hardware implementations. Not pictured are the representative metric models for communications. Its output is a software tasking list and a hardware configuration file. Each file is an XML document that contains the key-value pairs necessary to implement the desired DNN inference model. Internally, the scheduler has changed significantly and now has a clearly bounded set of metric models on which to base its optimization evaluations.
General Flow of the Proposed Heterogeneous Scheduler
Minimizing the Cost Function
The heterogeneous aware scheduler is a four-step process whose priority is finding an optimal solution as specified in the design objectives file, which ranks the domain variables of latency, power, and throughput. Figure 13 provides a visualization of the following explanation.
Fig. 12. High-level conceptual diagram of the heterogeneous aware scheduler. Inputs to this are design objectives, a neural network model, the target Xilinx FPGA, and the data precision. The output from the heterogeneous aware scheduler is a task list of software functions and a hardware configuration file to provide to SCALENet's hardware generator.
Fig. 13. Internal representation of the heterogeneous aware scheduler. Inputs to this are design objectives, a neural network model, the target Xilinx FPGA, and the data precision. Internally, the scheduler evaluates the model representations for the FPGA, software, and communications to identify the tasking that meets the desired design objectives. The output is a task list of software functions and a hardware configuration file to pass to SCALENet's hardware generator.
Step one analyzes the trained Torch DNN model to identify the individual tasks that need to be assigned and locates the identified tasks in both platforms' metric models. Once these have been located, the scheduler runs a greedy algorithm that searches for the first set of tasks that satisfies Equation (16) without the communications latency, $L_{com}$. If the cost of the assigned platform tasks plus the communication latency meets the design constraints, then the scheduler moves on to step four. If the chosen platform tasking does not meet the design constraints, then the scheduler saves the current solution, sets a cost function goal of the current cost minus the additional latency, and loops back to look for new task metric models. This continues until one of three possible outcomes occurs. If a solution is found, then the scheduler saves the specific platform tasking, generates the software and hardware configuration files, and exits. If no heterogeneous solution is found, then the scheduler goes back and tries to solve for an all-hardware tasking and, if one is found, saves the two artifacts. If, however, there is no possible solution, then the scheduler reviews the current and past tasking solutions and provides the two closest implementations.
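The three possible outcomes can be sketched in Python as shown below. This is a deliberately simplified stand-in for the scheduler: it enumerates mixed hardware/software assignments in order of execution latency instead of iteratively tightening a cost goal, it treats latency as the only design constraint, and the task names, latencies, and the `"hw"`/`"sw"` labels are illustrative assumptions.

```python
from itertools import product


def exec_latency(assignment, tasks, models):
    """Sum of per-task latencies, ignoring communication."""
    return sum(models[t][p] for t, p in zip(tasks, assignment))


def total_latency(assignment, tasks, models, comm_latency):
    """Execution latency plus one transfer per platform change."""
    hops = sum(1 for a, b in zip(assignment, assignment[1:]) if a != b)
    return exec_latency(assignment, tasks, models) + hops * comm_latency


def schedule(tasks, models, comm_latency, deadline):
    """Return (outcome, [(latency, assignment), ...])."""
    tried = []
    options = [sorted(models[t]) for t in tasks]
    # Mixed assignments first, cheapest execution-only cost first.
    mixed = [a for a in product(*options) if len(set(a)) > 1]
    mixed.sort(key=lambda a: exec_latency(a, tasks, models))
    for a in mixed:
        total = total_latency(a, tasks, models, comm_latency)
        if total <= deadline:
            return "heterogeneous", [(total, a)]
        tried.append((total, a))
    # Fall back to an all-hardware tasking when every task has one.
    if all("hw" in models[t] for t in tasks):
        all_hw = tuple("hw" for _ in tasks)
        total = exec_latency(all_hw, tasks, models)
        if total <= deadline:
            return "hardware-only", [(total, all_hw)]
        tried.append((total, all_hw))
    # No solution: report the two closest implementations.
    return "closest", sorted(tried)[:2]


tasks = ["conv1", "relu1", "fc1"]
models = {
    "conv1": {"hw": 2.0, "sw": 6.0},
    "relu1": {"hw": 1.0, "sw": 0.5},
    "fc1": {"hw": 2.75, "sw": 2.5},
}
result = schedule(tasks, models, comm_latency=1.0, deadline=6.0)
```

With a deadline of 6.0 the first mixed assignment is accepted; tightening the deadline to 5.9 falls through to the all-hardware tasking, and a deadline of 5.0 returns the two closest candidates instead.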
SCALENet Heterogeneous Scheduler Comparison to Existing Solutions
Unlike current solutions for schedule-driven co-designs, SCALENet is precisely tuned and implemented for DNN kernels and CPU+FPGA platforms. The scheduler is composed of an offline scheduler and an online scheduler that handle two different aspects of DNN deployment. The offline scheduler uniquely evaluates and incorporates multiple design objectives and metric models to target the best solution for low-power embedded architectures. The process looks for the most power-efficient solution that uses all available processing technologies, CPU and FPGA, to maximize the performance per watt. It does not necessarily solve only for the highest-throughput solution at the cost of more resources and higher power consumption. The second part unique to SCALENet is its online hardware scheduler, which enables power savings at run-time by applying PE-based clock gating to unused or underutilized PEs and cache. This allows layer-by-layer tuning of active and inactive resources at execution time.
EVALUATION
A principal aspect of the proposed accelerator and scheduler to be evaluated is the impact of the design on classification throughput, energy, and area utilization. In Section 8.2, SCALENet is deployed heterogeneously with the division of tasking determined by the scheduler, while in Section 8.4 SCALENet is evaluated both as a hardware-only implementation and heterogeneously.
Targeted Platforms and Measurement Setup
To better gauge the performance of the proposed accelerator, we evaluated SCALENet running on an FPGA-based SoC platform for all network architectures. The FPGA platform is the Zedboard, which contains a Zynq-7000 SoC with a dual-core ARM Cortex-A9 and FPGA. For comparison, we include NVIDIA's low-power Jetson TK1 platform, comprising the TK1 SoC with a quad-core ARM Cortex-A15 and K1 GPU. For each platform, the baseline references are the corresponding CPUs running the networks in the scientific computing framework Torch, supporting CUDA 6.5 and the CUDA Deep Neural Network library. Figure 14 shows the platforms and evaluation setup. For each platform, we measure in real time both the execution time and the power consumption required to classify sample data. To accomplish this, we record these metrics for 1,000 samples and then average them to derive the per-classification performance. For power results, we measure only the consumption of the processor and external DDR memory. We used an Arduino Uno driving a TI INA219 voltage and power IC connected to each system's primary power rails to ensure consistency. For each platform, great care was taken to disconnect and power down all other peripherals, including HDMI, debug circuitry, and Wi-Fi/Bluetooth.
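The averaging step can be sketched in Python as below. The function name and the derived energy figure are assumptions added for illustration: the text describes averaging time and power over the recorded samples, and energy per classification is computed here as the mean of each sample's time multiplied by its power.

```python
def per_classification(times_s, powers_w):
    """Mean time (s), mean power (W), and mean energy (J) per classification
    from paired per-sample measurements."""
    if len(times_s) != len(powers_w) or not times_s:
        raise ValueError("need equally sized, non-empty sample lists")
    n = len(times_s)
    mean_time = sum(times_s) / n
    mean_power = sum(powers_w) / n
    mean_energy = sum(t * p for t, p in zip(times_s, powers_w)) / n
    return mean_time, mean_power, mean_energy


# Two illustrative samples; a real run would pass 1,000 recorded pairs.
mean_time, mean_power, mean_energy = per_classification([0.5, 0.25], [2.0, 4.0])
```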
Image Processing
CIFAR-10
Dataset. The CIFAR-10 dataset consists of 60,000 32 × 32 color images in ten classes. While the input images are small, achieving good accuracy still requires reasonably deep neural networks. For this dataset, we target four neural networks: VGG-A, VGG-D, a modified SqueezeNet, and Sparse VGG-D [17, 41]. The standard SqueezeNet is designed for the ImageNet dataset with images of size 224 × 224. We modified the SqueezeNet network for CIFAR-10 by removing a set of convolutional layers and reducing the final pooling layer to cover a 4 × 4 receptive field instead of 14 × 14. Sparse VGG-D is the same network as VGG-D but with three generalized sparsification techniques applied [32]. Figure 15 compares the four networks in terms of computation, memory, and top-1 test accuracy. Both SqueezeNet and Sparse VGG-D require significantly less computation than VGG-A and VGG-D while achieving similar accuracy. Sparse VGG-D can reduce computation and memory by 60% and 93%, respectively.
ImageNet
Dataset. The ImageNet dataset consists of 10 million color images scaled to 224 × 224 with 1,000 assigned classes. The specific dataset used comes from the 2012 Large-scale Visual Recognition Challenge (ILSVRC). ILSVRC is a much more difficult dataset and requires a higher-complexity network to achieve good accuracy. For this dataset, we target four neural networks: AlexNet, Sparse AlexNet, SqueezeNet, and Inception v3. The comparison is shown in Figure 16. Sparse AlexNet is similar to AlexNet, with three sparsification techniques applied in addition to filter factorization and replacement of the FC layers with two 3 × 3 convolutional layers and a spatial pooling layer [32]. Sparse AlexNet, SqueezeNet, and Inception all require substantially less memory than AlexNet by replacing FC layers with convolutional layers. SqueezeNet and Sparse AlexNet, in particular, need 49% and 21% less computation than AlexNet, respectively. Inception requires 635% more computation than AlexNet but increases top-5 accuracy by 14%.
Heterogeneous Implementation Evaluation
For the following evaluation, the image datasets and their respective DNN implementations are optimized by the SCALENet scheduler to create the lowest-power heterogeneously deployed implementation. For the CIFAR-10 and ImageNet datasets, the SCALENet scheduler design objectives ranked power as the dominant objective, with throughput as the secondary objective. When targeting the Xilinx Zedboard, these design objectives produced a heterogeneous design that split larger convolutional layers, such as those found in the Inception DNN, between hardware and software. Layers of lower computational complexity, such as pooling, were implemented as software kernels. In all cases, the CIFAR-10 and ImageNet networks used direct convolutional layers instead of FFT-based convolution. Table 1 compares using SCALENet (accelerator and heterogeneous aware scheduler) to accelerate nine DNN architectures against using only the ARM CPU. On average, implementations using SCALENet provide a performance increase of almost 10× while also reducing power by 4.8×. The first four networks are the networks trained on the CIFAR-10 dataset, which have varying memory and computational requirements. The next four networks are the networks trained on the ImageNet dataset. The last dataset, Seizure, is discussed in the section below and provides a real-time processing use case for the SCALENet accelerator. Image throughput is calculated from the average time per image when processing 1,000 images one at a time (batch = 1). The execution time for an image is the total time for 1,000 images divided by 1,000. The power consumption is found by measuring the power over the same period of processing 1,000 images and dividing the total by 1,000 to find a per-image power consumption. To calculate the improvement between the CPU-only implementations and the SCALENet accelerator, we used the following formula:
Seizure Detection
NNs can be trained to process time series data (e.g., EEG data) for certain tasks (e.g., seizure detection), and have been extensively studied in the literature [16, 19, 23, 24]. Using our previous work in time series seizure detection as a basis for EEG detection [34], our architecture is a reduced ResNet-20 with 13 convolution layers in which the last two convolution layers are replaced with FC layers. We chose ResNet because of our experience with its performance in previous research and its low memory requirements for parameters. The input matrix is a fusion of 24 EEG channels from the dataset of Reference [12]. Here, 24 consecutive time series samples from each channel are fused to form a 24 × 24 matrix, and a 3 × 3 filter is applied. The set of fused channels is defined through a sliding window that moves from 1 sample to 24 samples per window, for 24 channels. At the start and end, when there is not enough data to fill the window, samples are set to 0 and shifted in or out. Each matrix of samples passes through the modified network as an image. The constructed model had an accuracy of 86.98%, which is similar to that of Reference [3]. We implemented the ResNet network in SCALENet using both direct and FFT convolutional PEs. The inputs to the SCALENet scheduler were: the modified ResNet model, the latency requirement of one sample every 3ms for a 256Hz EEG, a Zynq 7020 target, and 16-bit FP. The scheduler guided pure hardware designs with 32 direct convolution PEs for the direct solution and 12 FFT convolution PEs for the FFT-based solution, with 2 FC PEs in both cases. Previously, the original scheduler [39] was unable to find a heterogeneous solution that could meet the real-time deadline of 3ms. However, the proposed heterogeneous aware scheduler was able to find a solution set that used both platforms. The last three rows of Table 1 compare the three solutions.
With meeting the real-time deadline as the design objective, the resulting solutions produced a decision every 3.05ms, 2.01ms, and 3.15ms for hardware direct, hardware FFT, and heterogeneous, respectively.
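The channel fusion and zero-filled sliding window described above can be sketched in plain Python. This is one possible reading of the description, not the authors' preprocessing code: the function names, the stride of one sample, and the choice to start the first window 23 samples before the recording are assumptions made for illustration.

```python
def fuse_window(channels, start, width=24):
    """Build a len(channels) x width matrix whose row c holds samples
    start .. start + width - 1 of channel c. Indices that fall outside
    the recording are filled with 0.0."""
    n = len(channels[0])
    return [
        [ch[i] if 0 <= i < n else 0.0 for i in range(start, start + width)]
        for ch in channels
    ]


def sliding_windows(channels, width=24, stride=1):
    """Yield fused matrices. The first window holds a single real sample
    (the rest are zeros), and windows continue until the last real sample
    has shifted out to the first column."""
    n = len(channels[0])
    for start in range(-(width - 1), n, stride):
        yield fuse_window(channels, start, width)


# Synthetic data: 24 channels, 30 samples each, values chosen to be non-zero.
channels = [[c * 100 + t + 1 for t in range(30)] for c in range(24)]
windows = list(sliding_windows(channels))
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each yielded matrix is 24 × 24 for 24 channels and could then be passed to the network as a single-channel image.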
COMPARISON
Comparison with COTS
We compared SCALENet against NVIDIA's Jetson TK1 mobile GPU platform on all networks. Figure 17 shows the efficiency results of running on the three platforms for the four CIFAR-trained networks with and without accelerators. On all platforms, there is a large increase in efficiency with the accelerator. The SCALENet accelerator can improve efficiency over its host processor by 3× on average, whereas the K1 GPU can increase efficiency by 4.5× over its ARM CPUs. Figure 18 similarly compares the efficiency of the FPGA and GPU platforms when targeting the ImageNet networks. The Zedboard with the SCALENet accelerators achieves an average increase in efficiency of 9×, whereas the TK1 accelerated with the GPU increases efficiency by 5× on average. The last row of Table 2 outlines the power efficiency of the platforms. The savings from using the SCALENet architecture, combined with the proposed heterogeneous scheduler, are measured against an ARM Cortex-A9 processor and the TK1 running Torch compiled with CUDA 7.5 and the OpenBLAS libraries. The execution time improved by 99.7%, while the power improved by 74.10%. Table 2 compares the proposed heterogeneous scheduled SCALENet with existing well-known FPGA-based CNN accelerators. The network complexity is calculated as shown in Equation (22). Equation (23) calculates the platform's performance for a network in terms of giga operations per second (GOP/s). Finally, the calculation of a platform's energy efficiency for a network is shown in Equation (24).
\[
\text{Platform's Energy Efficiency}_{\text{Network}} = \frac{\text{Platform Performance}_{\text{Network}}}{\text{Power}_{\text{Platform}}} \tag{24}
\]
Deployed on the Zedboard platform with the Cortex-A9 running at 667.67MHz and the FPGA running at 150MHz, the scheduled accelerator achieves an energy efficiency of 5.70GOP/J with a total system power of 2.07W and a throughput of 15.92GOP/s. SCALENet delivers higher energy efficiency for a given network complexity than the previous best accelerators [11, 48], while targeting a much lower power utilization. Compared to Reference [11], which uses a Xilinx SoC from the same family, SCALENet provides a much lower operating power consumption, due in part to its offline and run-time schedulers. The scheduler enables the workload of the most computationally complex layers to be shared between hardware and software. This ensures that the design optimizes power consumption by keeping the embedded processors busy. The scheduler also allows SCALENet to apply clock gating to idle PEs on a layer-by-layer basis, decreasing power consumption during execution whenever possible. Additional gains in SCALENet's efficiency are achieved by exploiting the design's native ability to perform fused convolution, batch normalization, and ReLU without having to move data off the executing platform between these layer functions.
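As a worked illustration of the energy efficiency relation in Equation (24), the short sketch below divides a platform's performance by its power. The first helper assumes that performance in GOP/s is obtained as network complexity divided by execution time, which is a common convention but an assumption here, since Equations (22) and (23) are not reproduced in this section. The function names and input numbers are hypothetical and are not the paper's measurements.

```python
def platform_performance_gops(network_complexity_gop, execution_time_s):
    """Assumed form of the performance metric: operations per network pass
    (in GOP) divided by the time taken for one pass (in seconds)."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return network_complexity_gop / execution_time_s


def energy_efficiency_gop_per_j(performance_gops, platform_power_w):
    """Energy efficiency as performance divided by platform power.
    GOP/s per watt is the same unit as GOP/J."""
    if platform_power_w <= 0:
        raise ValueError("power must be positive")
    return performance_gops / platform_power_w


# Illustrative inputs only: a 12 GOP network classified in 1.5 s at 2.0 W.
perf = platform_performance_gops(12.0, 1.5)
eff = energy_efficiency_gop_per_j(perf, 2.0)
print(perf, eff)
```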
CONCLUSION
In this work, we proposed contributions in three areas: a domain-specific metric model for DNN tasks, the SCALENet hardware accelerator, and a heterogeneous aware scheduler that optimizes task allocation based on the DNN task metric models. For the first contribution, we stepped through the creation of the domain-specific metric models and of the solvable equations necessary to evaluate tasks allocated to different platforms. For the second contribution, we proposed SCALENet, a SCalable Low-power Accelerator for real-time deep neural Networks, which offers flexible DNN support with acceleration of CNN and FC layers in hardware and software. For the heterogeneous aware scheduler, we described how the metric models are created for each of the DNN tasks and provided an example for both hardware and software. We then discussed how the scheduler solves for the allocation solution based on the domain objectives. After discussing the theory, we evaluated SCALENet's scheduled hardware accelerator on a Zedboard with nine different DNNs: eight image processing networks and one custom time series network for biomedical seizure detection in EEGs. We demonstrated that the proposed heterogeneous scheduled solution was able to meet the real-time deadline that the EEG imposes, whereas in previous work this was possible only with a hardware-only solution. We evaluated the proposed hardware accelerator and scheduler with the highest-complexity network considered, Inception v3 with 11.78 GOP, on the Zedboard platform with a dual-core ARM Cortex-A9 running at 667.67MHz and the FPGA running at 150MHz, and achieved an energy efficiency of 5.70GOP/s/W with a total system power of 2.07W and a throughput of 15.92GOP/s. The proposed solution produced higher energy efficiency than the prior best accelerators as well as the Jetson TK1, while targeting a power profile that is more than 4× lower than that of the Jetson TK1.
