The rise of deep neural networks (DNNs) is inspiring new studies in a myriad of edge use cases with robots, autonomous agents, and Internet-of-things (IoT) devices. However, in-the-edge inferencing of DNNs is still a severe challenge, mainly because of the contradiction between the inherently intensive resource requirements of DNNs and the tight resource availability in several edge domains. Further, because communication is costly, taking advantage of other available edge devices is not an effective solution in edge domains. Therefore, to benefit from available compute resources with low communication overhead, we propose new edge-tailored perception (ETP) models that consist of several almost-independent and narrow branches. ETP models offer close-to-minimum communication overheads with better distribution opportunities while significantly reducing memory and computation footprints, all with a trivial accuracy loss for tasks that are not accuracy-critical. To show the benefits, we deploy ETP models on two real systems, Raspberry Pis and edge-level PYNQ FPGAs. Additionally, we share our insights about tailoring a systolic-based architecture for edge computing with FPGA implementations. ETP models created based on LeNet, CifarNet, VGG-S/16, AlexNet, and ResNets and trained on MNIST, CIFAR10/100, Flower102, and ImageNet achieve maximum and average speedups of 56x and 7x, respectively, compared to the originals. ETP complements existing single-device optimizations for embedded devices by enabling the exploitation of multiple devices. As an example, we show that applying pruning and quantization to ETP models improves the average speedup to 33x.
I. INTRODUCTION & MOTIVATION
In-The-Edge Inferencing: Advances in deep neural networks (DNNs) have brought revolutionary changes to domains such as robotics [1] [2] [3] [4] [5], unmanned aerial vehicles (UAVs) [6] [7] [8] [9], and Internet-of-things (IoT) [10] [11] [12] [13] [14] [15]. In such domains, in-the-edge inferencing is rapidly gaining ground due to ubiquitous wireless networks and the availability of embedded processors. This paper targets in-the-edge inferencing in environments such as smart home/city/office (e.g., connected cameras, game consoles, TVs, routers) or collaborative robots/drones (e.g., disaster relief [16] [17] [18], agriculture [19], farming [20], mining [21], construction [22], mapping [18, 23]), in which: (i) accuracy is not the ultimate goal (e.g., detecting a human sound in a disaster area with either 87% or 90% accuracy calls for further investigation either way); (ii) the network of devices is standalone (i.e., an Internet connection is not necessary and, for the sake of security, no data is offloaded, e.g., in military applications); and (iii) the network has unified ownership; hence, data communication among devices does not put privacy or security at risk or incur monetary cost.
The Key Challenge: Privacy concerns [24] [25] [26], unreliable connection to the cloud, tight real-time requirements, and personalization are pushing inferencing to the edge. Despite these numerous driving forces for in-the-edge inferencing, the key challenge is that fast inferencing demands substantial compute and memory resources [27], which conflicts with the limited energy and computational resources of edge devices (i.e., resource-constrained devices [28, 29]). Such demands are not expected to slow down, as modern models [30, 31] encapsulate more parameters for better generalization.
Current Solutions & Limitations: (1) The first widespread approach is to offload all computations to high-performance servers of cloud providers [32, 33].
However, cloud-based offloading is not always available (e.g., no Internet access) and is often subject to unreliable network latency. Furthermore, with the exponential increase in the number of edge devices [34] and the scale of raw collected data, centralized cloud-based approaches might not scale [35, 36]. (2) The second approach to deal with the limited resources on edge devices is to distribute computations by taking advantage of existing surrounding devices, such as cameras and other mostly idle devices. Distribution is based on common data- or model-parallelism methods [37, 38]. In data parallelism, the entire model is duplicated on each device for performing separate inferences. Hence, the system needs several live and concurrent inputs to be efficient without real-time jitter. Simply put, data parallelism only increases throughput. In model parallelism, the model is divided and distributed across several devices for the same inference. Neither data- nor model-parallelism jointly reduces communication, memory usage, and computations, as Table I depicts in detail. Moreover, although model parallelism could decrease execution latency in theory, in reality it incurs high communication latency.
Our Solution: To address the resulting high latency of model parallelism, we explore an efficient model distribution method, edge-tailored perception (ETP) models, that:
• Reduce Communication: ETP models replace a single, wide, and deep model with several narrow ones that only communicate for input and pre-final activations. Thus, their communication load is low with distributions (see Table I).
• Reduce Compute & Memory Footprints Per Node: ETP models have fewer connections than the original models, so their number of parameters and computational demands are also lower than those of the peer model-parallelism versions, as shown in Table I.
• Allow Inter-Layer Parallelism: Narrow branches in ETP models are independent of each other, which enables inter-layer parallelism. This is in contrast with model-parallelism methods that only allow intra-layer parallelism due to the single-chain dependency between consecutive layers.
With ETP, our goal is to illustrate the advantages of models designed with the characteristics of their underlying computational domain in mind.
To recover a possible accuracy loss caused by fewer connections and parameters, we create enhanced versions of ETP models whose accuracy is similar to (within 3% of) that of the original models, with only a fraction (5%-24%) of their compute and memory demands. Such models enable faster local execution with a slight accuracy loss for common, error-tolerant tasks (e.g., counting the cars passing an intersection). If a task requires high accuracy (e.g., finding a specific license plate), the system relies on existing offloading technologies: (1) cloud-based, (2) collaborative [39], or (3) fog-based [40].
ETP is orthogonal and an addition to current techniques for reducing the computational demand of models, such as weight pruning [27] and quantization [41]. ETP models offer distribution/parallelism opportunities for distributed computing, whereas current techniques apply accuracy/performance tradeoffs to single-node models with a single chain of inter-layer dependency. Such techniques can be applied to each branch of ETP models, as shown in §IV-C. Hence, ETP models complement such techniques rather than compete with them.
Experiments Overview: (1) We design, train, and evaluate ETP models based on computer vision DNNs on the MNIST [42], CIFAR10/100 [43], Flower102 [44], and ImageNet [45] datasets (total of 50 training results). (2) To evaluate the execution performance of ETP models, we conduct real-world implementations on two systems: a system with up to ten Raspberry Pis (RPis), and another with two PYNQ boards. RPis are chosen because they represent the de facto choice for several robotic and edge use cases and they are readily available [46] [47] [48] [49] [50]. (3) We tailor a TPU-like [51] architecture for edge computing and share our findings with an edge-based FPGA. We tailor the microarchitecture for edge computing by maximizing data reuse and enabling data streaming from the memory. We enhance the microarchitecture with a data-driven execution model that eliminates the instruction overheads of the TPU. (4) Finally, we estimate the area and power of a 28 nm ASIC implementation of the tailored architecture.
Contributions: Our contributions are as follows:
• We propose the first DNN optimization technique to reduce the communication overhead in a distributed system for in-the-edge inferencing.
• We propose ETP models that enable inter-layer parallelism for in-the-edge inferencing while reducing the total memory and computation footprints.
• We conduct real-world experiments on Raspberry Pis and PYNQ boards, and share our findings on an edge-based FPGA for an edge-tailored TPU-like architecture.
II. CHALLENGES OF EDGE COMPUTING
A. Growing DNNs & Resource Limitation
DNN models consist of several layers, each of which performs specific computations. The computations are based on custom weights that are learned during the training phase with back-propagation. In the inference phase (i.e., prediction), feed-forward computations are performed on batched inputs and the learned parameters stay constant. For edge computation, the most compute- and data-intensive layers [52, 53] are fully-connected and convolution layers.¹ In fact, DNNs are inherently compute-intensive; Figure 1 shows the number of multiply-accumulate operations and the parameter size of several DNN models. The left bars illustrate basic models such as LeNet [54] and CifarNet [43]. On the right side, we illustrate the YOLO [55] and C3D [56] models that are used for videos. The newest language model, BERT [30], significantly surpasses all the previous models in both parameter size and computations. As shown, newer models encapsulate more parameters and perform more computations than their predecessors for better and more generalized feature understanding. In short, this trend means that modern models will inevitably surpass the capabilities of any resource-constrained device.
B. Single Device Pareto Frontier
The capabilities of resource-constrained platforms are limited. To gain a better understanding, Figure 2 depicts the latency per image of state-of-the-art image recognition models for the ILSVRC 2012 challenge [45] on an RPi [57]. All implementations heavily utilize modern machine learning optimizations such as pruning [27], quantization, low-precision inference [41, 58, 59], and handcrafted models [60]. Additionally, the models are highly optimized for ARMv8 architectures using the ELL compilation tool [61]. However, a single device cannot achieve higher execution performance than its Pareto frontier allows. As seen, the latency for high-accuracy models is longer than 400ms, and generally, latencies are longer than 100ms. In addition, the data shown in the figure is only for image-recognition models, and DNNs in other domains are already surpassing these models in size and complexity. Fitting such exponentially increasing computation on a single device, especially an edge device, is a limiting factor for executing DNNs in the edge. In other words, even after applying all optimization techniques for DNNs in embedded systems, the single-device Pareto frontier still limits the widespread applicability of DNNs in several in-the-edge applications.
Fig. 2. Latency-accuracy Pareto frontier. Single device: latency per image on RPi3 for state-of-the-art ILSVRC models with the optimized platform-specific compilation tool ELL [61] [57]. Multiple devices: breaking the single-device Pareto frontier, but with significant communication overhead.
C. Current Limitations of Distribution Methods
(1) Data parallelism parallelizes the computations of independent inputs. Among distribution methods, data parallelism [37, 38] keeps computation and memory footprints per device similar to those of the original DNN (see Table I). Data parallelism is ill-suited to the edge because it: (i) serves several independent requests, which are limited in an edge environment; (ii) does not reduce latency, which is important in several real-time edge applications; and (iii) does not reduce computation and memory footprints per device.
(2) Model-parallelism methods divide the computations of a DNN model for a single request. These methods first divide the computations based on the layers in a model and then internally divide the computations within each layer while keeping dependencies intact. Depending on the type of the layer, the division could take several forms. In Figure 3, we provide a simple example of distributing a fully-connected (fc) layer. The computation of an fc layer is $\sum_i x_i w_i + b$, in which $w$, $x$, and $b$ are weights, inputs, and biases, respectively. We can write this computation as a matrix-matrix multiplication, or $Wx + b$. There are two extremes of model parallelism, input and output splitting [5]. In output splitting, producing each set of outputs is divided among the devices. In input splitting, the input is split and each device computes all parts of the output that depend on its received input. As shown in Figure 3, each technique has a specific communication overhead. Output splitting requires the transmission of the input to all nodes. Input splitting requires the transmission of partial sums to a final node for summation. New model-parallelism methods can also be crafted by mixing these two extremes, but they suffer from the same overheads.
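The two extremes described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names (`fc_output_split`, `fc_input_split`), the even slicing across devices, and the toy sizes are assumptions made here, not the implementation evaluated in the paper. Devices are simulated by loop iterations, and the comments mark where data would be transmitted.

```python
import random

def matvec(W, x):
    """Dense product W x for a list-of-rows matrix W and a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def fc_reference(W, x, b):
    """Single-device fc layer: y = W x + b."""
    return [y + bias for y, bias in zip(matvec(W, x), b)]

def fc_output_split(W, x, b, n_dev):
    """Output splitting: every device receives the FULL input x and
    computes a contiguous slice of the output elements."""
    rows = len(W)
    step = (rows + n_dev - 1) // n_dev
    y = []
    for d in range(n_dev):
        lo, hi = d * step, min((d + 1) * step, rows)
        y.extend(fc_reference(W[lo:hi], x, b[lo:hi]))  # slices are concatenated
    return y

def fc_input_split(W, x, b, n_dev):
    """Input splitting: every device receives a slice of x plus the matching
    columns of W and sends a full-length vector of partial sums to a final node."""
    cols = len(x)
    step = (cols + n_dev - 1) // n_dev
    total = list(b)  # the final node adds the bias exactly once
    for d in range(n_dev):
        lo, hi = d * step, min((d + 1) * step, cols)
        partial = matvec([row[lo:hi] for row in W], x[lo:hi])
        total = [t + p for t, p in zip(total, partial)]  # summation on the final node
    return total

random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(4)]
x = [random.uniform(-1, 1) for _ in range(6)]
b = [random.uniform(-1, 1) for _ in range(4)]

ref = fc_reference(W, x, b)
out_split = fc_output_split(W, x, b, n_dev=2)
in_split = fc_input_split(W, x, b, n_dev=3)
print(max(abs(p - q) for p, q in zip(out_split, ref)))
print(max(abs(p - q) for p, q in zip(in_split, ref)))
```

Both variants reproduce the single-device result up to floating-point rounding; they differ only in what is transmitted (the whole input versus full-length partial sums), which is the overhead the paragraph above refers to.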
Several model-parallelism techniques also exist for convolution layers by converting their computation to matrix-matrix multiplication [62, 63] , and they are similar to the example provided for fc layer. In summary, since model-parallelism techniques do not change the internal network connections of a model, after distribution, we need to keep the dependency chains intact. Hence, although model parallelism reduces the compute and memory footprint per device, the communication overhead resulting from the tightly interconnected layers and inter-layer dependencies stays the same as the original model.
D. Communication Challenges
The dependence of current distribution methods on a high volume of communication induces the straggler problem, in which a system is held back by its slowest node. Specifically, since edge devices usually use a wireless or mobile network, latency deviations are high. Figure 4 depicts the histogram of prediction latencies on a distributed IoT system consisting of six RPis executing AlexNet [64] with model parallelism. The computing time is bounded to 500ms, but the average delay is ≈ 2x longer (and ≈ 4x for tail latency). To gain perspective, Figures 5a and 5b depict the VGG-S model and its distributed version with model parallelism, respectively. The VGG-S model has a similar parameter size and compute density as AlexNet [37] (Figure 1) and is designed for the Flower102 [44] dataset. As seen, dependencies enforce a strongly interconnected network among the divided parts. Although several techniques such as compression could alleviate the cost of communication, they do not reduce the number of connections. As a result, a target distribution method for edge devices must, besides yielding low memory and computation footprints per node, reduce the number of connections.
Relying extensively on communication in model-parallelism techniques also imposes another challenge: finding an optimal distribution. This is because each distribution depends heavily on communication and network traffic, which change over time. Hence, a single distribution is not always the best answer, and actively profiling the distribution in real time is necessary. In particular, finding an optimal distribution is an NP-hard problem. Since our distribution reduces the dependence on communication, it reduces the complexity of the search space. In fact, our models only need to communicate the input and the activations before the final classification layer. Additionally, for the same reasons, ensuring reliability in our models is easier than in model-parallelism versions.
In summary, with parallel execution on multiple devices, we could ideally surpass the single-device Pareto frontier. However, as shown in Figure 2, current distribution methods are limited by the communication overhead and the inherent inter-layer data dependency in models. The next section proposes ETP models, which significantly reduce communication and allow inter-layer parallelism.
III. EDGE-TAILORED PERCEPTION
This section first provides details on ETP models, discussions about their key features, and their design procedure. The second part of this section focuses on insights about how to tailor a TPU-like systolic-based architecture for edge computing. Finally, the last subsection presents details on the microarchitecture design.
A. ETP Models
Simple Example of ETP Models: We illustrate the design procedure of ETP models by following our simple example in §II-D. To design ETP models, we first divide each layer of the original model into equal parts (e.g., split in 2) and then remove their intermediate connections. Figure 6a shows a new model based on the VGG-S model. As shown, by removing the intermediate connections between the two branches, we create two independent and parallel branches. We keep only the input and pre-final layer connections so that the model acts as a single model. Since, in our example, each branch is half the size of the original model and has fewer connections, the new model has fewer than half of the parameters of the original model. Moreover, communication is needed only for the input and for the activations before the final prediction layer. Figure 6b illustrates an example distribution of the new model. The computations of the final node can run on a new node or even on the user's device. Later in this section, we provide our design procedure to generalize this simple example. As summarized in Table I, ETP models have low memory and computation footprints, while their communication per inference is significantly reduced with the distribution.
Key Features of ETP Models: ETP models are designed by considering their underlying computation domain and have the following key features.
(1) ETP models only communicate for input and pre-final activations. Therefore, they significantly reduce communication overhead in a distributed system. Additionally, the low communication load per inference helps with the straggler problem. This is in contrast with model parallelism, which depends heavily on communication between all the intermediate layers.
(2) ETP models split the size of a layer, so the total parameter size and computation complexity of the model are reduced. Therefore, ETP models require a smaller parameter size, less computation complexity, and no communication between the nodes for intermediate layers. These lower memory and computation footprints allow edge devices to operate efficiently within their limited resources (e.g., no swap space activities due to limited memory).
(3) ETP models replace the original wide model with several narrow and independent branches. Since the computations of the branches no longer depend on each other, in contrast with the single chain of dependency in the original model, a distributed system can concurrently execute all branches. In other words, ETP models allow us to go beyond intra-layer parallelism in a model.
Design Procedure of ETP Models: Figure 7 describes the design procedure of ETP models. We start by inputting our DNN model architecture and its per-layer memory and computation footprints. Similarly, we input the specification of the hardware, such as its memory, its computation capability, and any overhead that is associated with executing a DNN on our hardware. For instance, several DNN frameworks have a memory overhead on Ubuntu running on Raspberry Pis. A splitter procedure (Algorithm 1) repeatedly, in a while loop, splits the model, cuts the connections, and measures the approximate footprints of each branch. The Division Factor, a hyperparameter, defines the granularity of division/splitting. The loop exits when a single branch fits on a device (both memory- and computation-wise). Removing non-branch connections is a simple operation that keeps only one connection per layer. The model derived from the splitter is the split-only model. By training the split-only model and testing it, we measure its accuracy. Then, depending on the accuracy requirement of our task, we either fatten each branch by F%, a hyperparameter, or output the model. If we decide to fatten each branch by F%, since we are introducing new weights, we retrain the new split-fattened model. During such retraining, we choose the best accuracy among the transfer-learning and from-scratch versions. The process continues in a loop until we are within an acceptable error range for our task, ε, a hyperparameter. Finally, we output the model and its weights. We showcase various ETP models with 5 datasets and 8 models, including ImageNet, in §IV-A.
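The control flow of the design procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions, not Algorithm 1 itself: the footprint model (`branch_params`, which counts only fully-connected weights and treats memory and compute as one number), the `design_etp` name, the `train_and_test` callback, and the `toy_accuracy` stand-in are all hypothetical and introduced here only for illustration.

```python
def branch_params(widths, n_branches, fatten=0.0):
    """Approximate weight count of ONE branch (assumed cost model).

    widths = [input, hidden_1, ..., hidden_k]; the classification layer is
    left out because it runs on the final node. Each hidden layer is divided
    by n_branches and then widened by the fraction `fatten`.
    """
    hidden = [max(1, round(w / n_branches * (1.0 + fatten))) for w in widths[1:]]
    sizes = [widths[0]] + hidden  # every branch receives the full input
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


def design_etp(widths, device_limit, train_and_test, target_acc,
               division_factor=2, fatten_step=0.10, eps=0.03, max_rounds=10):
    """Split until a branch fits on a device, then fatten until the accuracy
    is within eps of the target. train_and_test is a user-supplied callable."""
    # Splitter loop: the Division Factor sets the granularity of splitting.
    n_branches = 1
    while branch_params(widths, n_branches) > device_limit:
        n_branches *= division_factor
        if n_branches > max(widths[1:]):
            raise ValueError("a single branch cannot fit on the device")

    # Fattening loop: widen every branch by F% per round and retrain.
    fatten = 0.0
    for _ in range(max_rounds):
        acc = train_and_test(n_branches, fatten)
        if target_acc - acc <= eps:
            break
        fatten += fatten_step
    return n_branches, fatten, acc


def toy_accuracy(n_branches, fatten):
    # Placeholder numbers that only exercise the control flow;
    # these are NOT measured results.
    return 0.90 - 0.02 * (n_branches - 1) ** 0.5 + 0.1 * fatten


n, f, acc = design_etp([784, 512, 512], device_limit=150_000,
                       train_and_test=toy_accuracy, target_acc=0.90)
print(n, f, round(acc, 4))
```

In a real flow, `train_and_test` would train the split-only or split-fattened model and return its test accuracy, and the footprint check would use per-layer memory and computation figures for the target hardware, including framework overheads.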
B. Tailoring Hardware for Edge
Current Deep Learning Accelerators: Systolic-array [65] based designs are the key hardware design for executing DNNs, with advantages such as a high degree of concurrent processing through dataflow compute arrays [51, 66? ? -69]. TPUs are among the successful accelerators of this kind, with widespread usage in the industry. However, datacenter-level accelerators (such as TPU, Eyeriss [67], and EIE [?]) are tailored for data parallelism, which increases throughput by compromising end-to-end latency. In datacenters, such tradeoffs are valid, but our target is reducing single-batch inferencing latency, which is specific to edge computing with limited requests. Moreover, these accelerators depend on large on-chip SRAMs and complex instruction-based execution. Such resources are limited in the edge due to energy constraints and cost considerations.
ETP Models Promote Simpler Edge Hardware: Besides the benefits discussed in §III-A, ETP models encourage less complex hardware. Since each branch of an ETP model is divided with the constraints of the hardware in mind, the branches require fewer parameters, less computation complexity, and no communication between the modules for intermediate layers (hence, fewer constraints on data rates). As long as the original model is divided into common-size branches, as Algorithm 1 targets, the microarchitecture does not need to be separately fine-tuned for each model to achieve the best execution performance. We explore the effects of such simplifications on specialized hardware for edge devices. We tailor a TPU-like systolic-array architecture to study the benefits. We provide our insights in the rest of this section. Section IV-C shares our implementation experience with an edge-level FPGA.
Key Insights to Simplify Architecture: While we borrow the main idea of using systolic arrays for inferencing from the TPU, our design differs from it in the following key insights, which make the design suitable for edge implementation.
• Small Multiplier Array: Since the computation complexity in the branches of ETP models is within a specified range, we use a smaller multiplier array, 32x64 compared to 128x128.
• Data-Driven Execution Model: With a data-driven execution model, as opposed to sophisticated mechanisms for executing instructions, we eliminate the overheads of decoding and executing instructions.
• SRAM-Free Design: We utilize burst accesses to memory (DRAM) for streaming data through the systolic array with a memory-mapped design. Therefore, we remove the costly SRAM buffers that are common in prior work (e.g., TPU). The SRAM-free design is low cost and tailored to the edge.
• Separation of Multipliers and Adder Trees: Unlike the typical MAC-based systolic arrays in the TPU (and Eyeriss and EIE), we separate the multipliers from the adder trees in the interconnection, which facilitates partitioning and pipelining the operands.
• Simple Peripheral Logic: We use simple shift-and-increment logic for indexing data and directing data to/from the systolic array.
• Low Clock Frequency & Memory Bandwidth: We tailor the architecture for the edge by using a low-bandwidth off-the-shelf LPDDR2 memory, which is common in edge devices. Additionally, we use a very low clock frequency (i.e., 100 MHz).
C. Microarchitecture Details
Overview: Figure 8a illustrates the main microarchitectural components of our design, which comprise a weight-stationary systolic array [65] for implementing matrix-matrix multiplication. The following introduces our six design choices and explains how they align with the philosophy of processing independent splits and the goal of reducing single-batch latency. For instance, the structure of the multiplier array connected to the adder trees (which enables the flow of data in one direction rather than two), together with the simple standalone indexing logic, not only facilitates partitioning (e.g., Figure 8b) but also reduces latency from O(n) to O(log(n)). Besides, since the width of the systolic array defines the degree of memory parallelism, we chose a width of 64. This choice makes efficient use of memory-level parallelism and DRAM burst reads while utilizing the maximum number of connections to the memory.
(1) Systolic Array Cells: The systolic array cells are organized in a 32x64 array. Each cell includes a multiplier with two integer operands, one stationary and the other streaming (R1). At each cycle during computations, all of the multipliers are active, working synchronously on streamed data. R1 registers with streaming data are connected in a column within the array such that, at each cycle, their contents shift one row down. To reduce the connections between the array and the main memory, only the first row of the systolic array is connected to the memory. Moreover, each cell of the first row is connected to only one data stream line. Based on the type of an operand (t), streaming data is used either for initialization or for multiplication. The buffers storing stationary operands are connected similarly to the streaming registers. During initialization, stationary operands are poured into the connected buffers of a column, filling them through the connections between them.
(2) Stationary Operands: The stationary operands are often larger than the dimensions of the array. In such cases, we have to partition a multiplication into several small ones (as shown in Figure 8b ), more than one of which may share a streaming operand, but have distinct stationary operands. To avoid multiple loads of stationary registers, we choose to integrate a buffer for stationary operands at each multiplier (negligible overhead). As a result, the design serves requests with lower latency. Moreover, since each branch of the model has several layers, integrating these buffers allows a fast context switching without the overhead of reloading the stationary operands.
(3) Adder Trees: Each row of the array is connected to an adder tree. The number of adder trees, which is the same as the depth of the systolic array, defines the number of output elements generated at each cycle. Adder trees, pipelined in five ($\log_2 32$) stages, reduce the results of the multiplications to a single integer, which then contributes to creating an output element. During multiplications, the outputs of the multipliers in a row are routed to the adder trees to be summed. To maximize fine-grained parallelism, we want the width of the 2D weight matrix to match the number of adder trees, or the depth of the systolic array. To enable flexibility and fast in-the-edge execution, we employ a narrow array with a depth of 32.
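The logarithmic depth of an adder tree can be illustrated with a short software model. This is only a sketch of the pairwise reduction idea; the function name `adder_tree` and the input values are assumptions for illustration, and the model says nothing about the timing or pipelining of the actual hardware.

```python
def adder_tree(values):
    """Sum a list by pairwise addition, level by level.

    Returns (total, stages), where stages is the number of tree levels,
    i.e. the pipeline depth a hardware adder tree of this size would need.
    """
    level = list(values)
    stages = 0
    while len(level) > 1:
        if len(level) % 2:          # pad odd-sized levels with a zero
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        stages += 1
    return level[0], stages


# 32 multiplier outputs (arbitrary integers standing in for products)
products = [a * b for a, b in zip(range(32), range(32, 64))]
total, stages = adder_tree(products)
print(total == sum(products), stages)
```

For 32 inputs the reduction takes 5 levels, matching the five pipeline stages stated above, whereas a chain of accumulating cells would need a number of steps proportional to the number of inputs.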
(4) Memory Mapping: To perform DNN computation, similar to prior work, we convert the computations of DNN layers to general matrix-matrix multiplication (GEMM) [62, 63]. Then, we create a memory layout, as shown in Figure 8b. To assist the smooth streaming of data from memory to the systolic array, we map data to sequential addresses in the physical memory. Figure 8b shows an example of a GEMM (A × B), the operands of which are stored in the memory. Since the depth of the systolic array is 32, we divide each operand into 32-element chunks and map them to consecutive addresses with their block index (i) and type (i.e., for discerning whether an operand resides in R1 registers or in buffers). The length value of the streaming operands is also saved along with the operands. Such partitioning is done by a simple and low-overhead heuristic algorithm (as shown in Figure 8b) that partitions both the A and B matrices across their common dimension. As illustrated, the width of the systolic array defines the level of memory parallelism. Unlike prior work, the array in our design is directly connected to the memory. Since our design stores and reuses the intermediate results within the multipliers, we remove the large SRAM buffers found in similar prior work such as TPU, Eyeriss, and EIE.
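Partitioning a GEMM across the common dimension of its operands can be sketched in software as follows. This is an illustrative sketch, not the paper's heuristic or memory layout: the function names and the test matrices are assumptions, and only the chunk size of 32 is taken from the text. Each chunk pair yields a partial product, and the partial products are summed to form the final result.

```python
CHUNK = 32  # chunk size taken from the systolic-array depth stated in the text


def matmul(A, B):
    """Reference GEMM for list-of-rows matrices."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def chunked_matmul(A, B, chunk=CHUNK):
    """Split A (by columns) and B (by rows) along their common dimension,
    multiply matching chunks, and accumulate the partial sums."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for start in range(0, inner, chunk):
        stop = min(start + chunk, inner)
        A_blk = [row[start:stop] for row in A]   # rows x (<= chunk)
        B_blk = B[start:stop]                    # (<= chunk) x cols
        partial = matmul(A_blk, B_blk)           # one small GEMM
        for i in range(rows):
            for j in range(cols):
                C[i][j] += partial[i][j]         # final summation of partial sums
    return C


# Deterministic integer operands with a common dimension of 70 (not a multiple of 32)
A = [[(i + k) % 7 - 3 for k in range(70)] for i in range(3)]
B = [[(k * j) % 5 - 2 for j in range(4)] for k in range(70)]
print(chunked_matmul(A, B) == matmul(A, B))
```

The last chunk is shorter than 32 when the common dimension is not a multiple of the chunk size, which is why a length value has to travel with the operands.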
(5) Simple Indexing Logic: During execution, for each element, the indexing logic generates the appropriate row and column indices of the element, using the index (i) and the length, to accompany the result. The row and column indices are later used by the memory interface to write the result to the corresponding physical locations in memory. By comparing the length and the index (i), the end of the operations in the current layer is detected. The end of the current layer signals the start of the activation and pooling functions for that layer. Since the stationary operands might not fit in R1 registers, as shown in our example in Figure 8b for B, we save partial sums into the memory and later perform the final summation by reading the partial sums. The reading from memory is interleaved between the main read and write operations with no extra overhead.
(6) Data-Driven Execution Model: Our design uses a data-driven execution model, in which data is pushed by the memory to the systolic array and adder trees are triggered by the arrival of data. In this approach, no instruction is used. Instead, the sequence of operand arrivals indicates the sequence of operations. To keep the right sequence of GEMM operations, we use a table to indicate the sequence of the physical locations of the GEMM operands for all the layers. The content of the table is programmed by the host by breaking down the matrix operands of large GEMMs into many small GEMMs (see Memory Mapping). The content of the table is then tracked row by row to stream the right operands into the systolic array at the right times. After loading one of the operands into the stationary buffers, we pass the remaining operand through the other input of the multiplier, the R1 registers. Once a result is ready, the content of the table is used to write it back to the appropriate physical location. Note that the operations of the next layer are fired once all the GEMMs of the current layer are completed. In this way, the content of the table is also used to guarantee the dependency between sequential operations. Since we are performing inference, we overwrite the results of a layer after reading them once.
IV. EXPERIMENTAL STUDIES
This section shares our experimental results for ETP models, real-world experiments with RPi and PYNQ, edge FPGA implementation, and ASIC estimations. At the start of each subsection, the setup of related experiments is provided.
A. ETP Models
Training Specifications: We train all the models, including the original models, from scratch to conduct a fair comparison. Normalization [70] layers are included to enhance learning. The training is done with an exponentially decaying learning rate with a decay factor of 0.94, an initial learning rate of 1e−2, 2 or 10 epochs per decay, a dropout rate of 50%, and L2 regularization with a weight decay of 5e−4. We use the ADAM optimizer [71] with β1 = 0.9 and β2 = 0.99. All biases are initialized to zeros and all weights are initialized with a normal distribution with a mean of 0 and a standard deviation of 4e−2. All of our models are trained until the loss flattens or for at least 12 epochs. Test and accuracy measurements are done on at least 10% of each dataset that has never been used in training, to provide an unbiased evaluation of the final model. For ETP, the Division Factor, F, and ε are 2, 10%, and ≈ 3%, respectively.
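The learning-rate schedule stated above can be written out as a small function. This is a sketch under an assumption: the text does not say whether the decay is applied in discrete steps (every 2 or 10 epochs) or continuously, so the `staircase` flag and the function name `learning_rate` are introduced here for illustration, with the stepwise form as the default.

```python
def learning_rate(epoch, initial_lr=1e-2, decay_factor=0.94,
                  epochs_per_decay=2, staircase=True):
    """Exponentially decayed learning rate.

    staircase=True  : lr drops by decay_factor once every epochs_per_decay epochs.
    staircase=False : lr decays smoothly with the same overall rate.
    """
    if staircase:
        exponent = epoch // epochs_per_decay
    else:
        exponent = epoch / epochs_per_decay
    return initial_lr * decay_factor ** exponent


for epoch in (0, 1, 2, 4, 12):
    print(epoch, learning_rate(epoch))
```

With `epochs_per_decay=10`, the same function covers the slower schedule mentioned in the text.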
Datasets & Models:
We use the following datasets: (1) MNIST [42], which contains 70k greyscale handwritten 28x28 images in 10 classes; (2) CIFAR10 [43], which contains 60k colored 32x32 images in 10 classes; (3) CIFAR100 [43], which contains 60k colored 32x32 images in 100 classes; (4) Flower102 [44], which contains 16,378 colored 224x224 images of flowers in 102 classes; and (5) ImageNet [45], which contains 1.33 M colored 224x224 images in 1000 classes, with a total size of 140 GB. For each dataset we use representative models: LeNet [54], LeNet-FC [54], VGG-S [72], CifarNet [43], VGG16 [72], AlexNetv2 [64], and ResNet-50 [73]. We use proof-of-concept models to explore various design options and then use ImageNet models. In total, for brevity, we only report 50 instances of training results to show ETP extensibility using 5 datasets and 8 models. Figure 9. The results follow the same trend.
splitting renders a layer useless (e.g., kernel size of 4x4 over an input size of 3x3). Table II lists our models' descriptions and training results. Figure 9a illustrates the accuracy difference of our models, shown in Table II. As shown, the maximum accuracy drop is around 5% for CifarNet. Note that this accuracy drop occurs when we reduce the parameter size of our model extensively (to around 1/8). Figures 9b and 9c show the reduction in the number of parameters and computations compared with the original DNN model; as seen, each split reduces both by a factor roughly equal to the split factor. This is because each convolution and fully-connected layer in the split version creates fewer outputs; therefore, the next layer requires fewer parameters. We restore the accuracy of ETP models with a slight increase in the size of each branch in the next section. Split-Fattened Models: Fewer connections in split-only models may cause accuracy loss. Accuracy is a defining factor in several applications. Thus, we provide a remedy to restore the accuracy of split-only models. By fattening (i.e., adding more parameters to) each branch, we aim to create larger layers in the split-only models. To do so, for each layer (excluding the classification layer) in every branch, we increase its output size (or width) by a fraction. So, fattening by 20% means the size of the output in each layer is increased by 1.2x. The output size is the number of output elements for fully-connected layers and the number of filters for convolution layers. We fatten every branch by 10%, 20%, 30%, and 40%. Our experiments focus on split8 experiments, which have the highest accuracy drops in CifarNet and VGG-S. Figure 10 depicts a summary of these models. As seen, 40% split-fattened models have higher accuracy than the original model while having fewer parameters and MAC operations. 
On average (for 30% and 40% models), with 4.61x-3.81x fewer parameters and 2.95x-2.5x fewer MAC operations, split-fattened models achieve similar accuracy, while they jointly optimize memory, computation, and communication loads for distributed edge computation. Large-Scale Models: Table III illustrates the results of ImageNet-based models. For the sake of brevity, we only show split8 and one fattened model. As shown, f40 models restore the accuracy to within 3% of the original model. The tradeoff for the 3% accuracy loss is about 4x fewer parameters, 4x fewer computations, and 8x less communication load (vs. model parallelism). Figure 11 presents a comparative analysis of the communication load between distributed original models with model parallelism and distributed ETP models. Since ETP models avoid communication between their branches, the communication load is reduced significantly. In short, as seen from the layer descriptions in Table II, split models are more complex than the original models in terms of the number of layers and neuron connections. Nevertheless, such complexity enables us to jointly optimize ETP models for edge computing.
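The parameter savings of splitting and fattening can be checked with back-of-the-envelope arithmetic for a single hidden convolution layer. The sketch below is an idealized estimate, not the procedure used to build Table II: the function names are hypothetical, biases are ignored, and it assumes that both the input and output channels of the layer are divided across branches and then scaled by the fattening fraction. Whole-model ratios (such as the 3.81x reported for 40% fattening) differ because the first and classification layers do not scale this way.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def split_fattened_params(kernel, c_in, c_out, branches, fatten=0.0):
    """Total weights across all branches of a split (and fattened) layer.

    Assumption: each branch receives c_in / branches input channels and
    produces c_out / branches filters, both widened by (1 + fatten).
    """
    b_in = round(c_in / branches * (1 + fatten))
    b_out = round(c_out / branches * (1 + fatten))
    return branches * conv_params(kernel, b_in, b_out)


original = conv_params(3, 256, 256)
split8 = split_fattened_params(3, 256, 256, branches=8)
split8_f40 = split_fattened_params(3, 256, 256, branches=8, fatten=0.4)

reduction_split8 = original / split8          # equals the split factor
reduction_split8_f40 = original / split8_f40  # roughly 8 / 1.4**2
```

For this layer, split8 alone gives an 8x reduction, and fattening by 40% brings it to roughly 4x, which is in line with the "about 4x fewer parameters" reported for the f40 models.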
B. Real-World Data
RPi Experiments Setup: To study the benefits of ETP models versus only model-parallelism methods, we deploy several models on a system of interconnected Raspberry Pi 3s (RPis). Figure 13 depicts the measured energy per inference for RPi implementations. To compare with previous related work, SplitNets [76], Figure 12 presents the performance of SplitNets models for AlexNet with different configurations. As seen, the performance is worse than that of ETP models. This is because SplitNets creates more merging/synchronization points with its tree-structured model design. The resulting model introduces exponentially more merging/synchronization points as depth increases, and it does not split all the layers. Finally, SplitNets performs parallelization based on dataset semantics, which means every dataset and model needs to be manually split (refer to §VI). TVM Experiments on PYNQ: As a real-world example of an edge FPGA implementation, we use TVM [82] on the PYNQ [83] board. PYNQ is designed for embedded applications. We use the TVM VTA stack on the PYNQ as the architecture (RISC-style instructions) and only change the models (ResNet-18 vs. ETP ResNet-18 Split2 with <1 accuracy drop). This way, we can measure the benefits of ETP models without relying on any specially tailored hardware. Our performance results cover the entire system pipeline, from a live camera feed to prediction output, on two boards versus one board. Figure 14a depicts a 2.7x speedup in latency, including all communication and system overheads, network latency, and jitter. This is because ETP models are parallelized on two devices and, in total, they have lower computation and memory footprints. The measured reduction in memory footprint is shown in Figure 14b.
C. Edge Tailored Hardware on Edge FPGA
FPGA Experiments Setup: The PYNQ FPGA is an SoC with a dual-core Cortex-A9 processor at 650 MHz and 512 MB of DDR3 memory. 
The FPGA on the PYNQ board is from the Artix-7 family (Zynq series) with 13,300 logic slices, 220 DSP slices, and 630 KB BRAM. Communication for multiple devices is estimated with the network provided in §IV-B. We implement our tailored microarchitecture using Xilinx Vivado HLS and verify the functionality of our implementation using regression tests. We use relevant #pragma directives as hints to describe our desired microarchitectures in C++. We synthesize the designs on a Zynq XC7Z020 FPGA and report post-implementation (i.e., place & route) performance numbers and resource utilizations. Inputs and outputs of our design are transferred through the AXI stream interface. The clock frequency is set to 100 MHz. FPGA Performance: Figure 15 shows the experiment results for our edge-tailored hardware. The latency per image is depicted in Figure 15a, with improvement in communication overhead versus model parallelism methods (86% and 60% for 8split and 4split). Depending on the model, the inference latency on a single device is between 4-29ms, a 221-325x speedup compared to RPi results for AlexNet and VGG16. Our designed ETP models achieve acceptable performance for edge computing, which is 10s of inferences per second, around 10-1ms. As observed, the accuracy loss of our split-only models can be easily restored by fast split-fattened models of f40 with a negligible performance overhead (maximum of 20 ms). Figure 15b illustrates the speedup numbers over one device. The ideal linear speedup line shows perfect scaling as more devices become available. As shown, we achieve superlinear speedups. An important parameter in scaling is how the overheads scale. The superlinear speedup stems from the dramatic reduction of communication overhead as parallelism increases. In traditional data and model parallelism, such overhead increases, which causes sublinear speedup. Figure 17 compares latency per image for ETP and model parallelism. 
On average, ETP models are 3.76x, 8.89x, and 7.17x faster than their model parallelism counterparts for AlexNet, VGG16, and ResNet-50 (4 and 8 devices), respectively. ETP achieves maximum and average speedups of 56x and 7x, compared to the originals (Figure 16, base bars). Quantization & Pruning: As mentioned in §V and §I, techniques that reduce the footprint of DNNs can be applied to each individual ETP branch. Essentially, during these optimizations, the target output for each ETP branch is its pre-final activations. We study the benefits of lossless quantization and structured pruning on top of our ETP models. Based on our experiments, with 3.13 (<integer.fraction>) quantization, our models do not lose accuracy. Similarly, applying structured pruning [84], from which systolic arrays benefit, reduces the size of parameters by 40%-50% per convolution layer without an accuracy drop. Other pruning algorithms increase the sparsity in the data, which is not necessarily beneficial for systolic arrays. Figure 16 depicts the speedup gained from these techniques normalized to the baseline implementation for each model, the execution performance of which is shown in Figure 15a. Quantization and pruning by themselves improve the performance of the original models by 1.96x and 2.2x, respectively, and by 4.31x when applied together. When quantization and pruning are combined with ETP, the overall performance speedup becomes 14.41x and 16.31x, respectively. Compared to the original models, ETP + quantization & pruning achieves up to 244x speedup (VGG16-split8), and an average of 33x (across all models and variants).
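To make the 3.13 (<integer.fraction>) format concrete, the sketch below rounds a real value to a 16-bit fixed-point number with 13 fractional bits. It is an illustration under stated assumptions rather than the quantizer used in the experiments: the function name is hypothetical, the sign is assumed to be carried inside the 3 integer bits (two's complement), and out-of-range values are assumed to saturate.

```python
def quantize_fixed_point(x, int_bits=3, frac_bits=13):
    """Round x to the nearest representable fixed-point value (sketch).

    Assumptions: two's complement with the sign bit counted among the
    integer bits, round-to-nearest, and saturation on overflow.
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = max(q_min, min(q_max, round(x * scale)))
    return q / scale


weights = [0.5, 0.1, -0.04, 10.0, -10.0]
quantized = [quantize_fixed_point(w) for w in weights]
```

Under these assumptions the representable range is [-4, 4) with a step of 2^-13, so weights initialized with a small standard deviation (such as the 4e−2 used in training) are represented with a rounding error of at most 2^-14.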
D. ASIC Estimations
We configure and model a 32×64 systolic array connected to LPDDR2 memory with a data rate of 933 Mb/s/pin @466 MHz [85]. In other words, with our edge-tailored design (§III-B), we trade large SRAM buffers and systolic arrays for energy efficiency.
V. RELATED WORK
We review related techniques to reduce the high computational and memory demands of DNNs [28, 29], some of which are specific to resource-constrained devices. We discuss studies on distributing the computation of DNNs next, and finally, review efforts on DNN hardware accelerators. Techniques without Changing Model Architecture: Several techniques have been developed to reduce the computation and memory footprint of DNNs without changing the network architecture. In weight pruning [27, [88] [89] [90] [91], the close-to-zero weights are pruned and the remaining weights are retrained. It has also been shown that moderate pruning does not affect the accuracy [27]. In quantization and low-precision inference [41, 58, 59, 92, 93], the representations of numbers are changed, which results in simpler calculations. Several methods have also been proposed for resource partitioning [94, 95] and binarizing the weights [96] [97] [98]. Binarizing weights hurts accuracy. Several of the aforementioned techniques are orthogonal to ETP models. In fact, they can be applied to each branch to further reduce the computational and memory cost (§IV-C).
Techniques that Change Model Architecture: With the prevalence of IoT and edge devices, specific frameworks such as the ELL library [61] (see Figure 2) by Microsoft and TensorFlow Lite [99] have been developed by industry. Several proposals have also developed mobile-specific models [60, 74, [100] [101] [102]. The common approach is to handcraft more efficient operations or models. The objective is to reduce the number of parameters [60], create efficient operations to minimize computation density [100], or use resource-efficient connections [102]. Unlike ETP models, all these models have a single chain of dependency [103] that prevents efficient parallelism. Moreover, several of the models trade off state-of-the-art accuracy for efficiency [102]. We survey SplitNets [76] and SqueezeNet [60] in Figure 12. Recently, with the growing interest in automating the design process [103] [104] [105] [106], learning new networks for mobile devices has also gained attention by integrating the constraints of mobile platforms (i.e., latency). These attempts are still limited to single-device execution, whereas our paper targets designing models for distributed edge systems. In summary, these related works (1) have a high design cost (i.e., they target only one specific model and dataset without extensibility); (2) target single mobile platforms; and (3) do not consider inter-layer parallelism and communication challenges. Distributing DNN Inference Computations: With large DNN models, distributing a single model has gained the attention of researchers [5, 14, 38, 39, 107, 108]. Usually, the distribution is done in a high-performance computing domain with different goals in mind. In the domain of edge and resource-constrained devices, Neurosurgeon [39] dynamically partitions a DNN model between a single edge device and the cloud. DDNN [107] partitions the model between edge devices and the cloud but uses data parallelism. Hadidi et al. 
[5, 14] investigate distribution across robots with model-parallelism methods. In their results, the communication barrier shows up as diminishing returns in performance with a large number of devices. ETP models go beyond model-parallelism methods and enable efficient distribution that is not examined in the above studies.
Edge-Targeted and Systolic-Based Hardware: Several studies explored in/near-the-edge DNN computation without proposing new hardware [27, 39, 109, 110] by training new models, proposing collaboration techniques with the cloud, or applying several device-specific and model-specific techniques. A state-of-the-art systolic-based DNN accelerator is TPUv2 [69], which provides a peak of 180 TFLOPS by employing four dual-core chips, each connected to an 8 GB HBM package at 300 GB/s. Many other recent deep-learning accelerators utilize systolic-array concepts [51, 66-69, 111, 112], which increase the performance of inference by utilizing sparsity, reducing memory accesses by exploring access patterns, or employing weight-stationary architectures. Several studies also target FPGA/ASIC implementations for DNNs [113] [114] [115] [116] [117] [118].
These studies investigate the execution of the entire model on a single device with no resource constraints, whereas our focus is enabling the distribution of inference on several devices.
VI. DISCUSSIONS
Intuition Behind ETP: We conjecture that ETP models achieve good performance because (1) independent branches can learn more complex non-overlapping features independently within a small search space, whereas original models need to create the same complex features from a higher-dimensional feature search. We observe that each branch eventually learns an almost disjoint feature representation. (2) Compared to the original models, gradient descent updates in split models reach early layers more efficiently because there are fewer parameters along their route. SplitNets/SqueezeNet: SplitNets [76] shares the same philosophy of splitting models. SplitNets splits the model based on dataset semantics, and the split is applied only to the last few layers. As we compared in §IV by implementing their paper setup for AlexNet, SplitNets achieves low performance. The main disadvantages are two-fold: first, the architecture cannot be parallelized for shallow layers, which causes longer latency, especially in deeper models like ResNet-50 and AlexNet; and second, the parameter reduction is limited to shallow models, such as ResNet-18. SqueezeNet [60] achieves an accuracy similar to that of AlexNet with fewer parameters by using new compute-heavy Fire modules. SqueezeNet trades parameters for computations. In fact, it requires 860M MAC operations, whereas our distributed AlexNet requires only 240M MAC operations (see Figure 12). Also, reducing the number of parameters does not necessarily correlate with reducing the memory footprint in inferencing. In fact, for SqueezeNet, we observe a 12x increase in the number of activations (12.58 M in SqueezeNet vs. 1.39 M in AlexNet). Skip/Residual Connections: As shown in §IV-A, our procedure also applies to more complex models with residual and skip connections. Simply put, each branch has similar connections, but with smaller depth. 
Other mobile-specific models designed in a similar fashion, such as ShuffleNet [101], trade off accuracy for faster execution, besides having a high design cost per model and dataset pair. As Figures 2 and 12 show, for mobile-specific models, the latency on edge devices, which have fewer compute resources than mobile platforms, is still high. In short, we face the single-device Pareto frontier (§II-B).
Alleviating Large Memory Footprints: Sometimes large memory footprints are necessary and access to the next levels of the storage system is enforced. In our design (§III-C), such accesses do not cause slowdown because data is stored in sequential addresses (i.e., streaming [119]), and we overlap data transfer and computations for independent elements. Thus, the execution model only needs a basic memory technology that simultaneously allows reading/writing from/to two non-overlapping memory locations. Memory Layout Preprocessing: Our simple algorithm to change the storage format runs in O(N) (§III-C(4)). Therefore, the host preprocessing for reordering the data can be done in a single pass while writing the data to the memory. System-Level Choices: ETP works in conjunction with other technologies available today. ETP does not replace these technologies, but rather enables the exploitation of local edge devices. In several cases, relying on cloud-based offloading for accuracy-critical tasks is necessary, whereas in several others it is not. In such cases, ETP models provide an alternative solution. Moreover, if a node fails, various conventional redundancy techniques (such as coded computations [120]) can be applied for recovery.
VII. CONCLUSIONS
We proposed edge-tailored perception models, ETP, designed for efficient in-the-edge distribution. ETP models optimize communication while reducing memory and computation by utilizing several narrow independent branches. We presented our results on the accuracy of ETP models. We tailored a TPU-like systolic architecture for edge computing, shared our insights, and provided implementation results on an edge-based FPGA. Additionally, we conducted experiments on a system of ten Raspberry Pis and two PYNQ boards.
