Running deep learning applications on resource constrained devices
The high accuracy of Deep Neural Networks (DNNs) comes at the expense of high computational cost and memory requirements. During inference, input data is often collected on edge devices, which are resource-constrained. The existing solutions for edge deployment include i) executing the entire DNN on the edge (EDGE-ONLY), ii) sending the input from the edge to the cloud, where the DNN is processed (CLOUD-ONLY), and iii) splitting the DNN to execute partially on the edge and partially on the cloud (SPLIT). The choice among EDGE-ONLY, CLOUD-ONLY, and SPLIT is determined by operating constraints such as device resources and network speed, and by application constraints such as latency and accuracy. The EDGE-ONLY approach requires compact DNNs with low compute and memory requirements. Thus, an emerging class of DNNs employs low-rank convolutions (LRCONVs), which reduce one or more dimensions compared to spatial convolutions (CONVs). Prior research in hardware accelerators has largely focused on CONVs. LRCONVs such as depthwise and pointwise convolutions exhibit lower arithmetic intensity and lower data reuse, resulting in low hardware utilization and high latency. In our first work, we systematically explore the design space of cross-layer dataflows that exploit data reuse across layers for emerging DNNs in EDGE-ONLY scenarios. We develop novel fine-grain cross-layer dataflows for LRCONVs that support partial loop dimension completion. Our tool, X-Layer, decouples the nested loops in a pipeline and combines them to create a common outer dataflow and several inner dataflows. The CLOUD-ONLY approach can suffer from high latency due to the cost of transmitting large input data from the edge to the cloud, which is especially problematic for latency-critical applications. The SPLIT approach reduces latency compared to CLOUD-ONLY. However, existing solutions only split the DNN in floating-point precision.
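To make the compute gap concrete, the following minimal sketch compares multiply-accumulate (MAC) counts for a standard spatial convolution against its depthwise-plus-pointwise factorization. The layer shape used here is a hypothetical example for illustration, not one taken from the work above.

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard k x k convolution over an h x w x c_in input
    (stride 1, 'same' padding assumed, so the output stays h x w)."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a depthwise k x k convolution (one filter per channel)
    followed by a pointwise 1 x 1 convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 56x56 feature map, 128 -> 128 channels, 3x3 kernel.
h, w, c_in, c_out, k = 56, 56, 128, 128, 3
std = conv_macs(h, w, c_in, c_out, k)
sep = depthwise_separable_macs(h, w, c_in, c_out, k)
print(f"standard CONV MACs:       {std:,}")
print(f"depthwise+pointwise MACs: {sep:,} ({std / sep:.1f}x fewer)")
```

The large reduction in MACs is exactly what makes these layers attractive for EDGE-ONLY deployment; the flip side, as the abstract notes, is that each fetched weight and activation participates in far fewer MACs, so arithmetic intensity and data reuse drop.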
Executing a DNN in floating-point precision on the edge device can occupy large amounts of memory and reduce the potential options for SPLIT solutions. In our second work, we expand and explore the search space of SPLIT solutions by jointly applying mixed-precision post-training quantization and DNN graph splitting. Our work, Auto-Split, finds a balance in the trade-off among model accuracy, edge device capacity, transmission cost, and overall latency.
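The basic split-point trade-off can be sketched with a simple cost model: total latency is edge compute up to the split, plus transmission of the intermediate tensor, plus cloud compute for the remainder. Note that Auto-Split itself additionally searches per-layer bit-widths and respects device capacity; the sketch below covers only the latency term, and all layer timings, tensor sizes, and bandwidth values are invented for illustration.

```python
def split_latency(edge_ms, cloud_ms, tensor_bytes, bandwidth_bps, split):
    """Latency (ms) when layers [0, split) run on the edge and the rest on
    the cloud. split == 0 is CLOUD-ONLY; split == len(edge_ms) is EDGE-ONLY.
    tensor_bytes[i] is the size of the tensor entering layer i (tensor_bytes[0]
    is the raw input; tensor_bytes[-1] is the final, typically tiny, output)."""
    edge = sum(edge_ms[:split])
    cloud = sum(cloud_ms[split:])
    tx = tensor_bytes[split] * 8 / bandwidth_bps * 1000.0  # bits -> ms
    return edge + tx + cloud

def best_split(edge_ms, cloud_ms, tensor_bytes, bandwidth_bps):
    """Exhaustively pick the split point with the lowest total latency."""
    n = len(edge_ms)
    return min(range(n + 1),
               key=lambda s: split_latency(edge_ms, cloud_ms, tensor_bytes,
                                           bandwidth_bps, s))

# Illustrative 4-layer model: the edge is slow, the cloud is fast, and
# intermediate tensors shrink layer by layer, favoring a late split.
edge_ms = [5, 8, 12, 20]
cloud_ms = [1, 2, 3, 5]
tensor_bytes = [600_000, 200_000, 50_000, 10_000, 4_000]
print(best_split(edge_ms, cloud_ms, tensor_bytes, 10e6))  # → 3
```

Quantizing the edge-side layers (as Auto-Split does) shrinks both the model's memory footprint and the intermediate tensors, which is what opens up split points that a floating-point-only search would reject.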
ReDy: A Novel ReRAM-centric Dynamic Quantization Approach for Energy-efficient CNN Inference
The primary operation in DNNs is the dot product of quantized input
activations and weights. Prior works have proposed the design of memory-centric
architectures based on the Processing-In-Memory (PIM) paradigm. Resistive RAM
(ReRAM) technology is especially appealing for PIM-based DNN accelerators due
to its high density to store weights, low leakage energy, low read latency, and
high performance capabilities to perform the DNN dot-products massively in
parallel within the ReRAM crossbars. However, the main bottleneck of these
architectures is the energy-hungry analog-to-digital converters (ADCs)
required to perform analog computations in-ReRAM, which penalize the
efficiency and performance benefits of PIM. To improve the energy efficiency of
in-ReRAM analog dot-product computations, we present ReDy, a hardware
accelerator that implements a ReRAM-centric Dynamic quantization scheme to take
advantage of the bit serial streaming and processing of activations. The energy
consumption of ReRAM-based DNN accelerators is directly proportional to the
numerical precision of the input activations of each DNN layer. In particular,
ReDy exploits that activations of CONV layers from Convolutional Neural
Networks (CNNs), a subset of DNNs, are commonly grouped according to the size
of their filters and the size of the ReRAM crossbars. Then, ReDy quantizes
on-the-fly each group of activations with a different numerical precision based
on a novel heuristic that takes into account the statistical distribution of
each group. Overall, ReDy greatly reduces the activity of the ReRAM crossbars
and the number of A/D conversions compared to a static 8-bit uniform
quantization. We evaluate ReDy on a popular set of modern CNNs. On average,
ReDy provides 13% energy savings over an ISAAC-like accelerator with
negligible accuracy loss and area overhead. (Comment: 13 pages, 16 figures, 4 tables)
Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead
Machine Learning (ML) is becoming ubiquitous in everyday life. Deep Learning (DL) is already present in many applications, ranging from computer vision for medicine to autonomous driving of modern cars, as well as other sectors such as security, healthcare, and finance. However, to achieve impressive performance, these algorithms employ very deep networks, requiring significant computational power both during training and inference. A single inference of a DL model may require billions of multiply-and-accumulate operations, making DL extremely compute- and energy-hungry. In a scenario where several sophisticated algorithms need to be executed with limited energy and low latency, the need arises for cost-effective hardware platforms capable of energy-efficient DL execution. This paper first introduces the key properties of two brain-inspired models, the Deep Neural Network (DNN) and the Spiking Neural Network (SNN), and then analyzes techniques to produce efficient and high-performance designs. This work summarizes and compares efforts targeting the four leading platforms for executing these algorithms (CPU, GPU, FPGA, and ASIC), describing the main state-of-the-art solutions and giving particular prominence to the last two, since they offer greater design flexibility and bear the potential for high energy efficiency, especially for inference. In addition to hardware solutions, this paper discusses some of the important security issues that DNN and SNN models may face during their execution, and offers a comprehensive section on benchmarking, explaining how to assess the quality of different networks and of the hardware systems designed for them.