194 research outputs found

    An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration

    Get PDF
    We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect of environmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%.Comment: To appear at the DSN 2020 conferenc

    HPIPE: Heterogeneous Layer-Pipelined and Sparse-Aware CNN Inference for FPGAs

    Full text link
    We present both a novel Convolutional Neural Network (CNN) accelerator architecture and a network compiler for FPGAs that outperforms all prior work. Instead of having generic processing elements that together process one layer at a time, our network compiler statically partitions available device resources and builds custom-tailored hardware for each layer of a CNN. By building hardware for each layer we can pack our controllers into fewer lookup tables and use dedicated routing. These efficiencies enable our accelerator to utilize 2x the DSPs and operate at more than 2x the frequency of prior work on sparse CNN acceleration on FPGAs. We evaluate the performance of our architecture on both sparse Resnet-50 and dense MobileNet Imagenet classifiers on a Stratix 10 2800 FPGA. We find that the sparse Resnet-50 model has throughput at a batch size of 1 of 4550 images/s, which is nearly 4x the throughput of NVIDIA's fastest machine learning targeted GPU, the V100, and outperforms all prior work on FPGAs.Comment: 8 Pages, 11 Figure

    Efficient Hardware Implementation of Deep Learning Networks Based on the Convolutional Neural Network

    Get PDF
    Image classification, speech processing, autonomous driving, and medical diagnosis have made the adoption of Deep Neural Networks (DNN) mainstream. Many deep networks such as AlexNet, GoogleNet, ResidualNet, MobileNet, YOLOv3 and Transformers have achieved immense success and popularity. However, implementing these deep and complex networks in hardware is a challenging feat. The growing demand of DNN applications in mobile devices and data centers have led the researchers to explore application specific hardware accelerators for DNNs. There have been numerous hardware and software based solutions to improve DNN throughput, latency, performance and accuracy. Any solution for hardware acceleration needs to optimize in a space confined by these metrics. Hardware acceleration of Deep Neural Networks (DNN) is a highly effective and viable solution for running them on mobile devices. The power of DNN is now available at the edge in a compact and power-efficient form factor because of hardware acceleration. In this thesis, we introduce a novel architecture that uses a generalized method called Single Input Partial Product 2-Dimensional Convolution (SIPP2D Convolution) which calculates a 2-D convolution in a fast and expedient manner. We present the exploration designs that have culminated into SIPP2D and emphasize its benefits. SIPP2D architecture prevents the re-fetching of input weights for the calculation of partial products. It can calculate the output of any input size and kernel size with a low memory-traffic while maintaining a low latency and high throughput compared to other popular techniques. In addition to being compatible with any input and kernel size, SIPP2D architecture can be modified to support any allowable stride. We describe the data flow and algorithmic modifications to SIPP2D which extends its capabilities to accommodate multi-stride convolutions. Supporting multi-stride convolutions is an essential feature addition to SIPP2D architecture, increasing its versatility and network agnostic character for convolutional type DNNs. Along with architectural explorations, we have also performed research in the area of model optimization. It is widely understood that any change on the algorithmic level of the network pays significant dividends at the hardware level. Compression and optimization techniques such as pruning and quantization help reduce the size of the model while maintaining the accuracy at an acceptable level. Thus, by combining techniques such as channel pruning with SIPP2D we can only boost its performance. In this thesis, we examine the performance of channel pruned SIPP2D compared to other compressed models. Traditionally, quantization of weights and inputs are used to reduce the memory transfer and power consumption. However, quantizing the outputs of layers can be a challenge since the output of each layer changes with the input. In our research, we use quantization on the output of each layer for AlexNet and VGGNet-16 to analyze the effect it has on accuracy. We use Signal to Noise Quantization Ratio (SQNR) to empirically determine the integer length (IL) as well as the fractional length (FL) for the fixed point precision that can yields the lowest SQNR and highest accuracy. Based on our observations, we can report that accuracy is sensitive to fractional length as well as integer length. For AlexNet, we observe deterioration in accuracy as the word length decreases. The Top -5 accuracy goes from 77% for floating point precision to 56% for a WL of 12 and FL of 8. The results are similar in the case of VGGNet-16. The Top-5 accuracy for VGGNet-16 decreases from 82% for floating point to 30% for a WL of 12 and FL of 8. In addition to the small word length, we observe the accuracy to be highly dependent on the integer length as well as the fractional length. We have also done analysis on the loss after retraining post quantization. We use polynomial fitting to achieve a relationship with fractional length and the drop in accuracy still sustained after retraining a quantized network. In summary, the winning combination of the enhanced SIPP2D architecture and compression techniques such as channel pruning and quantization techniques is highly advantageous and conducive to widespread adoption. SIPP2D architecture, with its flexible data flow and algorithmic modifications to support multi-stride convolutions, offers a powerful and versatile framework for deep neural networks
    • …
    corecore