
    ShortcutFusion: From Tensorflow to FPGA-based accelerator with reuse-aware memory allocation for shortcut data

    The residual block is a common component in recent state-of-the-art CNNs such as EfficientNet or EfficientDet, and shortcut data accounts for nearly 40% of the feature-map accesses in ResNet152 [8]. Most previous DNN compilers and accelerators nevertheless ignore shortcut data optimization. This paper presents ShortcutFusion, an optimization tool for FPGA-based accelerators with reuse-aware static memory allocation for shortcut data, which maximizes on-chip data reuse under resource constraints. From a TensorFlow DNN model, the proposed design generates instruction sets for groups of nodes that apply an optimized data-reuse scheme to each residual block. The accelerator design, implemented on a Xilinx KCU1500 FPGA card, significantly outperforms the NVIDIA RTX 2080 Ti, Titan Xp, and GTX 1080 Ti for EfficientNet inference; compared to the RTX 2080 Ti, it is 1.35-2.33x faster and 6.7-7.9x more power efficient. Compared to a baseline in which weights, inputs, and outputs are accessed from off-chip memory exactly once per layer, ShortcutFusion reduces DRAM accesses by 47.8-84.8% for RetinaNet, YOLOv3, ResNet152, and EfficientNet. Given a buffer size similar to that of ShortcutMining [8], which also mines shortcut data in hardware, the proposed work reduces off-chip feature-map accesses by 5.27x while accessing weights from off-chip memory exactly once.
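    As a rough illustration of the reuse idea, the sketch below (Python; not the authors' tool — the traffic model, function name, and buffer size are assumptions) contrasts the off-chip traffic caused by one residual block's shortcut tensor when it round-trips through DRAM versus when it is pinned in a statically allocated on-chip buffer.

        def shortcut_dram_traffic(x_bytes, buffer_bytes, fused):
            """Off-chip traffic (bytes) caused by one residual block's shortcut tensor."""
            if fused and x_bytes <= buffer_bytes:
                return 0               # shortcut is held on chip for the whole block
            return 2 * x_bytes         # one DRAM write by the producer + one read by the add

        x = 4 << 20                    # an assumed 4 MiB feature map
        print(shortcut_dram_traffic(x, 8 << 20, fused=False))  # baseline: 8388608 bytes moved
        print(shortcut_dram_traffic(x, 8 << 20, fused=True))   # reuse-aware: 0 bytes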

    The Impact of Single Event Effect Reliability of Convolution Neural Network Architectures and Hardening Approaches Implemented on SRAM FPGA

    Convolutional neural networks (CNNs) have powerful data-processing and learning capabilities and have been widely applied to image processing, especially in autonomous driving, medical image classification, space exploration, and military applications. Thanks to their low power consumption, high flexibility, and parallelism, modern field-programmable gate arrays (FPGAs) are frequently used as hardware acceleration platforms for CNNs. Two architectures dominate CNN implementation on FPGAs: the streaming architecture and the single computation engine (SCE) architecture. In the streaming architecture, each layer of the CNN is implemented as a distinct hardware block, and each block can be optimized separately. The SCE architecture, in contrast, uses a systolic array of processing elements or a matrix multiplication unit as a computation engine that executes the CNN layers sequentially, with a control unit and associated software controlling the hardware and scheduling operations. The advantage of this paradigm is a fixed architectural template that can be scaled to the input CNN and the available FPGA resources, making it suitable for modern complex CNNs that may not fit the streaming architecture.

    SRAM-based FPGAs are sensitive to radiation, which can induce single event effects (SEEs) in the system, so many applications require designs that mitigate radiation effects in FPGA-based CNNs. Previous radiation-effects studies focused mainly on the streaming architecture and explored triple modular redundancy (TMR) or selective hardening techniques. To the best of the author's knowledge, there are very few radiation-effects studies of CNNs implemented with the SCE architecture on FPGAs, and none that compare the two architectures under proton irradiation. In this thesis, we implement a Modified National Institute of Standards and Technology (MNIST) CNN with both mainstream architectures, streaming and SCE, on a Xilinx Zynq UltraScale+ multiprocessor system-on-chip (MPSoC) ZCU102 evaluation kit, and we evaluate their error, hang, and total failure rates under proton irradiation at the Tri-University Meson Facility (TRIUMF). The cross-section results show that the SCE design has higher error and total-failure cross-sections than the streaming architecture, even though it uses far fewer FPGA hardware resources.

    In addition, two resilience techniques for the SCE architecture, spatial TMR and temporal TMR, are designed and adopted with the same hardware structure and utilization, either by using multiple processing elements (PEs) or by reusing a single PE to carry out each calculation. As a result, the cross-sections of the spatial-TMR and temporal-TMR SCE designs are reduced by 34.9% and 59.2%, with execution-time overheads of 14.2% and 21.4% relative to the non-hardened design, respectively. The study thus shows that the SCE architecture has excellent potential for FPGA acceleration in radiation environments, with minimal overhead thanks to its scalability and flexibility, and that spatial and temporal TMR effectively reduce the error and total-failure rates without extra hardware resources. This suggests that the spatial and temporal TMR proposed in this work are generic to the SCE architecture and may be the better redundancy choice for complex CNNs implemented under tight hardware-resource constraints.
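    The two hardening schemes lend themselves to a compact software model. The sketch below (Python; a functional model only — `pe_compute` is a hypothetical stand-in for one PE executing a unit of work, and real designs vote in hardware) shows the shared 2-of-3 majority vote and the difference between replicating the work across three PEs (spatial) and repeating it three times on one PE (temporal).

        from collections import Counter

        def vote(results):
            """2-of-3 majority vote; None signals that no majority exists."""
            value, count = Counter(results).most_common(1)[0]
            return value if count >= 2 else None

        def spatial_tmr(pe_compute, data):
            """Three distinct PEs compute the same work; a voter compares the results."""
            return vote([pe_compute(pe_id, data) for pe_id in (0, 1, 2)])

        def temporal_tmr(pe_compute, data):
            """A single PE is reused to repeat the same work three times."""
            return vote([pe_compute(0, data) for _ in range(3)])

        # Example: an upset corrupts the second of the three redundant executions.
        faulty = iter([42, 7, 42])
        print(temporal_tmr(lambda pe_id, d: next(faulty), None))  # -> 42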

    Hardware compilation of deep neural networks: an overview

    Deploying a deep neural network model on a reconfigurable platform, such as an FPGA, is challenging due to the enormous design spaces of both network models and hardware designs. A neural network model has various layer types, connection patterns, and data representations, and the corresponding implementation can be customised with different architectural and modular parameters. Rather than manually exploring this design space, it is more effective to automate optimisation throughout an end-to-end compilation process. This paper provides an overview of recent literature proposing novel approaches to achieve this aim. We organise the material to mirror a typical compilation flow: front end, platform-independent optimisation, and back end. Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies. We also review previous work on network compilation and optimisation for other hardware platforms to gain inspiration for FPGA implementation. Finally, we propose some future directions for related research.
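    To make the three stages concrete, here is an illustrative sketch (Python; the `Graph` type, pass names, and fusion rule are hypothetical, not taken from any surveyed compiler) of a front end that parses a model, a platform-independent pass that fuses conv + relu pairs, and a back end that instantiates a parameterised accelerator template.

        from dataclasses import dataclass, field

        @dataclass
        class Graph:
            layers: list = field(default_factory=list)

        def front_end(model_file: str) -> Graph:
            """Parse a trained model into an internal layer graph (stubbed here)."""
            return Graph(layers=["conv", "relu", "conv", "relu", "fc"])

        def optimise(graph: Graph) -> Graph:
            """Platform-independent pass: fuse each conv + relu pair into one layer."""
            fused, i = [], 0
            while i < len(graph.layers):
                if graph.layers[i] == "conv" and graph.layers[i + 1 : i + 2] == ["relu"]:
                    fused.append("conv_relu")
                    i += 2
                else:
                    fused.append(graph.layers[i])
                    i += 1
            return Graph(layers=fused)

        def back_end(graph: Graph, template: str) -> str:
            """Map the optimised graph onto a scalable accelerator template."""
            return f"{template} instance scheduling {len(graph.layers)} layers"

        print(back_end(optimise(front_end("model.pb")), "systolic-array"))
        # -> systolic-array instance scheduling 3 layers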