6 research outputs found

    Hardware-Efficient Structure of the Accelerating Module for Implementation of Convolutional Neural Network Basic Operation

    Full text link
    This paper presents a structural design of the hardware-efficient module for implementation of convolution neural network (CNN) basic operation with reduced implementation complexity. For this purpose we utilize some modification of the Winograd minimal filtering method as well as computation vectorization principles. This module calculate inner products of two consecutive segments of the original data sequence, formed by a sliding window of length 3, with the elements of a filter impulse response. The fully parallel structure of the module for calculating these two inner products, based on the implementation of a naive method of calculation, requires 6 binary multipliers and 4 binary adders. The use of the Winograd minimal filtering method allows to construct a module structure that requires only 4 binary multipliers and 8 binary adders. Since a high-performance convolutional neural network can contain tens or even hundreds of such modules, such a reduction can have a significant effect.Comment: 3 pages, 5 figure

    Virtualizing Reconfigurable Architectures: From Fpgas To Beyond

    Get PDF
    With field-programmable gate arrays (FPGAs) being widely deployed in data centers to enhance the computing performance, an efficient virtualization support is required to fully unleash the potential of cloud FPGAs. However, the system support for FPGAs in the context of the cloud environment is still in its infancy, which leads to a low resource utilization due to the tight coupling between compilation and resource allocation. Moreover, the system support proposed in existing works is limited to a homogeneous FPGA cluster comprising identical FPGA devices, which is hard to be extended to a heterogeneous FPGA cluster that comprises multiple types of FPGAs. As the FPGA cloud is expected to become increasingly heterogeneous due to the hardware rolling upgrade strategy, it is necessary to provide efficient virtualization support for the heterogeneous FPGA cluster. In this dissertation, we first identify three pairs of conflicting requirements from runtime management and offline compilation, which are related to the tradeoff between flexibility and efficiency. These conflicting requirements are the fundamental reason why the single-level abstraction proposed in prior works for the homogeneous FPGA cluster cannot be trivially extended to the heterogeneous cluster. To decouple these conflicting requirements, we provide a two-level system abstraction. Specifically, the high-level abstraction is FPGA-agnostic and provides a simple and homogeneous view of the FPGA resources to simplify the runtime management and maximize the flexibility. On the contrary, the low-level abstraction is FPGA-specific and exposes sufficient low-level hardware details to the compilation framework to ensure the mapping quality and maximize the efficiency. This generic two-level system abstraction can also be specialized to the homogeneous FPGA cluster and/or be extended to leverage application-specific information to further improve the efficiency. We also develop a compilation framework and a modular runtime system with a heuristic-based runtime management policy to support this two-level system abstraction. By enabling a dynamic FPGA sharing at the sub-FPGA granularity, the proposed virtualization solution can deploy 1.62x more applications using the same amount of FPGA resources and reduce the compilation time by 22.6% (perform as many compilation tasks in parallel as possible) with an acceptable virtualization overhead, i.e., Finally, we use Liquid Silicon as a case study to show that the proposed virtualization solution can be extended to other spatial reconfigurable architectures. Liquid Silicon is a homogeneous reconfigurable architecture enabled by the non-volatile memory technology (i.e., RRAM). It extends the configuration capability of existing FPGAs from computation to the whole spectrum ranging from computation to data storage. It allows users to better customize hardware by flexibly partitioning hardware resources between computation and memory based on the actual usage. Instead of naively applying the proposed virtualization solution onto Liquid Silicon, we co-optimize the system abstraction and Liquid Silicon architecture to improve the performance
    corecore