331 research outputs found

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    Get PDF
    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

    Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine

    Get PDF
    Deep neural networks have achieved impressive results in computer vision and machine learning. Unfortunately, state-of-the-art networks are extremely compute and memory intensive which makes them unsuitable for mW-devices such as IoT end-nodes. Aggressive quantization of these networks dramatically reduces the computation and memory footprint. Binary-weight neural networks (BWNs) follow this trend, pushing weight quantization to the limit. Hardware accelerators for BWNs presented up to now have focused on core efficiency, disregarding I/O bandwidth and system-level efficiency that are crucial for deployment of accelerators in ultra-low power devices. We present Hyperdrive: a BWN accelerator dramatically reducing the I/O bandwidth exploiting a novel binary-weight streaming approach, which can be used for arbitrarily sized convolutional neural network architecture and input resolution by exploiting the natural scalability of the compute units both at chip-level and system-level by arranging Hyperdrive chips systolically in a 2D mesh while processing the entire feature map together in parallel. Hyperdrive achieves 4.3 TOp/s/W system-level efficiency (i.e., including I/Os)---3.1x higher than state-of-the-art BWN accelerators, even if its core uses resource-intensive FP16 arithmetic for increased robustness

    Hardware Accelerator Design for Machine Learning

    Get PDF
    Machine learning is widely used in many modern artificial intelligence applications. Various hardware platforms are implemented to support such applications. Among them, graphics processing unit (GPU) is the most widely used one due to its fast computation speed and compatibility with various algorithms. Field programmable gate arrays (FPGA) show better energy efficiency compared with GPU when computing machine learning algorithm at the cost of low speed. Finally, various application specific integrated circuit (ASIC) architecture is proposed to achieve the best energy efficiency at the cost of less reconfigurability which makes it suitable for special kinds of machine learning algorithms such as a deep convolutional neural network. Finally, analog computing shows a promising methodology to compute large-sized machine learning algorithm due to its low design cost and fast computing speed; however, due to the requirement of the analog-to-digital converter (ADC) in the analog computing, this kind of technique is only applicable to low computation resolution, making it unsuitable for most artificial intelligence (AI) applications

    StreamSVD: Low-rank approximation and streaming accelerator co-design

    Get PDF
    The post-training compression of a Convolutional Neural Network (CNN) aims to produce Pareto-optimal designs on the accuracy-performance frontier when the access to training data is not possible. Low-rank approximation is one of the methods that is often utilised in such cases. However, existing work considers the low-rank approximation of the network and the optimisation of the hardware accelerator separately, leading to systems with sub-optimal performance. This work focuses on the efficient mapping of a CNN into an FPGA device, and presents StreamSVD, a model-accelerator co-design framework 1 . The framework considers simultaneously the compression of a CNN model through a hardware-aware low-rank approximation scheme, and the optimisation of the hardware accelerator's architecture by taking into account the approximation scheme's compute structure. Our results show that the co-designed StreamSVD outperforms existing work that utilises similar low-rank approximation schemes by providing better accuracy-throughput trade-off. The proposed framework also achieves competitive performance compared with other post-training compression methods, even outperforming them under certain cases
    corecore