14 research outputs found
Software-defined Design Space Exploration for an Efficient DNN Accelerator Architecture
Deep neural networks (DNNs) have been shown to outperform conventional
machine learning algorithms across a wide range of applications, e.g., image
recognition, object detection, robotics, and natural language processing.
However, the high computational complexity of DNNs often necessitates extremely
fast and efficient hardware. The problem gets worse as the size of neural
networks grows exponentially. As a result, customized hardware accelerators
have been developed to accelerate DNN processing without sacrificing model
accuracy. However, previous accelerator design studies have not fully
considered the characteristics of the target applications, which may lead to
sub-optimal architecture designs. On the other hand, new DNN models have been
developed for better accuracy, but their compatibility with the underlying
hardware accelerator is often overlooked. In this article, we propose an
application-driven framework for architectural design space exploration of DNN
accelerators. This framework is based on a hardware analytical model of
individual DNN operations. It models the accelerator design task as a
multi-dimensional optimization problem. We demonstrate that it can be
efficaciously used in application-driven accelerator architecture design. Given
a target DNN, the framework can generate efficient accelerator design solutions
with optimized performance and area. Furthermore, we explore the opportunity to
use the framework for accelerator configuration optimization under simultaneous
diverse DNN applications. The framework is also capable of improving neural
network models to best fit the underlying hardware resources
Improved Subset Generation For The MU-Decoder
The MU-Decoder is a hardware subset generator that finds use in partial reconfiguration of FPGAs and in numerous other applications. It is capable of generating a set S of subsets of a large set Z_n with n elements. If the subsets in S satisfy the âisomorphic totally- ordered propertyâ, then the MU-Decoder works very efficiently to produce a set of u subsets in O(log n) time and Î(n âu log n) gate cost. In contrast, a vain approach requires Î(un) gate cost. We show that this low cost for the MU-Decoder can be achieved without the isomorphism constraint, thereby allowing S to include a much wider range of subsets. We also show that if additional constraints on the relative sizes of the subsets in S can be placed, then u subsets can be generated with Î(n âu) cost. This uses a new hardware enhancement proposed in this thesis. Finally, we show that by properly selecting S and by using some elements of traditional methods, a set of Î (un^log( log (n/log n))) subsets can be produced with Î(n âu) cost
Acceleration of Deep Learning on FPGA
In recent years, deep convolutional neural networks (ConvNet) have shown their popularity in various real world applications. To provide more accurate results, the state-of-the-art ConvNet requires millions of parameters and billions of operations to process a single image, which represents a computational challenge for general purpose processors. As a result, hardware accelerators such as Graphic Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), have been adopted to improve the performance of ConvNet. However, GPU-based solution consumes a considerable amount of power and a traditional RTL design on FPGA requires tedious development that is very time-consuming. In this work, we propose a scalable and parameterized end-to-end ConvNet design using Intel FPGA SDK for OpenCL. To validate the design, we implement VGG 16 model on two different FPGA boards. Consequently, our designs achieve 306.41 GOPS on Intel Stratix A7 and 318.94 GOPS on Intel Arria 10 GX 10AX115. To the best of our knowledge, this outperforms previous FPGA-based accelerators. Compared to the CPU (Intel Xeon E5-2620) and a mid-range GPU (Nvidia K40), our design is 24.3X and 1.7X more energy efficient respectively
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201