    CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices

    Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve their energy efficiency and performance while maintaining accuracy. For DNNs, the model size is an important factor affecting performance, scalability, and energy efficiency. Weight pruning achieves good compression ratios but suffers from three drawbacks: 1) the irregular network structure after pruning; 2) the increased training complexity; and 3) the lack of a rigorous guarantee on compression ratio and inference accuracy. To overcome these limitations, this paper proposes CirCNN, a principled approach to representing weights and processing neural networks using block-circulant matrices. CirCNN utilizes Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (in both inference and training) from O(n^2) to O(n log n) and the storage complexity from O(n^2) to O(n), with negligible accuracy loss. Compared to other approaches, CirCNN is distinct due to its mathematical rigor: it can converge to the same effectiveness as DNNs without compression. The CirCNN architecture is a universal DNN inference engine that can be implemented on various hardware/software platforms with a configurable network architecture. To demonstrate its performance and energy efficiency, we test CirCNN on FPGA, ASIC, and embedded processors. Our results show that the CirCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on the FPGA implementation and ASIC synthesis results, CirCNN achieves 6-102X energy efficiency improvements compared with the best state-of-the-art results. Comment: 14 pages, 15 figures, conference.
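    To make the complexity claim concrete, the following is a minimal NumPy sketch (not the paper's code) of the FFT trick behind circulant weight blocks: a circulant matrix is fully defined by its first column, so its matrix-vector product reduces to element-wise multiplication in the frequency domain. The function name circulant_matvec_fft is illustrative.

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Multiply the circulant matrix defined by first column c with vector x.

    A circulant matrix is diagonalized by the DFT, so the O(n^2) product
    collapses to an element-wise multiplication in the frequency domain:
    y = IFFT(FFT(c) * FFT(x)). This costs O(n log n) time and O(n) storage,
    since only c is kept instead of the full n x n matrix.
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Sanity check against the explicit circulant matrix.
n = 8
c = np.random.randn(n)
x = np.random.randn(n)
C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])
assert np.allclose(C @ x, circulant_matvec_fft(c, x))
```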

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated into the existing deep learning ecosystem to provide a tunable balance between performance, power consumption, and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods, and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete, and in-depth evaluation of CNN-to-FPGA toolflows. Comment: Accepted for publication in ACM Computing Surveys (CSUR), 2018.

    FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

    Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional, and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 µs latency on the MNIST dataset with 95.8% accuracy, and 21,906 image classifications per second with 283 µs latency on the CIFAR-10 and SVHN datasets with 80.1% and 94.9% accuracy, respectively. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks. Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 2017.
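    As an illustration of why binarized layers map so well onto FPGA logic, here is a hedged NumPy sketch of the XNOR-popcount dot product that replaces multiply-accumulate when weights and activations are constrained to {-1, +1}; the function name binary_dot and the bit encoding are assumptions for this example, not FINN's actual implementation.

```python
import numpy as np

def binary_dot(w_bits, x_bits):
    """Dot product of {-1, +1} vectors encoded as {0, 1} bits.

    With the encoding -1 -> 0 and +1 -> 1, each product term is +1 when the
    bits match and -1 otherwise, so the dot product reduces to
    2 * popcount(XNOR(w, x)) - n. On an FPGA this becomes LUT-level XNORs
    plus a popcount tree instead of multipliers.
    """
    n = len(w_bits)
    xnor = np.logical_not(np.logical_xor(w_bits, x_bits))
    return 2 * int(np.count_nonzero(xnor)) - n

# Check against the straightforward {-1, +1} arithmetic.
rng = np.random.default_rng(0)
w_bits = rng.integers(0, 2, size=16)
x_bits = rng.integers(0, 2, size=16)
w = 2 * w_bits - 1
x = 2 * x_bits - 1
assert binary_dot(w_bits, x_bits) == int(np.dot(w, x))
```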

    WRPN: Wide Reduced-Precision Networks

    For computer vision applications, prior works have shown the efficacy of reducing the numeric precision of model parameters (network weights) in deep neural networks. Activation maps, however, occupy a large memory footprint during both training and inference when using mini-batches of inputs. One way to reduce this large memory footprint is to reduce the precision of activations. However, past works have shown that reducing the precision of activations hurts model accuracy. We study schemes to train networks from scratch using reduced-precision activations without hurting accuracy. We reduce the precision of activation maps (along with model parameters) and increase the number of filter maps in a layer, and find that this scheme matches or surpasses the accuracy of the baseline full-precision network. As a result, one can significantly improve execution efficiency (e.g., reduce dynamic memory footprint, memory bandwidth, and computational energy) and speed up training and inference with appropriate hardware support. We call our scheme WRPN - wide reduced-precision networks. We report results showing that the WRPN scheme achieves better accuracy on the ILSVRC-12 dataset than previously reported reduced-precision networks while being computationally less expensive.
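    The following sketch illustrates the two ingredients the abstract describes: a generic uniform k-bit quantizer for activations clipped to [0, 1] and a simple widening factor for the number of filter maps. The function names and the particular quantizer are assumptions for illustration, not WRPN's exact formulation.

```python
import numpy as np

def quantize_activations(x, k_bits=4):
    """Uniform k-bit quantization of activations clipped to [0, 1].

    A common reduced-precision scheme: after ReLU, clip to [0, 1] and snap to
    one of 2^k evenly spaced levels. WRPN pairs this kind of activation
    quantization with wider layers to recover accuracy.
    """
    levels = 2 ** k_bits - 1
    x = np.clip(x, 0.0, 1.0)
    return np.round(x * levels) / levels

def widen_layer(num_filters, widening_factor=2):
    """Widen a layer: more (cheap, low-precision) filter maps per layer."""
    return int(num_filters * widening_factor)

acts = np.random.rand(4, 8)          # hypothetical activation map
q = quantize_activations(acts, k_bits=2)
print(np.unique(q))                  # at most 2^2 = 4 distinct levels
print(widen_layer(64, 2))            # e.g. 64 filters widened to 128
```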

    Recent Advances in Convolutional Neural Network Acceleration

    In recent years, convolutional neural networks (CNNs) have shown great performance in various fields such as image classification, pattern recognition, and multimedia compression. Two of their properties, local connectivity and weight sharing, can reduce the number of parameters and increase processing speed during training and inference. However, as the dimensionality of data grows and CNN architectures become more complicated, end-to-end or combined uses of CNNs become computationally intensive, which limits their further deployment. It is therefore necessary and urgent to accelerate CNNs. In this paper, we first summarize acceleration methods that contribute to, but are not limited to, CNNs by reviewing a broad variety of research papers. We propose a taxonomy of acceleration methods in terms of three levels: structure level, algorithm level, and implementation level. We also analyze the acceleration methods in terms of CNN architecture compression, algorithm optimization, and hardware-based improvement. Finally, we discuss the different perspectives of these acceleration and optimization methods within each level. The discussion shows that the methods at each level still leave a large exploration space. By incorporating such a wide range of disciplines, we expect to provide a comprehensive reference for researchers who are interested in CNN acceleration. Comment: Submitted to Neurocomputing.

    Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework

    Hardware acceleration of deep learning systems has been extensively investigated in industry and academia. The aim of this paper is to achieve ultra-high energy efficiency and performance for hardware implementations of deep neural networks (DNNs). An algorithm-hardware co-optimization framework is developed, which is applicable to different DNN types, sizes, and application scenarios. The algorithm part adopts general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. It applies to both fully-connected and convolutional layers and contains a mathematically rigorous proof of the effectiveness of the method. The proposed algorithm reduces computational complexity per layer from O(n^2) to O(n log n) and storage complexity from O(n^2) to O(n), for both training and inference. The hardware part consists of highly efficient Field Programmable Gate Array (FPGA)-based implementations using effective reconfiguration, batch processing, deep pipelining, resource re-use, and hierarchical control. Experimental results demonstrate that the proposed framework achieves at least 152X speedup and 71X energy efficiency gain compared with the IBM TrueNorth processor under the same test accuracy. It achieves at least 31X energy efficiency gain compared with the reference FPGA-based work. Comment: 6 figures, AAAI Conference on Artificial Intelligence, 2018.
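    As a complement to the single-block FFT sketch under the CirCNN entry above, here is a hedged NumPy sketch of a fully-connected layer whose weight matrix is partitioned into circulant blocks, each applied via the FFT; the block layout and the function name block_circulant_fc are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def block_circulant_fc(x, blocks):
    """Forward pass of an FC layer whose weight matrix is block-circulant.

    blocks[p][q] holds the defining vector (length k) of the circulant block
    in block-row p, block-column q. Storage per block is O(k) instead of
    O(k^2), and each block product runs in O(k log k) via the FFT.
    """
    k = len(blocks[0][0])
    p_rows, q_cols = len(blocks), len(blocks[0])
    x_blocks = x.reshape(q_cols, k)
    y = np.zeros(p_rows * k)
    for p in range(p_rows):
        acc = np.zeros(k)
        for q in range(q_cols):
            cf = np.fft.fft(blocks[p][q])
            acc += np.real(np.fft.ifft(cf * np.fft.fft(x_blocks[q])))
        y[p * k:(p + 1) * k] = acc
    return y

# Example: a 4x8 weight matrix represented by 2x4 circulant blocks of size 2.
k, p_rows, q_cols = 2, 2, 4
blocks = [[np.random.randn(k) for _ in range(q_cols)] for _ in range(p_rows)]
x = np.random.randn(q_cols * k)
print(block_circulant_fc(x, blocks).shape)  # (4,)
```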

    Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges

    With the emergence of big data applications such as machine learning, speech recognition, artificial intelligence, and DNA sequencing in recent years, computer architecture research communities are facing an explosion in the scale of data. To achieve high efficiency in data-intensive computing, heterogeneous accelerators targeting these applications have become a hot topic in the computer architecture domain. At present, heterogeneous accelerators are mainly implemented with computing units such as Application-Specific Integrated Circuits (ASICs), Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). Among these architectures, FPGA-based reconfigurable accelerators have two merits. First, an FPGA contains a large number of reconfigurable circuits, which can satisfy the high-performance and low-power requirements of specific applications. Second, FPGA-based reconfigurable architectures allow prototype systems to be built rapidly and offer excellent customizability and reconfigurability. Top-tier computer architecture conferences have recently featured a growing number of acceleration works based on FPGAs or other reconfigurable architectures. To review this recent work, this survey takes the latest research on reconfigurable accelerator architectures and their algorithmic applications as its basis. We compare active research issues and application domains, and analyze the advantages, disadvantages, and challenges of reconfigurable accelerators. Finally, we discuss likely future trends in accelerator architectures, hoping to provide a reference for computer architecture researchers.

    Keynote: Small Neural Nets Are Beautiful: Enabling Embedded Systems with Small Deep-Neural-Network Architectures

    Over the last five years, Deep Neural Nets have offered more accurate solutions to many problems in speech recognition and computer vision, and these solutions have surpassed a threshold of acceptability for many applications. As a result, Deep Neural Networks have supplanted other approaches to solving problems in these areas and enabled many new applications. While the design of Deep Neural Nets is still something of an art form, in our work we have found basic principles of design space exploration used to develop embedded microprocessor architectures to be highly applicable to the design of Deep Neural Net architectures. In particular, we have used these design principles to create a novel Deep Neural Net called SqueezeNet that requires as little as 480 KB of storage for its model parameters. We have further integrated all these experiences to develop something of a playbook for creating small Deep Neural Nets for embedded systems. Comment: Keynote at Embedded Systems Week (ESWEEK) 2017.

    NICE: Noise Injection and Clamping Estimation for Neural Network Quantization

    Convolutional Neural Networks (CNNs) are very popular in many fields, including computer vision, speech recognition, and natural language processing, to name a few. Though deep learning leads to groundbreaking performance in these domains, the networks used are computationally demanding and far from real-time even on a GPU, which is itself not power efficient and therefore does not suit low-power systems such as mobile devices. To overcome this challenge, solutions have been proposed for quantizing the weights and activations of these networks, which accelerates runtime significantly. Yet, this acceleration comes at the cost of a larger error. The NICE method proposed in this work trains quantized neural networks by noise injection and a learned clamping, which improve accuracy. This leads to state-of-the-art results on various regression and classification tasks, e.g., ImageNet classification with architectures such as ResNet-18/34/50 with as low as 3-bit weights and activations. We implement the proposed solution on an FPGA to demonstrate its applicability for low-power real-time applications. The implementation of the paper is available at https://github.com/Lancer555/NIC
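    Below is a hedged sketch of the two ideas named in the abstract: uniform quantization with a clamp value (learned in the paper, passed as a plain argument here) and injection of quantization-scale noise into a random subset of weights during training. The functions, step sizes, and noise fraction are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def clamp_and_quantize(x, clamp_val, k_bits=3):
    """Clamp activations to [0, clamp_val] and quantize to 2^k - 1 levels.

    In NICE-style training the clamp value is a learned parameter; here it
    is just an argument.
    """
    levels = 2 ** k_bits - 1
    x = np.clip(x, 0.0, clamp_val)
    step = clamp_val / levels
    return np.round(x / step) * step

def noisy_weights(w, k_bits=3, noise_fraction=0.05, rng=None):
    """Simulate quantization error on a random subset of weights.

    Instead of hard-quantizing everything during training, additive uniform
    noise on the scale of one quantization step is injected into a fraction
    of the weights, keeping gradients informative (a sketch of the idea,
    not the paper's exact schedule).
    """
    if rng is None:
        rng = np.random.default_rng()
    step = (w.max() - w.min()) / (2 ** k_bits - 1)
    mask = rng.random(w.shape) < noise_fraction
    noise = rng.uniform(-step / 2, step / 2, size=w.shape)
    return np.where(mask, w + noise, w)

w = np.random.randn(4, 4)
print(noisy_weights(w).shape)
print(clamp_and_quantize(np.random.rand(8) * 3.0, clamp_val=2.0, k_bits=3))
```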

    EIE: Efficient Inference Engine on Compressed Deep Neural Network

    State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations and dominates the required power. The previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes the FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000x and 3,400x more energy efficient than a CPU and a GPU, respectively. Compared with DaDianNao, EIE has 2.9x, 19x, and 3x better throughput, energy efficiency, and area efficiency. Comment: External links: TheNextPlatform: http://goo.gl/f7qX0L ; O'Reilly: https://goo.gl/Id1HNT ; Hacker News: https://goo.gl/KM72SV ; Embedded-vision: http://goo.gl/joQNg8 ; Talk at NVIDIA GTC'16: http://goo.gl/6wJYvn ; Talk at Embedded Vision Summit: https://goo.gl/7abFNe ; Talk at Stanford University: https://goo.gl/6lwuer. Published as a conference paper in ISCA 2016.
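    To illustrate the kind of computation EIE accelerates, here is a hedged Python sketch of a sparse matrix-vector product with a shared-weight codebook that skips zero activations; the column-wise storage format and the function name shared_weight_spmv are simplified assumptions, not EIE's actual compressed-sparse-column encoding.

```python
import numpy as np

def shared_weight_spmv(columns, codebook, x):
    """Sparse matrix-vector product with weight sharing, EIE-style in spirit.

    The matrix is stored column-wise: for each column j, a list of
    (row_index, codebook_index) pairs for its nonzeros. Weights are shared
    via a small codebook (e.g. 16 entries for 4-bit indices). Columns whose
    input activation is zero (after ReLU) are skipped entirely.
    """
    n_rows = max((r for col in columns for r, _ in col), default=-1) + 1
    y = np.zeros(n_rows)
    for j, xj in enumerate(x):
        if xj == 0.0:                   # dynamic activation sparsity
            continue
        for row, code in columns[j]:    # static weight sparsity
            y[row] += codebook[code] * xj
    return y

# Tiny example: a 3x3 sparse matrix with a 4-entry weight codebook.
codebook = np.array([0.0, 0.5, -1.0, 2.0])
columns = [
    [(0, 1), (2, 3)],   # column 0 nonzeros
    [],                 # column 1 is empty
    [(1, 2)],           # column 2
]
x = np.array([1.0, 5.0, 0.0])   # third activation is zero and gets skipped
print(shared_weight_spmv(columns, codebook, x))  # [0.5, 0.0, 2.0]
```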