209 research outputs found
Hardware-Efficient Structure of the Accelerating Module for Implementation of Convolutional Neural Network Basic Operation
This paper presents a structural design of a hardware-efficient module for
implementing the basic operation of a convolutional neural network (CNN) with
reduced implementation complexity. For this purpose, we use a modification of
the Winograd minimal filtering method together with computation vectorization
principles. The module calculates the inner products of two consecutive
segments of the original data sequence, formed by a sliding window of length 3,
with the elements of a filter impulse response. A fully parallel structure
that computes these two inner products by the naive method requires 6 binary
multipliers and 4 binary adders. Using the Winograd minimal filtering method
makes it possible to construct a module structure that requires only 4 binary
multipliers and 8 binary adders. Since a high-performance convolutional neural
network can contain tens or even hundreds of such modules, this reduction can
have a significant effect.
Comment: 3 pages, 5 figures
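As a hedged illustration (a sketch of the standard Winograd F(2,3) minimal filtering algorithm, not the paper's exact hardware structure), the multiplier/adder counts quoted above can be checked directly: the naive method uses 6 multiplications and 4 additions, while the Winograd form uses 4 multiplications and 8 additions for the same two inner products. Function names are illustrative.

```python
def conv_naive(d, g):
    # Two inner products of a length-3 filter over a 4-sample window:
    # 6 multiplications, 4 additions.
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return y0, y1

def conv_winograd_f23(d, g):
    # Winograd F(2,3): 4 multiplications, 8 additions on the data path.
    # The filter-side combinations can be precomputed once per filter
    # (in hardware they are constants of the impulse response).
    g0 = g[0]
    g1 = (g[0] + g[1] + g[2]) / 2
    g2 = (g[0] - g[1] + g[2]) / 2
    g3 = g[2]
    m1 = (d[0] - d[2]) * g0   # mult 1 (add 1)
    m2 = (d[1] + d[2]) * g1   # mult 2 (add 2)
    m3 = (d[2] - d[1]) * g2   # mult 3 (add 3)
    m4 = (d[1] - d[3]) * g3   # mult 4 (add 4)
    y0 = m1 + m2 + m3         # adds 5, 6
    y1 = m2 - m3 - m4         # adds 7, 8
    return y0, y1
```

Both routines return the same two outputs for any data window and filter, which is why the substitution is safe at the module level.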
PCNNA: A Photonic Convolutional Neural Network Accelerator
Convolutional Neural Networks (CNN) have been the centerpiece of many
applications including but not limited to computer vision, speech processing,
and Natural Language Processing (NLP). However, the computationally expensive
convolution operations impose many challenges to the performance and
scalability of CNNs. In parallel, photonic systems, which are traditionally
employed for data communication, have enjoyed recent popularity for data
processing due to their high bandwidth, low power consumption, and
reconfigurability. Here we propose a Photonic Convolutional Neural Network
Accelerator (PCNNA) as a proof of concept design to speedup the convolution
operation for CNNs. Our design is based on the recently introduced silicon
photonic microring weight banks, which use a broadcast-and-weight protocol to
perform Multiply-and-Accumulate (MAC) operations and move data through layers of
a neural network. Here, we aim to exploit the synergy between the inherent
parallelism of photonics in the form of Wavelength Division Multiplexing (WDM)
and sparsity of connections between input feature maps and kernels in CNNs.
While our full system design offers more than 3 orders of magnitude speedup in
execution time, its optical core potentially offers more than 5 orders of
magnitude speedup compared to state-of-the-art electronic counterparts.
Comment: 5 Pages, 6 Figures, IEEE SOCC 201
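A hedged conceptual model of the broadcast-and-weight MAC described above (not the PCNNA design itself): each input value rides on its own WDM wavelength, a tuned microring applies a per-wavelength transmission weight, and a balanced photodetector sums the weighted channels in one step. The function name and the [-1, 1] weight range are illustrative assumptions.

```python
def microring_weight_bank_mac(inputs, weights):
    """Model one broadcast-and-weight MAC: per-wavelength weighting
    followed by summation at the photodetector."""
    if len(inputs) != len(weights):
        raise ValueError("one weight per wavelength channel")
    for w in weights:
        if not -1.0 <= w <= 1.0:
            # Balanced detection lets an effective weight span [-1, 1].
            raise ValueError("microring transmission weights lie in [-1, 1]")
    # All channels are summed at the detector regardless of channel count,
    # which is where the WDM parallelism of the approach comes from.
    return sum(x * w for x, w in zip(inputs, weights))
```

The key point the sketch makes is that the accumulation cost does not grow with the number of wavelength channels, in contrast to a sequential electronic MAC loop.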
A Review on AI Chip Design
In recent years, artificial intelligence (AI) technologies have been widely used in many business areas. With the attention and investment of scientific researchers and research companies around the world, artificial intelligence technologies have proven their irreplaceable value in traditional speech recognition, image recognition, search/recommendation engines, and other areas. At the same time, however, the computational demands of artificial intelligence technologies are increasing dramatically, posing a huge challenge to the computing power of hardware devices. In this paper, we first describe the development direction of AI chip technology, including the technical shortcomings of existing AI chips, and then present the directions AI chip development has taken in recent years
Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs
Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity from both industry and academia. Special interest is around Convolutional Neural Networks (CNNs), which take inspiration from the hierarchical structure of the visual cortex to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged by memory bottlenecks: the many convolution and fully connected layers demand a large amount of communication for parallel computation. Multi-core CPU based solutions have demonstrated their inadequacy for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance, but they consume considerable power and also have memory constraints due to inconsistencies between cache and main memory. FPGA design solutions are also actively being explored, which allow implementing the memory hierarchy using embedded BlockRAM. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replication and inconsistencies. This makes FPGAs potentially powerful solutions for real-time classification with CNNs. Both Altera and Xilinx have adopted the OpenCL co-design framework from GPUs for FPGA designs as a pseudo-automatic development solution. In this paper, a comprehensive evaluation and comparison of the Altera and Xilinx OpenCL frameworks for a 5-layer deep CNN is presented. Hardware resources, temporal performance, and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization, and more compact boards. Altera provides multi-platform tools, a mature design community, and better execution times
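To make the memory-bottleneck argument above concrete, a back-of-envelope sketch can count the MACs and the off-chip traffic of a single convolution layer; the ratio of the two (arithmetic intensity) indicates whether a layer is compute-bound or memory-bound. The layer shapes used below are illustrative assumptions, not figures from the paper.

```python
def conv_layer_costs(h, w, c_in, c_out, k, bytes_per_value=4):
    """Rough MAC count and memory traffic for one convolution layer
    (stride 1, 'same' padding); all shapes here are illustrative only."""
    macs = h * w * c_out * c_in * k * k
    traffic = bytes_per_value * (
        h * w * c_in            # read input feature maps
        + c_out * c_in * k * k  # read filter weights
        + h * w * c_out         # write output feature maps
    )
    # Low MACs-per-byte means the layer is memory-bound, which is what
    # motivates on-chip memory hierarchies such as FPGA BlockRAM.
    return macs, traffic, macs / traffic
```

For example, a 32x32 layer with 64 input and 64 output channels and 3x3 kernels performs about 37.7 million MACs while moving well under a megabyte, so buffering its working set on chip removes most of the external-memory pressure.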