Search CORE

531 research outputs found

Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs

Author: Kadetotad Deepak
Kim Minkyu
Linares Barranco Alejandro
Ríos Navarro José Antonio
Seo Jae-Sun
Tapiador Morales Ricardo
Publication venue: 'SAGE Publications'
Publication date: 01/01/2016
Field of study

Deep learning has significantly advanced the state of the art in artificial intelligence, gaining wide popularity from both industry and academia. Special interest is around Convolutional Neural Networks (CNN), which take inspiration from the hierarchical structure of the visual cortex, to form deep layers of convolutional operations, along with fully connected classifiers. Hardware implementations of these deep CNN architectures are challenged with memory bottlenecks that require many convolution and fully-connected layers demanding large amount of communication for parallel computation. Multi-core CPU based solutions have demonstrated their inadequacy for this problem due to the memory wall and low parallelism. Many-core GPU architectures show superior performance but they consume high power and also have memory constraints due to inconsistencies between cache and main memory. FPGA design solutions are also actively being explored, which allow implementing the memory hierarchy using embedded BlockRAM. This boosts the parallel use of shared memory elements between multiple processing units, avoiding data replicability and inconsistencies. This makes FPGAs potentially powerful solutions for real-time classification of CNNs. Both Altera and Xilinx have adopted OpenCL co-design framework from GPU for FPGA designs as a pseudo-automatic development solution. In this paper, a comprehensive evaluation and comparison of Altera and Xilinx OpenCL frameworks for a 5-layer deep CNN is presented. Hardware resources, temporal performance and the OpenCL architecture for CNNs are discussed. Xilinx demonstrates faster synthesis, better FPGA resource utilization and more compact boards. Altera provides multi-platforms tools, mature design community and better execution times

idUS. Depósito de Investigación Universidad de Sevilla

Neuromorphic deep convolutional neural network learning systems for FPGA in real time

Author: Tapiador Morales Ricardo
Publication venue
Publication date: 13/12/2019
Field of study

Deep Learning algorithms have become one of the best approaches for pattern recognition in several fields, including computer vision, speech recognition, natural language processing, and audio recognition, among others. In image vision, convolutional neural networks stand out, due to their relatively simple supervised training and their efficiency extracting features from a scene. Nowadays, there exist several implementations of convolutional neural networks accelerators that manage to perform these networks in real time. However, the number of operations and power consumption of these implementations can be reduced using a different processing paradigm as neuromorphic engineering. Neuromorphic engineering field studies the behavior of biological and inner systems of the human neural processing with the purpose of design analog, digital or mixed-signal systems to solve problems inspired in how human brain performs complex tasks, replicating the behavior and properties of biological neurons. Neuromorphic engineering tries to give an answer to how our brain is capable to learn and perform complex tasks with high efficiency under the paradigm of spike-based computation. This thesis explores both frame-based and spike-based processing paradigms for the development of hardware architectures for visual pattern recognition based on convolutional neural networks. In this work, two FPGA implementations of convolutional neural networks accelerator architectures for frame-based using OpenCL and SoC technologies are presented. Followed by a novel neuromorphic convolution processor for spike-based processing paradigm, which implements the same behaviour of leaky integrate-and-fire neuron model. Furthermore, it reads the data in rows being able to perform multiple layers in the same chip. Finally, a novel FPGA implementation of Hierarchy of Time Surfaces algorithm and a new memory model for spike-based systems are proposed

idUS. Depósito de Investigación Universidad de Sevilla

Virtual Prototyping for Dynamically Reconfigurable Architectures using Dynamic Generic Mapping

Author: Gibson Darrell
Long David Ian
Vasilko Milan
Publication venue
Publication date: 26/10/1998
Field of study

This paper presents a virtual prototyping methodology for Dynamically Reconfigurable (DR) FPGAs. The methodology is based around a library of VHDL image processing components and allows the rapid prototyping and algorithmic development of low-level image processing systems. For the effective modelling of dynamically reconfigurable designs a new technique named, Dynamic Generic Mapping is introduced. This method allows efficient representation of dynamic reconfiguration without needing any additional components to model the reconfiguration process. This gives the designer more flexibility in modelling dynamic configurations than other methodologies. Models created using this technique can then be simulated and targeted to a specific technology using the same code. This technique is demonstrated through the realisation of modules for a motion tracking system targeted to a DR environment, RIFLE-62

Bournemouth University Research Online

XNOR Neural Engine: a Hardware Accelerator IP for 21.6 fJ/op Binary Neural Network Inference

Author: Benini Luca
Conti Francesco
Schiavone Pasquale Davide
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018
Field of study

Binary Neural Networks (BNNs) are promising to deliver accuracy comparable to conventional deep neural networks at a fraction of the cost in terms of memory and energy. In this paper, we introduce the XNOR Neural Engine (XNE), a fully digital configurable hardware accelerator IP for BNNs, integrated within a microcontroller unit (MCU) equipped with an autonomous I/O subsystem and hybrid SRAM / standard cell memory. The XNE is able to fully compute convolutional and dense layers in autonomy or in cooperation with the core in the MCU to realize more complex behaviors. We show post-synthesis results in 65nm and 22nm technology for the XNE IP and post-layout results in 22nm for the full MCU indicating that this system can drop the energy cost per binary operation to 21.6fJ per operation at 0.4V, and at the same time is flexible and performant enough to execute state-of-the-art BNN topologies such as ResNet-34 in less than 2.2mJ per frame at 8.9 fps.Comment: 11 pages, 8 figures, 2 tables, 3 listings. Accepted for presentation at CODES'18 and for publication in IEEE Transactions on Computer-Aided Design of Circuits and Systems (TCAD) as part of the ESWEEK-TCAD special issu

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna