Search CORE

508 research outputs found

FPGA-Based CNN Inference Accelerator Synthesized from Multi-Threaded C Software

Author: Anderson Jason H.
Brothers John
Grady Brett
Kim Jin Hee
Lian Ruolong
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 27/07/2018
Field of study

A deep-learning inference accelerator is synthesized from a C-language software program parallelized with Pthreads. The software implementation uses the well-known producer/consumer model with parallel threads interconnected by FIFO queues. The LegUp high-level synthesis (HLS) tool synthesizes threads into parallel FPGA hardware, translating software parallelism into spatial parallelism. A complete system is generated where convolution, pooling and padding are realized in the synthesized accelerator, with remaining tasks executing on an embedded ARM processor. The accelerator incorporates reduced precision, and a novel approach for zero-weight-skipping in convolution. On a mid-sized Intel Arria 10 SoC FPGA, peak performance on VGG-16 is 138 effective GOPS

arXiv.org e-Print Archive

Crossref

Exploiting partial reconfiguration through PCIe for a microphone array network emulator

Author: Braeken An
da Silva Gomes Bruno
Domínguez Federico
Touhafi Abdellah
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2018
Field of study

The current Microelectromechanical Systems (MEMS) technology enables the deployment of relatively low-cost wireless sensor networks composed of MEMS microphone arrays for accurate sound source localization. However, the evaluation and the selection of the most accurate and power-efficient network’s topology are not trivial when considering dynamic MEMS microphone arrays. Although software simulators are usually considered, they consist of high-computational intensive tasks, which require hours to days to be completed. In this paper, we present an FPGA-based platform to emulate a network of microphone arrays. Our platform provides a controlled simulated acoustic environment, able to evaluate the impact of different network configurations such as the number of microphones per array, the network’s topology, or the used detection method. Data fusion techniques, combining the data collected by each node, are used in this platform. The platform is designed to exploit the FPGA’s partial reconfiguration feature to increase the flexibility of the network emulator as well as to increase performance thanks to the use of the PCI-express high-bandwidth interface. On the one hand, the network emulator presents a higher flexibility by partially reconfiguring the nodes’ architecture in runtime. On the other hand, a set of strategies and heuristics to properly use partial reconfiguration allows the acceleration of the emulation by exploiting the execution parallelism. Several experiments are presented to demonstrate some of the capabilities of our platform and the benefits of using partial reconfiguration

Crossref

Ghent University Academic Bibliography

Directory of Open Access Journals

06141 Abstracts Collection -- Dynamically Reconfigurable Architectures

Author: Athanas Peter M.
Brebner Gordon
Publication venue: Dagstuhl Seminar Proceedings. 06141 - Dynamically Reconfigurable Architectures
Publication date: 01/01/2006
Field of study

From 02.04.06 to 07.04.06, the Dagstuhl Seminar 06141 ``Dynamically Reconfigurable Architectures\u27\u27 was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available

Dagstuhl Research Online Publication Server

Exploiting run-time reconfigurable hardware in the development of automatic fingerprint-based personal recognition applications

Author: Francisco Fons
Mariano Fons
Publication venue: 'IntechOpen'
Publication date: 27/07/2011
Field of study

IntechOpen

Hardware implementation of intelligent systems

Author: Ranga Vemuri
Sriram Govindarajan
Iyad Ouaiss
Meenakshi Kaul
Vinoo Srinivasan
Shankar Radhakrishnan
Sujatha Sundaraman
Satish Ganesan
Awartika Pandey
Preetham Lakshmikanthan
Publication venue: Springer Link
Publication date: 01/01/2001
Field of study

interior, interior vie

Crossref

Lebanese American University Repository

MIT Libraries Dome

Boolean decomposition for AIG optimization

Author: Cortadella Jordi
Machado Lucas
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2017
Field of study

Restructuring techniques for And-Inverter Graphs (AIG), such as rewriting and refactoring, are powerful, scalable and fast, achieving highly optimized AIGs after few iterations. However, these techniques are biased by the original AIG structure and limited by single output optimizations. This paper investigates AIG optimization for area, exploring how far Boolean methods can reduce AIG nodes through local optimization.Boolean division is applied for multi-output functions using two-literal divisors and Boolean decomposition is introduced as a method for AIG optimization. Multi-output blocks are extracted from the AIG and optimized, achieving a further AIG node reduction of 7.76% on average for ITC99 and MCNC benchmarks.Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC

Vector coprocessor sharing techniques for multicores: performance and energy gains

Author: Beldianu Spiridon Florin
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/2012
Field of study

Vector Processors (VPs) created the breakthroughs needed for the emergence of computational science many years ago. All commercial computing architectures on the market today contain some form of vector or SIMD processing. Many high-performance and embedded applications, often dealing with streams of data, cannot efficiently utilize dedicated vector processors for various reasons: limited percentage of sustained vector code due to substantial flow control; inherent small parallelism or the frequent involvement of operating system tasks; varying vector length across applications or within a single application; data dependencies within short sequences of instructions, a problem further exacerbated without loop unrolling or other compiler optimization techniques. Additionally, existing rigid SIMD architectures cannot tolerate efficiently dynamic application environments with many cores that may require the runtime adjustment of assigned vector resources in order to operate at desired energy/performance levels. To simultaneously alleviate these drawbacks of rigid lane-based VP architectures, while also releasing on-chip real estate for other important design choices, the first part of this research proposes three architectural contexts for the implementation of a shared vector coprocessor in multicore processors. Sharing an expensive resource among multiple cores increases the efficiency of the functional units and the overall system throughput. The second part of the dissertation regards the evaluation and characterization of the three proposed shared vector architectures from the performance and power perspectives on an FPGA (Field-Programmable Gate Array) prototype. The third part of this work introduces performance and power estimation models based on observations deduced from the experimental results. The results show the opportunity to adaptively adjust the number of vector lanes assigned to individual cores or processing threads in order to minimize various energy-performance metrics on modern vector- capable multicore processors that run applications with dynamic workloads. Therefore, the fourth part of this research focuses on the development of a fine-to-coarse grain power management technique and a relevant adaptive hardware/software infrastructure which dynamically adjusts the assigned VP resources (number of vector lanes) in order to minimize the energy consumption for applications with dynamic workloads. In order to remove the inherent limitations imposed by FPGA technologies, the fifth part of this work consists of implementing an ASIC (Application Specific Integrated Circuit) version of the shared VP towards precise performance-energy studies involving high- performance vector processing in multicore environments

Digital Commons @ New Jersey Institute of Technology (NJIT)