Search CORE

547 research outputs found

Multi-Softcore Architecture on FPGA

Author: Mohamed Abid
Mouna Baklouti
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2014
Field of study

To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, they are mostly based on custom-logic design methodology. Designing parallel multicore systems using available standards intellectual properties yet maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs) are a cheap and fast option to develop and test such systems. This paper describes a FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of making the SoC using the NIOS II soft-processor core from Altera is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and also some parallel applications are used for testing speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication)

Crossref

Directory of Open Access Journals

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

Author: Blott Michaela
Fraser Nicholas J.
Gambardella Giulio
Jahre Magnus
Leong Philip
Umuroglu Yaman
Vissers Kees
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/12/2016
Field of study

Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 {\mu}s latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 {\mu}s latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 201

arXiv.org e-Print Archive

Crossref

NORA - Norwegian Open Research Archives

Designing parameterizable hardware IPs in a model-based design environment for high-level synthesis

Author: Butt Shahzad Ahmad
Lavagno Luciano
Roozmeh Mehdi
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2016
Field of study

Model-based hardware design allows one to map a single model to multiple hardware and/or software architectures, essentially eliminating one of the major limitations of manual coding in C or RTL. Model-based design for hardware implementation has traditionally offered a limited set of microarchitectures, which are typically suitable only for some application scenarios. In this article we illustrate how digital signal processing (DSP) algorithms can be modeled as flexible intellectual property blocks to be used within the popular Simulink model-based design environment. These blocks are written in C and are designed for both functional simulation and hardware implementation, including architectural design space exploration and hardware implementation through high-level synthesis. A key advantage of our modeling approach is that the very same bit-accurate model is used for simulation and high-level synthesis. To prove the feasibility of our proposed approach, we modeled a fast Fourier transform (FFT) algorithm and synthesized it for different DSP applications with very different performance and cost requirements. We also implemented a high-level-synthesis (HLS) intellectual property (IP) generator that can generate flexible FFT HLS-IP blocks that can be mapped to multiple micro-/macroarchitectures, to enable design space exploration as well as being used for functional simulation in the Simulink environment.</jats:p

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

EMVS: Embedded Multi Vector-core System

Author: Ayguadé Parra Eduard
Cristal Kestelman Adrián
Haider Amna
Hussain Tassadaq
Publication venue: 'Elsevier BV'
Publication date: 01/01/2018
Field of study

With the increase in the density and performance of digital electronics, the demand for a power-efficient high-performance computing (HPC) system has been increased for embedded applications. The existing embedded HPC systems suffer from issues like programmability, scalability, and portability. Therefore, a parameterizable and programmable high-performance processor system architecture is required to execute the embedded HPC applications. In this work, we proposed an Embedded Multi Vector-core System (EMVS) which executes the embedded application by managing the multiple vectorized tasks and their memory operations. The system is designed and ported on an Altera DE4 FPGA development board. The performance of EMVS is compared with the Heterogeneous Multi-Processing Odroid XU3, Parallela and GPU Jetson TK1 embedded systems. In contrast to the embedded systems, the results show that EMVS improves 19.28 and 10.22 times of the application and system performance respectively and consumes 10.6 times less energy.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Digital.CSIC

Coarse-grained reconfigurable array architectures

Author: A Lambrechts
B Bougard
B Bougard
B Mei
B Mei
B Mei
B Sutter De
G Venkataramani
H Park
H Park
J Lee
JMP Cardoso
JW Waerdt van de
K Berkel van
K Bondalapati
K Sankaralingam
KE Coons
LH Lee
M Ahn
M Gebhart
M Schlansker
M Taylor
M Woh
MD Galanis
MH Lee
S Friedman
SA Mahlke
T Oh
Y Kim
Y Kim
Y Kim
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010
Field of study

Coarse-Grained Reconﬁgurable Array (CGRA) architectures accelerate the same inner loops that beneﬁt from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efﬁciently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on ﬂexibility, performance, and power-efﬁciency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual ﬁne-tuning of source code

Crossref

Ghent University Academic Bibliography

High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP

Author: Akoglu Ali
Benkrid Khaled
Ling Cheng
Liu Ying
Song Yang
Tian Xiang
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2012
Field of study

This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high performance efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM’s Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on performance per watt criterion and perform better than all other platforms on performance per dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both performance per watt and performance per dollar criteria. In general, in order to outperform other technologies on performance per dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs

Crossref

Directory of Open Access Journals

Edinburgh Research Explorer

Embedded video stabilization system on field programmable gate array for unmanned aerial vehicle

Author: Mazlan Mohd. Farizal Firdaus
Publication venue
Publication date: 01/01/2017
Field of study

Unmanned Aerial Vehicles (UAVs) equipped with lightweight and low-cost cameras have grown in popularity and enable new applications of UAV technology. However, the video retrieved from small size UAVs is normally in low-quality due to high frequency jitter. This thesis presents the development of video stabilization algorithm implemented on Field Programmable Gate Array (FPGA). The video stabilization algorithm consists of three main processes, which are motion estimation, motion stabilization and motion compensation to minimize the jitter. Motion estimation involves block matching and Random Sample Consensus (RANSAC) to estimate the affine matrix that defines the motion perspective between two consecutive frames. Then, parameter extraction, motion smoothing and motion vector correction, which are parts of the motion stabilization, are tasked in removing unwanted camera movement. Finally, motion compensation stabilizes two consecutive frames based on filtered motion vectors. In order to facilitate the ground station mobility, this algorithm needs to be processed onboard the UAV in real-time. The nature of parallelization of video stabilization processing is suitable to be utilized by using FPGA in order to achieve real-time capability. The implementation of this system is on Altera DE2-115 FPGA board. Full hardware dedicated cores without Nios II processor are designed in stream-oriented architecture to accelerate the computation. Furthermore, a parallelized architecture consisting of block matching and highly parameterizable RANSAC processor modules show that the proposed system is able to achieve up to 30 frames per second processing and a good stabilization improvement up to 1.78 Interframe Transformation Fidelity value. Hence, it is concluded that the proposed system is suitable for real-time video stabilization for UAV application

Universiti Teknologi Malaysia Institutional Repository

Reconfigurable Logic Embedded Architecture of Support Vector Machine Linear Kernel

Author: Andromeda Trias
Marsono M. N.
Shaikh-Husin N.
Sirkunan Jeevan
Publication venue: 'Institute of Advanced Engineering and Science'
Publication date: 01/11/2017
Field of study

Support Vector Machine (SVM) is a linear binary classifier that requires a kernel function to handle non-linear problems. Most previous SVM implementations for embedded systems in literature were built targeting a certain application; where analyses were done through comparison with software im- plementations only. The impact of different application datasets towards SVM hardware performance were not analyzed. In this work, we propose a parameterizable linear kernel architecture that is fully pipelined. It is prototyped and analyzed on Altera Cyclone IV platform and results are verified with equivalent software model. Further analysis is done on determining the effect of the number of features and support vectors on the performance of the hardware architecture. From our proposed linear kernel implementation, the number of features determine the maximum operating frequency and amount of logic resource utilization, whereas the number of support vectors determines the amount of on-chip memory usage and also the throughput of the system

Proceeding of the Electrical Engineering Computer Science and Informatics

Reconfigurable Architectures:From Physical Implementation to Dynamic Behavoir Modelling

Author: Wu Kehuai
Publication venue
Publication date: 01/01/2008
Field of study

Online Research Database In Technology