3,052 research outputs found

Pipeline-Based Power Reduction in FPGA Applications

Author: Díaz Lavadores Antonio
Rodellar Biarge M. Victoria
Sacristán Miguel Angel
Publication venue: Facultad de Informática (UPM)
Publication date: 01/01/2008

This paper shows how temporal parallelism plays an important role in reducing power dissipation in FPGAs. Glitch propagation is blocked by the flip-flops or registers in the pipeline. Several multiplication structures are implemented on modern FPGAs, Stratix II and Virtex-4, and their results are compared with and without pipelining and hardware duplication.

Archivo Digital UPM

Fast HUB Floating-point Adder for FPGA

Author: Gonzalez-Navarro Sonia
Hormigo-Aguilar Javier
Villalba-Moreno Julio
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 17/10/2018

Several previous publications have shown the area and delay reduction obtained when implementing real-number computation using HUB formats for both floating-point and fixed-point. In this paper, we present a HUB floating-point adder for FPGA which greatly improves the speed of previously proposed HUB designs for these devices. Our architecture is based on the double-path technique, which reduces the execution time since the two paths work in parallel. We also deal with the implementation of unbiased rounding in the proposed adder. Experimental results are presented showing the advantages of the new HUB adder for FPGA.

Funding: TIN2016-80920-R, JA2012 P12-TIC-1692, JA2012 P12-TIC-147

Crossref

Repositorio Institucional Universidad de Málaga

Maximizing CNN Accelerator Efficiency Through Resource Partitioning

Author: Alwani M.
Krizhevsky Alex
Li Huimin
van den Oord Aäron
Publication date: 12/04/2018

Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor structure is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach when evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x.

arXiv.org e-Print Archive

Crossref

Pipelining Saturated Accumulation

Author: Chan Stephanie
DeHon André
Kapre Nachiket
Papadantonakis Karl
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 02/04/2008

Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3 (XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM.
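
As a rough illustration of the reformulation idea in this abstract, the sketch below treats each saturated addition as a function `y -> clamp(y + x, lo, hi)` and stores it as a triple `(offset, lower, upper)`. Functions of this shape are closed under composition, and function composition is associative, so the steps can be combined pairwise in a tree rather than one after another. The triple encoding and all names (`compose`, `sat_accumulate_tree`, and so on) are assumptions made for this example; the abstract does not give the authors' actual formulation or hardware structure.

```python
def clamp(v, lo, hi):
    """Saturate v into the closed range [lo, hi]."""
    return min(max(v, lo), hi)

def compose(f, g):
    """Triple for 'apply f, then g', where a triple (a, lo, hi)
    stands for the function y -> clamp(y + a, lo, hi)."""
    a1, lo1, hi1 = f
    a2, lo2, hi2 = g
    return (a1 + a2,
            clamp(lo1 + a2, lo2, hi2),
            clamp(hi1 + a2, lo2, hi2))

def apply_step(f, y):
    """Evaluate the function encoded by triple f at y."""
    a, lo, hi = f
    return clamp(y + a, lo, hi)

def sat_accumulate_serial(xs, lo, hi, y0=0):
    """Reference: the usual one-input-per-step feedback loop."""
    y = y0
    for x in xs:
        y = clamp(y + x, lo, hi)
    return y

def sat_accumulate_tree(xs, lo, hi, y0=0):
    """Same result, but adjacent steps are merged pairwise (tree order)."""
    steps = [(x, lo, hi) for x in xs]
    while len(steps) > 1:
        merged = [compose(steps[i], steps[i + 1])
                  for i in range(0, len(steps) - 1, 2)]
        if len(steps) % 2:          # odd count: carry the last step forward
            merged.append(steps[-1])
        steps = merged
    return apply_step(steps[0], y0) if steps else y0

# 16-bit signed range, chosen to match the accumulator width in the abstract
LO, HI = -32768, 32767
samples = [30000, 10000, -5000, -40000, 20000]
print(sat_accumulate_serial(samples, LO, HI), sat_accumulate_tree(samples, LO, HI))
```

In software the tree version does more work than the loop; the point is only that each level of the tree is independent of the running sum, which is the property a pipelined or parallel-prefix design relies on.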

CiteSeerX

Caltech Authors

RTD based logic circuits using generalized threshold gates

Author: Avedillo de Juan María José
Pettenghi Roldán Héctor
Quintana Toledo José María
Publication date: 01/01/2008

Many logic circuit applications of Resonant Tunneling Diodes are based on the MOnostable-BIstable Logic Element (MOBILE). Threshold logic is a computational model widely used in the design of MOBILE circuits, i.e. these circuits are built from threshold gates (TGs). The MOBILE realization of generalized threshold gates is being investigated. Multi-Threshold Threshold Gates (MTTGs) have been proposed which further increase the functionality of the original TGs. Recently, we have proposed a novel MOBILE circuit topology derived from fundamental properties of threshold functions. This paper describes the design of n-bit adders using these novel MOBILE circuit topologies. A comparison with designs based on TGs and MTTGs is carried out, showing advantages in terms of speed, power-delay product, and device count.

Funding: Government of Spain TEC2007-67245; Junta de Andalucía EXC/2007/TIC-296

idUS. Depósito de Investigación Universidad de Sevilla

A general framework for efficient FPGA implementation of matrix product

Author: Amira A.
Bensaali F.
Sotudeh R.
Publication date: 01/01/2007

Original article can be found at: http://www.medjcn.com/ Copyright Softmotor Limited

Developers require high-performance systems for fast processing of computationally intensive applications. Reconfigurable hardware devices in the form of Field-Programmable Gate Arrays (FPGAs) have been proposed as viable system building blocks in the construction of high-performance systems at an economical price. Given the importance and the use of matrix algorithms in scientific computing applications, they seem ideal candidates to harness and exploit the advantages offered by FPGAs. In this paper, a system for generating matrix algorithm cores is described. The system provides a catalog of efficient user-customizable cores, designed for FPGA implementation, spanning three different matrix algorithm categories: (i) matrix operations, (ii) matrix transforms and (iii) matrix decomposition. The generated core can be either a general-purpose or an application-specific core. The methodology used in the design and implementation of two specific image processing application cores is presented. The first core is a fully pipelined matrix multiplier for colour space conversion based on distributed arithmetic principles, while the second is a parallel floating-point matrix multiplier designed for 3D affine transformations.

Peer reviewed
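
To make the "distributed arithmetic" idea mentioned in this abstract more concrete, here is a minimal software sketch of one row of a colour conversion computed that way: the constant coefficients are folded into a lookup table indexed by one bit from each input, and the result is built from shifted table reads instead of multiplications. This is a generic textbook-style illustration, not the core described in the paper. The function names and the integer luma weights `(77, 150, 29)` (roughly 0.299, 0.587, 0.114 scaled by 256) are assumptions chosen for the example.

```python
def build_da_lut(coeffs):
    """LUT[m] = sum of coeffs[k] for every bit k that is set in address m."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (m >> k) & 1)
            for m in range(1 << n)]

def da_dot(lut, xs, bits=8):
    """Inner product of the LUT's coefficients with unsigned `bits`-wide
    inputs, processing one bit plane of the inputs per iteration."""
    acc = 0
    for b in range(bits):
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k   # gather bit b of every input
        acc += lut[addr] << b             # weight the table entry by 2**b
    return acc

# Illustrative fixed-point luma weights (assumed values, scaled by 256)
LUMA = (77, 150, 29)
lut = build_da_lut(LUMA)

rgb = (200, 100, 50)
luma_fixed = da_dot(lut, rgb)             # still scaled by 256
direct = sum(c * x for c, x in zip(LUMA, rgb))
print(luma_fixed == direct, luma_fixed >> 8)
```

A full 3x3 colour space conversion would use one such table per output component; in hardware the per-bit iterations can be unrolled and registered, which is one way a design of this kind could be pipelined.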

University of Hertfordshire Research Archive