Pipeline-Based Power Reduction in FPGA Applications
This paper shows that temporal parallelism plays an important role in reducing power dissipation in FPGAs. Glitch propagation is blocked by the flip-flops or registers in the pipeline. Several multiplication structures are implemented on modern FPGAs, Stratix II and Virtex-4, and their results are compared with and without pipelining and hardware duplication.
Fast HUB Floating-point Adder for FPGA
Several previous publications have shown the area
and delay reduction when implementing real number computation
using HUB formats for both floating-point and fixed-point.
In this paper, we present a HUB floating-point adder for FPGA
which greatly improves the speed of previously proposed HUB
designs for these devices. Our architecture is based on the double
path technique which reduces the execution time since each
path works in parallel. We also deal with the implementation of
unbiased rounding in the proposed adder. Experimental results
are presented that demonstrate the merits of the new HUB adder for
FPGA.
TIN2016-80920-R, JA2012 P12-TIC-1692, JA2012 P12-TIC-147
Maximizing CNN Accelerator Efficiency Through Resource Partitioning
Convolutional neural networks (CNNs) are revolutionizing machine learning,
but they present significant computational challenges. Recently, many
FPGA-based accelerators have been proposed to improve the performance and
efficiency of CNNs. Current approaches construct a single processor that
computes the CNN layers one at a time; the processor is optimized to maximize
the throughput at which the collection of layers is computed. However, this
approach leads to inefficient designs because the same processor structure is
used to compute CNN layers of radically varying dimensions.
We present a new CNN accelerator paradigm and an accompanying automated
design methodology that partitions the available FPGA resources into multiple
processors, each of which is tailored for a different subset of the CNN
convolutional layers. Using the same FPGA resources as a single large
processor, multiple smaller specialized processors increase computational
efficiency and lead to a higher overall throughput. Our design methodology
achieves 3.8x higher throughput than the state-of-the-art approach when
evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more
recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x, respectively.
Pipelining Saturated Accumulation
Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many digital signal processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator that can operate at 280 MHz on a Xilinx Spartan-3 (XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM.
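The reformulation mentioned in this abstract can be illustrated with a small software sketch. It is not the paper's hardware design. It assumes one common way to make saturated addition associative: each input is treated as a function y -> clamp(y + add, lo, hi), and such functions are closed under composition. The names `compose`, `apply_clamp`, and `saturated_accumulate_prefix` are hypothetical, and the Hillis-Steele scan is chosen here only as a simple stand-in for a parallel-prefix network.

```python
from typing import List, Tuple

# A saturating step is modelled as (add, lo, hi): y -> clamp(y + add, lo, hi).
ClampFn = Tuple[int, int, int]


def clamp(value: int, lo: int, hi: int) -> int:
    """Limit value to the closed interval [lo, hi]."""
    return max(lo, min(value, hi))


def apply_clamp(f: ClampFn, y: int) -> int:
    """Evaluate the saturating step f on the accumulator value y."""
    add, lo, hi = f
    return clamp(y + add, lo, hi)


def compose(f: ClampFn, g: ClampFn) -> ClampFn:
    """Return h such that h(y) == g(f(y)); f is the earlier step."""
    a1, l1, h1 = f
    a2, l2, h2 = g
    return (a1 + a2, clamp(l1 + a2, l2, h2), clamp(h1 + a2, l2, h2))


def saturated_accumulate_sequential(xs: List[int], lo: int, hi: int,
                                    start: int = 0) -> List[int]:
    """Reference loop with the cyclic dependency y[i] = sat(y[i-1] + x[i])."""
    out, y = [], start
    for x in xs:
        y = clamp(y + x, lo, hi)
        out.append(y)
    return out


def saturated_accumulate_prefix(xs: List[int], lo: int, hi: int,
                                start: int = 0) -> List[int]:
    """Same result via a Hillis-Steele prefix scan over composed steps."""
    fs: List[ClampFn] = [(x, lo, hi) for x in xs]
    step = 1
    while step < len(fs):
        # Every position is updated independently of the others in this
        # round, which is what a parallel implementation would exploit.
        fs = [fs[i] if i < step else compose(fs[i - step], fs[i])
              for i in range(len(fs))]
        step *= 2
    return [apply_clamp(f, start) for f in fs]
```

With 8-bit signed limits, `saturated_accumulate_prefix([100, 100, -50, 200, -300, 20], -128, 127)` gives the same sequence as the sequential loop. Because `compose` is associative, the number of scan rounds grows logarithmically with the input length rather than linearly.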
RTD based logic circuits using generalized threshold gates
Many logic circuit applications of Resonant Tunneling
Diodes are based on the MOnostable-BIstable Logic Element
(MOBILE). Threshold logic is a computational model
widely used in the design of MOBILE circuits, i.e. these circuits
are built from threshold gates (TGs). The MOBILE realization
of generalized threshold gates is being investigated.
Multi-Threshold Threshold Gates (MTTGs) have been proposed
which further increase the functionality of the original TGs.
Recently, we have proposed a novel MOBILE circuit topology
obtained by fundamental properties of threshold functions. This
paper describes the design of n-bit adders using these novel
MOBILE circuit topologies. A comparison with designs based
on TGs and MTTGs is carried out, showing advantages in
terms of speed, power-delay product, and device count.
España, Gobierno TEC2007-67245; Junta de Andalucía EXC/2007/TIC-296
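For readers unfamiliar with threshold logic, the sketch below shows the computational model only, not the MOBILE circuit topology proposed in the paper. A threshold gate outputs 1 when the weighted sum of its inputs reaches its threshold. The full-adder decomposition used here (a majority gate for the carry, and a second gate with a weight of -2 on the carry for the sum) is a standard textbook construction, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def threshold_gate(inputs: Sequence[int], weights: Sequence[int],
                   threshold: int) -> int:
    """Output 1 if the weighted sum of the inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return int(total >= threshold)


def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """One-bit full adder built from two threshold gates."""
    # Carry is the majority of the three inputs: a + b + cin >= 2.
    carry = threshold_gate((a, b, cin), (1, 1, 1), 2)
    # Sum subtracts twice the carry: a + b + cin - 2*carry >= 1.
    s = threshold_gate((a, b, cin, carry), (1, 1, 1, -2), 1)
    return s, carry


def ripple_add(x: int, y: int, n_bits: int) -> Tuple[int, int]:
    """Add two n-bit unsigned values; return (n-bit sum, carry out)."""
    result, carry = 0, 0
    for i in range(n_bits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```

For example, `ripple_add(9, 7, 4)` wraps around to a 4-bit sum of 0 with a carry out of 1.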
A general framework for efficient FPGA implementation of matrix product
Original article can be found at: http://www.medjcn.com/ Copyright Softmotor Limited
High performance systems are required by developers for fast processing of computationally intensive applications. Reconfigurable hardware devices in the form of Field-Programmable Gate Arrays (FPGAs) have been proposed as viable system building blocks in the construction of high performance systems at an economical price. Given the importance and widespread use of matrix algorithms in scientific computing applications, they seem ideal candidates to exploit the advantages offered by FPGAs. In this paper, a system for generating matrix algorithm cores is described. The system provides a catalog of efficient user-customizable cores, designed for FPGA implementation, spanning three matrix algorithm categories: (i) matrix operations, (ii) matrix transforms and (iii) matrix decomposition. The generated core can be either a general-purpose or an application-specific core. The methodology used in the design and implementation of two specific image processing application cores is presented. The first core is a fully pipelined matrix multiplier for colour space conversion based on distributed arithmetic principles, while the second is a parallel floating-point matrix multiplier designed for 3D affine transformations.
Peer reviewed