2,154 research outputs found
Efficiency analysis methodology of FPGAs based on lost frequencies, area and cycles
We propose a methodology to study and to quantify efficiency and the impact of overheads on runtime performance. Most work on High-Performance Computing (HPC) for FPGAs only studies runtime performance or cost, while we are interested in how far we are from peak performance and, more importantly, why. The efficiency of runtime performance is defined with respect to the ideal computational runtime in absence of inefficiencies. The analysis of the difference between actual and ideal runtime reveals the overheads and bottlenecks. A formal approach is proposed to decompose the efficiency into three components: frequency, area and cycles. After quantification of the efficiencies, a detailed analysis has to reveal the reasons for the lost frequencies, lost area and lost cycles. We propose a taxonomy of possible causes and practical methods to identify and quantify the overheads. The proposed methodology is applied on a number of use cases to illustrate the methodology. We show the interaction between the three components of efficiency and show how bottlenecks are revealed
Efficient implementation of 90 degrees phase shifter in FPGA
In this article, we present an efficient way of implementing 90 phase shifter using Hilbert transformer with canonic signed digit (CSD) coefficients in FPGA. It is implemented using 27-tap symmetric finite impulse response (FIR) filter. Representing the filter coefficients by CSD eliminates the need for multipliers and the filter is implemented using shifters and adders/subtractors. The simulated results for the frequency response of the Hilbert transformer with infinite precision coefficients and CSD coefficients agree with each other. The proposed architecture requires less hardware as one adder is saved for the realization of every negative coefficient compared to convectional CSD FIR filter implementation. Also, it offers a high accuracy of phase shift
Combined Integer and Floating Point Multiplication Architecture(CIFM) for FPGAs and Its Reversible Logic Implementation
In this paper, the authors propose the idea of a combined integer and
floating point multiplier(CIFM) for FPGAs. The authors propose the replacement
of existing 18x18 dedicated multipliers in FPGAs with dedicated 24x24
multipliers designed with small 4x4 bit multipliers. It is also proposed that
for every dedicated 24x24 bit multiplier block designed with 4x4 bit
multipliers, four redundant 4x4 multiplier should be provided to enforce the
feature of self repairability (to recover from the faults). In the proposed
CIFM reconfigurability at run time is also provided resulting in low power. The
major source of motivation for providing the dedicated 24x24 bit multiplier
stems from the fact that single precision floating point multiplier requires
24x24 bit integer multiplier for mantissa multiplication. A reconfigurable,
self-repairable 24x24 bit multiplier (implemented with 4x4 bit multiply
modules) will ideally suit this purpose, making FPGAs more suitable for integer
as well floating point operations. A dedicated 4x4 bit multiplier is also
proposed in this paper. Moreover, in the recent years, reversible logic has
emerged as a promising technology having its applications in low power CMOS,
quantum computing, nanotechnology, and optical computing. It is not possible to
realize quantum computing without reversible logic. Thus, this paper also paper
provides the reversible logic implementation of the proposed CIFM. The
reversible CIFM designed and proposed here will form the basis of the
completely reversible FPGAs.Comment: Published in the proceedings of the The 49th IEEE International
Midwest Symposium on Circuits and Systems (MWSCAS 2006), Puerto Rico, August
2006. Nominated for the Student Paper Award(12 papers are nominated for
Student paper Award among all submissions
Pipelining Saturated Accumulation
Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3(XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM
Fast HUB Floating-point Adder for FPGA
Several previous publications have shown the area
and delay reduction when implementing real number computation
using HUB formats for both floating-point and fixed-point.
In this paper, we present a HUB floating-point adder for FPGA
which greatly improves the speed of previous proposed HUB
designs for these devices. Our architecture is based on the double
path technique which reduces the execution time since each
path works in parallel. We also deal with the implementation of
unbiased rounding in the proposed adder. Experimental results
are presented showing the goodness of the new HUB adder for
FPGA.TIN2016- 80920-R, JA2012 P12-TIC-1692, JA2012 P12-TIC-147
Parametric, Secure and Compact Implementation of RSA on FPGA
We present a fast, efficient, and parameterized modular multiplier and a secure exponentiation circuit especially intended for FPGAs on the low end of the price range. The design utilizes dedicated block multipliers as the main functional unit and Block-RAM as storage unit for the operands. The adopted design methodology allows adjusting the number of multipliers, the radix used in the multipliers, and number of words to meet the system requirements such as
available resources, precision and timing constraints. The architecture, based on the Montgomery modular multiplication algorithm, utilizes a pipelining technique that allows concurrent operation of hardwired multipliers. Our
design completes 1020-bit and 2040-bit modular multiplications in 7.62 ÎĽs and 27.0 ÎĽs, respectively. The multiplier uses a moderate amount of system resources while achieving the best area-time product in literature. 2040-bit modular exponentiation engine can easily fit into Xilinx Spartan-3E 500; moreover the exponentiation circuit withstands known side channel attacks
Square-rich fixed point polynomial evaluation on FPGAs
Polynomial evaluation is important across a wide range of application domains, so significant work has been done on accelerating its computation. The conventional algorithm, referred to as Horner's rule, involves the least number of steps but can lead to increased latency due to serial computation. Parallel evaluation algorithms such as Estrin's method have shorter latency than Horner's rule, but achieve this at the expense of large hardware overhead. This paper presents an efficient polynomial evaluation algorithm, which reforms the evaluation process to include an increased number of squaring steps. By using a squarer design that is more efficient than general multiplication, this can result in polynomial evaluation with a 57.9% latency reduction over Horner's rule and 14.6% over Estrin's method, while consuming less area than Horner's rule, when implemented on a Xilinx Virtex 6 FPGA. When applied in fixed point function evaluation, where precision requirements limit the rounding of operands, it still achieves a 52.4% performance gain compared to Horner's rule with only a 4% area overhead in evaluating 5th degree polynomials
Maximizing CNN Accelerator Efficiency Through Resource Partitioning
Convolutional neural networks (CNNs) are revolutionizing machine learning,
but they present significant computational challenges. Recently, many
FPGA-based accelerators have been proposed to improve the performance and
efficiency of CNNs. Current approaches construct a single processor that
computes the CNN layers one at a time; the processor is optimized to maximize
the throughput at which the collection of layers is computed. However, this
approach leads to inefficient designs because the same processor structure is
used to compute CNN layers of radically varying dimensions.
We present a new CNN accelerator paradigm and an accompanying automated
design methodology that partitions the available FPGA resources into multiple
processors, each of which is tailored for a different subset of the CNN
convolutional layers. Using the same FPGA resources as a single large
processor, multiple smaller specialized processors increase computational
efficiency and lead to a higher overall throughput. Our design methodology
achieves 3.8x higher throughput than the state-of-the-art approach on
evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more
recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x
- …