
    Pipelining Saturated Accumulation

    Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3 (XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM.
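    The key idea can be sketched in software (a minimal model, not the paper's FPGA circuit; the 16-bit bounds and function names below are illustrative assumptions): a saturated addition of a value a is the function x -> clamp(x + a, lo, hi), and such clamped-offset functions are closed under composition, so runs of them can be combined with an associative operator and evaluated by a parallel-prefix scan instead of a serial loop.

```python
from functools import reduce

LO, HI = -(1 << 15), (1 << 15) - 1      # assumed 16-bit signed saturation bounds

def clamp(x, lo=LO, hi=HI):
    return max(lo, min(hi, x))

def sat_add(a):
    # "Saturated add of a" represented as the triple (offset, lo, hi).
    return (a, LO, HI)

def compose(f, g):
    # g∘f (apply f, then g) is again a clamped offset, so the operator is
    # associative and suitable for parallel-prefix evaluation.
    a_f, lo_f, hi_f = f
    a_g, lo_g, hi_g = g
    return (a_f + a_g,
            clamp(lo_f + a_g, lo_g, hi_g),
            clamp(hi_f + a_g, lo_g, hi_g))

def apply_fn(f, x):
    a, lo, hi = f
    return clamp(x + a, lo, hi)

# Sequential saturated accumulation vs. the associative reformulation.
data = [30000, 10000, -50000, -40000, 25000]
acc = 0
for d in data:
    acc = clamp(acc + d)
combined = reduce(compose, (sat_add(d) for d in data))
assert apply_fn(combined, 0) == acc
```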

    Comparison of Scalable Montgomery Modular Multiplication Implementations Embedded in Reconfigurable Hardware

    This paper presents a comparison of possible approaches to an efficient implementation of Multiple-word radix-2 Montgomery Modular Multiplication (MM) on modern Field Programmable Gate Arrays (FPGAs). The hardware implementation of the MM coprocessor is fully scalable, which means it can be reused to generate long-precision results independently of the word length of the originally proposed coprocessor. The first of the analyzed implementations uses a data path based on the traditionally used redundant carry-save adders; the second exploits standard carry-propagate adders with fast carry-chain logic, which had not previously been applied in scalable designs. An embedded soft-core processor, the Altera NIOS, is employed as the control unit and as the platform for the purely software implementation. All implementations use the large embedded memory blocks available in recent FPGAs. Speed and logic-requirement comparisons are performed on the optimized software and combined hardware-software designs in Altera FPGAs. The issues of targeting a design specifically for an FPGA are considered, taking into account the underlying architecture imposed by the target FPGA technology. It is shown that the coprocessors based on carry-save adders and carry-propagate adders provide comparable results in constrained FPGA implementations, but in the case of carry-propagate logic the solution requires less embedded memory and offers additional implementation advantages presented in the paper.
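    For context, the radix-2 Montgomery iteration that such scalable multiple-word designs unroll over machine words can be written as a short software reference model (a sketch only; it is bit-serial rather than word-serial, and the parameters below are illustrative):

```python
def montgomery_mult_radix2(a, b, m, n):
    """Radix-2 Montgomery multiplication: returns a*b*2^(-n) mod m,
    assuming m is odd and 0 <= a, b < m < 2^n."""
    s = 0
    for i in range(n):
        s += ((a >> i) & 1) * b       # add b if bit i of a is set
        if s & 1:                     # q_i = s mod 2: add m to make s even
            s += m
        s >>= 1                       # exact division by the radix 2
    return s - m if s >= m else s

# Usage: map into the Montgomery domain (multiply by R = 2^n mod m),
# multiply there, then map back out.
m, n = 239, 8
R = (1 << n) % m
a, b = 123, 57
a_bar, b_bar = (a * R) % m, (b * R) % m
prod_bar = montgomery_mult_radix2(a_bar, b_bar, m, n)    # = a*b*R mod m
assert montgomery_mult_radix2(prod_bar, 1, m, n) == (a * b) % m
```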

    FPGA-Specific Arithmetic Optimizations of Short-Latency Adders

    Integer addition is a pervasive operation in FPGA designs. The need for fast wide adders grows with the demand for large precisions, as required, for example, for the implementation of IEEE-754 quadruple precision and elliptic-curve cryptography. The FPGA realization of fast and compact binary adders relies on hardware carry chains. These provide a natural implementation environment for the ripple-carry addition (RCA) scheme. As its latency grows linearly with the operand width, wide additions call for acceleration, which is quite reasonably achieved by addition schemes built from parallel RCA blocks. This study presents FPGA-specific arithmetic optimizations for the mapping of carry-select/increment adders targeting the hardware carry chains of modern FPGAs. Different trade-offs between latency and area are presented. The proposed architectures represent attractive alternatives to deeply pipelined RCA schemes.
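    The carry-select/increment principle behind these architectures can be modelled in a few lines of software (a behavioural sketch under assumed block and operand widths, not the FPGA mapping onto carry chains that the paper actually optimizes):

```python
def carry_select_add(a, b, width=64, block=16):
    """Behavioural model of a carry-select adder: the operands are split into
    `block`-bit ripple-carry (RCA) blocks; each block's sum is precomputed for
    carry-in 0 and carry-in 1, and the true carry then only selects between
    the two results instead of rippling through the whole block."""
    mask = (1 << block) - 1
    result, carry = 0, 0
    for i in range(0, width, block):
        a_blk = (a >> i) & mask
        b_blk = (b >> i) & mask
        # In hardware these two block adders run in parallel, before the carry arrives.
        sum0 = a_blk + b_blk           # assuming carry-in = 0
        sum1 = a_blk + b_blk + 1       # assuming carry-in = 1
        chosen = sum1 if carry else sum0
        result |= (chosen & mask) << i
        carry = chosen >> block        # selected carry-out feeds the next block
    return result, carry

# Quick check against plain addition (operand values are arbitrary examples).
a, b = 0xDEADBEEFCAFEF00D, 0x0123456789ABCDEF
s, cout = carry_select_add(a, b)
assert s == (a + b) & ((1 << 64) - 1) and cout == (a + b) >> 64
```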

    Pipeline-Based Power Reduction in FPGA Applications

    This paper shows how temporal parallelism plays an important role in reducing power dissipation in FPGAs. Glitch propagation is blocked by the flip-flops or registers in the pipeline. Several multiplication structures are implemented on modern FPGAs, Stratix II and Virtex-4, comparing their results with and without pipelining and hardware duplication.
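    The mechanism can be illustrated with a toy switching-activity model (purely illustrative; the gates, delays and waveforms below are invented, and real glitch power depends on the actual netlist and timing): unequal path delays make a combinational node toggle more than once per clock cycle, and a pipeline register, which samples only the settled value, keeps those extra toggles from propagating into downstream logic.

```python
def transitions(wave):
    # Number of signal toggles within one clock cycle (a proxy for dynamic power).
    return sum(x != y for x, y in zip(wave, wave[1:]))

def gate(f, *inputs, delay=1):
    # Fixed-delay gate model: output at step t reflects the inputs at step t - delay.
    n = len(inputs[0])
    return [f(*(w[max(0, t - delay)] for w in inputs)) for t in range(n)]

def register(wave):
    # A pipeline flip-flop forwards only the settled end-of-cycle value,
    # so the next stage sees at most one transition per cycle.
    return [wave[-1]] * len(wave)

xor = lambda x, y: x ^ y

# Two inputs that switch at different times within the same cycle.
T = 8
a = [0] * 2 + [1] * (T - 2)                    # switches early in the cycle
b = [1] * 5 + [0] * (T - 5)                    # switches late in the cycle
x = gate(xor, a, b)                            # toggles twice before settling
y_comb = gate(xor, x, a)                       # the glitch propagates further
y_pipe = gate(xor, register(x), register(a))   # register barrier blocks it

print(transitions(x), transitions(y_comb), transitions(y_pipe))   # prints: 2 4 0
```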

    Data path analysis for dynamic circuit specialisation

    Dynamic Circuit Specialisation (DCS) is a method that exploits the reconfigurability of modern FPGAs to allow the specialisation of FPGA circuits at run-time. Currently, it is only explored as part of register-transfer level (RTL) design. However, at the RTL, a large part of the design is already locked in, so maximally exploiting the opportunities of DCS could require a costly redesign. It would therefore be valuable to gain insight into the opportunities for DCS at a higher abstraction level. Moreover, the general trend in FPGA design is to work at higher abstraction levels and let tools translate the higher-level description to RTL. This paper presents the first profiler that, based on the high-level description of an application, estimates the benefits of an implementation using DCS. This allows a designer to determine much earlier in the design cycle whether or not DCS would be worthwhile. The high-level profiling methodology was implemented and tested on a set of PID designs.
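    The underlying idea can be illustrated with a hypothetical profiler sketch (invented for illustration; the paper's actual methodology, metrics and tool flow are not reproduced here): operations whose operands rarely change between iterations are the natural DCS candidates, since the data path can be specialised around those values and reconfigured only when they change.

```python
from collections import defaultdict

def profile_dcs_candidates(trace):
    """Hypothetical high-level DCS profiler sketch (not the paper's tool):
    for every (operation, operand) pair seen in an execution trace, measure
    how often the operand keeps its previous value.  Operands that are stable
    most of the time are candidates for specialising the data path around
    them, with reconfiguration only when they change."""
    last = {}
    calls = defaultdict(int)
    unchanged = defaultdict(int)
    for op_name, operand_name, value in trace:
        key = (op_name, operand_name)
        calls[key] += 1
        if key in last and last[key] == value:
            unchanged[key] += 1
        last[key] = value
    return {key: unchanged[key] / calls[key] for key in calls}

# Hypothetical trace of a PID-style loop: the gain changes rarely,
# the error input changes every iteration.
trace = []
for step in range(1000):
    kp = 3 if step < 900 else 4        # controller gain, retuned once
    err = step % 17                    # measurement error, changes every step
    trace.append(("mul", "kp", kp))
    trace.append(("mul", "error", err))

for key, stable in sorted(profile_dcs_candidates(trace).items()):
    print(key, f"operand stable {stable:.1%} of the time")
```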

    Design of approximate overclocked datapath

    Embedded applications can often demand stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantisation error into the design. In this thesis, we describe an alternative circuit design methodology for trading off accuracy, performance and silicon area. We compare two approaches that trade accuracy for performance. One is the traditional approach, in which the precision used in the datapath is limited to meet a target latency. The other is a proposed new approach which simply allows the datapath to operate without timing closure. We demonstrate analytically and experimentally that for many applications it is preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they cause less noise than quantisation errors. Furthermore, we show that conventional forms of computer arithmetic do not fail gracefully when pushed beyond the deterministic clocking region. In this thesis we take a fresh look at Online Arithmetic, originally proposed for digit-serial operation, and synthesise unrolled digit-parallel online arithmetic operators to allow for graceful degradation. We quantify the impact of timing violations on key arithmetic primitives, and show that substantial performance benefits can be obtained in comparison to binary arithmetic. Since timing errors are caused by long carry chains, with online arithmetic they appear in the least significant digits, causing less impact than in conventional implementations.
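    The long-carry-chain argument can be made concrete with a toy model (an illustration under assumed word length and timing budget, not the thesis's hardware experiments): if the clock period only lets a carry ripple through a bounded number of positions, a conventional binary adder occasionally latches a wrong value, and when it does the error lands in the more significant bits.

```python
import random

def rca_with_timing_budget(a, b, n_bits, max_chain):
    """Toy model of an overclocked ripple-carry adder: a carry that would have
    to ripple through more than `max_chain` positions within one clock period
    does not arrive in time, so a stale 0 is latched instead."""
    carry, chain_len, result = 0, 0, 0
    for i in range(n_bits):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        generate, propagate = ai & bi, ai ^ bi
        next_carry = generate | (carry & propagate)
        chain_len = chain_len + 1 if (carry and propagate) else (1 if generate else 0)
        carry = next_carry if chain_len <= max_chain else 0   # timing violation
    return result

# Violations are rare for random data, but when they occur the error is large,
# because it sits in the most significant bits of a conventional binary sum.
random.seed(0)
N, BITS, BUDGET = 100_000, 16, 8
errors = []
for _ in range(N):
    a, b = random.getrandbits(BITS), random.getrandbits(BITS)
    approx = rca_with_timing_budget(a, b, BITS, BUDGET)
    exact = (a + b) & ((1 << BITS) - 1)
    if approx != exact:
        errors.append(abs(approx - exact))
print(f"violation rate {len(errors)/N:.3%}, "
      f"mean error magnitude {sum(errors)/max(1, len(errors)):.0f}")
```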