2,106 research outputs found
High-Radix Formats for Enhancing Floating-Point FPGA Implementations
This article proposes a family of high-radix floating-point representation to efficiently deal with floating-point addition in FPGA devices with no native floating-point sup port. Since variable shifter implementation (required in any FP adder) has a very high
cost in FPGA, high-radix formats considerably reduce the number of possible shifts, decreasing the execution time and area highly. Although the high-radix format pro duces also a significant penalty in the implementation of multipliers, the experimental
results show that the adder improvement overweights the multiplication penalty for most of the practical and common cases (digital filters, matrix multiplications, etc.). We also provide the designer with guidelines on selecting a suitable radix as a funcition of the ratio between the number of additions and multiplications of the targeted algorithm. For applications with similar numbers of additions and multiplications, the high-radix version may be up to 26% faster and even having a wider dynamic range
and using higher number of significant bits. Furthermore, thanks to the proposed effi cient converters between the standard IEEE-754 format and our internal high-radix format, the cost of the input/output conversions in FPGA accelerators is negligible.This research has been partially funded by the Spanish Ministry of Science, Innovation and Universities through the projects PID2019-105396RBI00 and by Junta de Andalucía through P18-FR 3130. Funding Open Access funding provided thanks to the CRUE-CSIC. Funding for open access charge: Universidad de Málaga / CBU
Implementation of a Combined OFDM-Demodulation and WCDMA-Equalization Module
For a dual-mode baseband receiver for the OFDMWireless LAN andWCDMA standards, integration of the demodulation and equalization tasks on a dedicated hardware module has been investigated. For OFDM demodulation, an FFT algorithm based on cascaded twiddle factor decomposition has been selected. This type of algorithm combines high spatial and temporal regularity in the FFT data-flow graphs with a minimal number of computations. A frequency-domain algorithm based on a circulant channel approximation has been selected for WCDMA equalization. It has good performance, low hardware complexity and a low number of computations. Its main advantage is the reuse of the FFT kernel, which contributes to the integration of both tasks. The demodulation and equalization module has been described at the register transfer level with the in-house developed Arx language. The core of the module is a pipelined radix-23 butterfly combined with a complex multiplier and complex divider. The module has an area of 0.447 mm2 in 0.18 ¿m technology and a power consumption of 10.6 mW. The proposed module compares favorably with solutions reported in literature
Overview of Parallel Platforms for Common High Performance Computing
The paper deals with various parallel platforms used for high performance computing in the signal processing domain. More precisely, the methods exploiting the multicores central processing units such as message passing interface and OpenMP are taken into account. The properties of the programming methods are experimentally proved in the application of a fast Fourier transform and a discrete cosine transform and they are compared with the possibilities of MATLAB's built-in functions and Texas Instruments digital signal processors with very long instruction word architectures. New FFT and DCT implementations were proposed and tested. The implementation phase was compared with CPU based computing methods and with possibilities of the Texas Instruments digital signal processing library on C6747 floating-point DSPs. The optimal combination of computing methods in the signal processing domain and new, fast routines' implementation is proposed as well
Floating Point Square Root under HUB Format
Unit-Biased (HUB) is an emerging format based on
shifting the representation line of the binary numbers by half
unit in the last place. The HUB format is specially relevant
for computers where rounding to nearest is required because
it is performed simply by truncation. From a hardware point
of view, the circuits implementing this representation save both
area and time since rounding does not involve any carry propagation.
Designs to perform the four basic operations have been
proposed under HUB format recently. Nevertheless, the square
root operation has not been confronted yet. In this paper we
present an architecture to carry out the square root operation
under HUB format for floating point numbers. The results of
this work keep supporting the fact that the HUB representation
involves simpler hardware than its conventional counterpart for
computers requiring round-to-nearest mode.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tec
Workshop on Verification and Theorem Proving for Continuous Systems (NetCA Workshop 2005)
Oxford, UK, 26 August 200
VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing
The hardware implementation of deep neural networks (DNNs) has recently
received tremendous attention: many applications in fact require high-speed
operations that suit a hardware implementation. However, numerous elements and
complex interconnections are usually required, leading to a large area
occupation and copious power consumption. Stochastic computing has shown
promising results for low-power area-efficient hardware implementations, even
though existing stochastic algorithms require long streams that cause long
latencies. In this paper, we propose an integer form of stochastic computation
and introduce some elementary circuits. We then propose an efficient
implementation of a DNN based on integral stochastic computing. The proposed
architecture has been implemented on a Virtex7 FPGA, resulting in 45% and 62%
average reductions in area and latency compared to the best reported
architecture in literature. We also synthesize the circuits in a 65 nm CMOS
technology and we show that the proposed integral stochastic architecture
results in up to 21% reduction in energy consumption compared to the binary
radix implementation at the same misclassification rate. Due to fault-tolerant
nature of stochastic architectures, we also consider a quasi-synchronous
implementation which yields 33% reduction in energy consumption w.r.t. the
binary radix implementation without any compromise on performance.Comment: 11 pages, 12 figure
Normalizing or not normalizing? An open question for floating-point arithmetic in embedded systems
Emerging embedded applications lack of a specific standard when they require floating-point arithmetic. In this situation they use the IEEE-754 standard or ad hoc variations of it. However, this standard was not designed for this purpose. This paper aims to open a debate to define a new extension of the standard to cover embedded applications. In this work, we only focus on the impact of not performing normalization. We show how eliminating the condition of normalized numbers, implementation costs can be dramatically reduced, at the expense of a moderate loss of accuracy. Several architectures to implement addition and multiplication for non-normalized numbers are proposed and analyzed. We show that a combined architecture (adder-multiplier) can halve the area and power consumption of its counterpart IEEE-754 architecture. This saving comes at the cost of reducing an average of about 10 dBs the Signal-to-Noise Ratio for the tested algorithms. We think these results should encourage researchers to perform further investigation in this issue.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
- …