2,106 research outputs found

    High-Radix Formats for Enhancing Floating-Point FPGA Implementations

    Get PDF
    This article proposes a family of high-radix floating-point representation to efficiently deal with floating-point addition in FPGA devices with no native floating-point sup port. Since variable shifter implementation (required in any FP adder) has a very high cost in FPGA, high-radix formats considerably reduce the number of possible shifts, decreasing the execution time and area highly. Although the high-radix format pro duces also a significant penalty in the implementation of multipliers, the experimental results show that the adder improvement overweights the multiplication penalty for most of the practical and common cases (digital filters, matrix multiplications, etc.). We also provide the designer with guidelines on selecting a suitable radix as a funcition of the ratio between the number of additions and multiplications of the targeted algorithm. For applications with similar numbers of additions and multiplications, the high-radix version may be up to 26% faster and even having a wider dynamic range and using higher number of significant bits. Furthermore, thanks to the proposed effi cient converters between the standard IEEE-754 format and our internal high-radix format, the cost of the input/output conversions in FPGA accelerators is negligible.This research has been partially funded by the Spanish Ministry of Science, Innovation and Universities through the projects PID2019-105396RBI00 and by Junta de Andalucía through P18-FR 3130. Funding Open Access funding provided thanks to the CRUE-CSIC. Funding for open access charge: Universidad de Málaga / CBU

    Implementation of a Combined OFDM-Demodulation and WCDMA-Equalization Module

    Get PDF
    For a dual-mode baseband receiver for the OFDMWireless LAN andWCDMA standards, integration of the demodulation and equalization tasks on a dedicated hardware module has been investigated. For OFDM demodulation, an FFT algorithm based on cascaded twiddle factor decomposition has been selected. This type of algorithm combines high spatial and temporal regularity in the FFT data-flow graphs with a minimal number of computations. A frequency-domain algorithm based on a circulant channel approximation has been selected for WCDMA equalization. It has good performance, low hardware complexity and a low number of computations. Its main advantage is the reuse of the FFT kernel, which contributes to the integration of both tasks. The demodulation and equalization module has been described at the register transfer level with the in-house developed Arx language. The core of the module is a pipelined radix-23 butterfly combined with a complex multiplier and complex divider. The module has an area of 0.447 mm2 in 0.18 ¿m technology and a power consumption of 10.6 mW. The proposed module compares favorably with solutions reported in literature

    Overview of Parallel Platforms for Common High Performance Computing

    Get PDF
    The paper deals with various parallel platforms used for high performance computing in the signal processing domain. More precisely, the methods exploiting the multicores central processing units such as message passing interface and OpenMP are taken into account. The properties of the programming methods are experimentally proved in the application of a fast Fourier transform and a discrete cosine transform and they are compared with the possibilities of MATLAB's built-in functions and Texas Instruments digital signal processors with very long instruction word architectures. New FFT and DCT implementations were proposed and tested. The implementation phase was compared with CPU based computing methods and with possibilities of the Texas Instruments digital signal processing library on C6747 floating-point DSPs. The optimal combination of computing methods in the signal processing domain and new, fast routines' implementation is proposed as well

    Floating Point Square Root under HUB Format

    Get PDF
    Unit-Biased (HUB) is an emerging format based on shifting the representation line of the binary numbers by half unit in the last place. The HUB format is specially relevant for computers where rounding to nearest is required because it is performed simply by truncation. From a hardware point of view, the circuits implementing this representation save both area and time since rounding does not involve any carry propagation. Designs to perform the four basic operations have been proposed under HUB format recently. Nevertheless, the square root operation has not been confronted yet. In this paper we present an architecture to carry out the square root operation under HUB format for floating point numbers. The results of this work keep supporting the fact that the HUB representation involves simpler hardware than its conventional counterpart for computers requiring round-to-nearest mode.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tec

    VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing

    Full text link
    The hardware implementation of deep neural networks (DNNs) has recently received tremendous attention: many applications in fact require high-speed operations that suit a hardware implementation. However, numerous elements and complex interconnections are usually required, leading to a large area occupation and copious power consumption. Stochastic computing has shown promising results for low-power area-efficient hardware implementations, even though existing stochastic algorithms require long streams that cause long latencies. In this paper, we propose an integer form of stochastic computation and introduce some elementary circuits. We then propose an efficient implementation of a DNN based on integral stochastic computing. The proposed architecture has been implemented on a Virtex7 FPGA, resulting in 45% and 62% average reductions in area and latency compared to the best reported architecture in literature. We also synthesize the circuits in a 65 nm CMOS technology and we show that the proposed integral stochastic architecture results in up to 21% reduction in energy consumption compared to the binary radix implementation at the same misclassification rate. Due to fault-tolerant nature of stochastic architectures, we also consider a quasi-synchronous implementation which yields 33% reduction in energy consumption w.r.t. the binary radix implementation without any compromise on performance.Comment: 11 pages, 12 figure

    Normalizing or not normalizing? An open question for floating-point arithmetic in embedded systems

    Get PDF
    Emerging embedded applications lack of a specific standard when they require floating-point arithmetic. In this situation they use the IEEE-754 standard or ad hoc variations of it. However, this standard was not designed for this purpose. This paper aims to open a debate to define a new extension of the standard to cover embedded applications. In this work, we only focus on the impact of not performing normalization. We show how eliminating the condition of normalized numbers, implementation costs can be dramatically reduced, at the expense of a moderate loss of accuracy. Several architectures to implement addition and multiplication for non-normalized numbers are proposed and analyzed. We show that a combined architecture (adder-multiplier) can halve the area and power consumption of its counterpart IEEE-754 architecture. This saving comes at the cost of reducing an average of about 10 dBs the Signal-to-Noise Ratio for the tested algorithms. We think these results should encourage researchers to perform further investigation in this issue.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
    corecore