2,441 research outputs found

    Pipelining Of Double Precision Floating Point Division And Square Root Operations On Field-programmable Gate Arrays

    Get PDF
    Many space applications, such as vision-based systems, synthetic aperture radar, and radar altimetry rely increasingly on high data rate DSP algorithms. These algorithms use double precision floating point arithmetic operations. While most DSP applications can be executed on DSP processors, the DSP numerical requirements of these new space applications surpass by far the numerical capabilities of many current DSP processors. Since the tradition in DSP processing has been to use fixed point number representation, only recently have DSP processors begun to incorporate floating point arithmetic units, even though most of these units handle only single precision floating point addition/subtraction, multiplication, and occasionally division. While DSP processors are slowly evolving to meet the numerical requirements of newer space applications, FPGA densities have rapidly increased to parallel and surpass even the gate densities of many DSP processors and commodity CPUs. This makes them attractive platforms to implement compute-intensive DSP computations. Even in the presence of this clear advantage on the side of FPGAs, few attempts have been made to examine how wide precision floating point arithmetic, particularly division and square root operations, can perform on FPGAs to support these compute-intensive DSP applications. In this context, this thesis presents the sequential and pipelined designs of IEEE-754 compliant double floating point division and square root operations based on low radix digit recurrence algorithms. FPGA implementations of these algorithms have the advantage of being easily testable. In particular, the pipelined designs are synthesized based on careful partial and full unrolling of the iterations in the digit recurrence algorithms. In the overall, the implementations of the sequential and pipelined designs are common-denominator implementations which do not use any performance-enhancing embedded components such as multipliers and block memory. As these implementations exploit exclusively the fine-grain reconfigurable resources of Virtex FPGAs, they are easily portable to other FPGAs with similar reconfigurable fabrics without any major modifications. The pipelined designs of these two operations are evaluated in terms of area, throughput, and dynamic power consumption as a function of pipeline depth. Pipelining experiments reveal that the area overhead tends to remain constant regardless of the degree of pipelining to which the design is submitted, while the throughput increases with pipeline depth. In addition, these experiments reveal that pipelining reduces power considerably in shallow pipelines. Pipelining further these designs does not necessarily lead to significant power reduction. By partitioning these designs into deeper pipelines, these designs can reach throughputs close to the 100 MFLOPS mark by consuming a modest 1% to 8% of the reconfigurable fabric within a Virtex-II XC2VX000 (e.g., XC2V1000 or XC2V6000) FPGA

    Mapping DSP algorithms to a reconfigurable architecture Adaptive Wireless Networking (AWGN)

    Get PDF
    This report will discuss the Adaptive Wireless Networking project. The vision of the Adaptive Wireless Networking project will be given. The strategy of the project will be the implementation of multiple communication systems in dynamically reconfigurable heterogeneous hardware. An overview of a wireless LAN communication system, namely HiperLAN/2, and a Bluetooth communication system will be given. Possible implementations of these systems in a dynamically reconfigurable architecture are discussed. Suggestions for future activities in the Adaptive Wireless Networking project are also given

    Achieving High Speed CFD simulations: Optimization, Parallelization, and FPGA Acceleration for the unstructured DLR TAU Code

    Get PDF
    Today, large scale parallel simulations are fundamental tools to handle complex problems. The number of processors in current computation platforms has been recently increased and therefore it is necessary to optimize the application performance and to enhance the scalability of massively-parallel systems. In addition, new heterogeneous architectures, combining conventional processors with specific hardware, like FPGAs, to accelerate the most time consuming functions are considered as a strong alternative to boost the performance. In this paper, the performance of the DLR TAU code is analyzed and optimized. The improvement of the code efficiency is addressed through three key activities: Optimization, parallelization and hardware acceleration. At first, a profiling analysis of the most time-consuming processes of the Reynolds Averaged Navier Stokes flow solver on a three-dimensional unstructured mesh is performed. Then, a study of the code scalability with new partitioning algorithms are tested to show the most suitable partitioning algorithms for the selected applications. Finally, a feasibility study on the application of FPGAs and GPUs for the hardware acceleration of CFD simulations is presented

    Computer Architectures to Close the Loop in Real-time Optimization

    Get PDF
    © 2015 IEEE.Many modern control, automation, signal processing and machine learning applications rely on solving a sequence of optimization problems, which are updated with measurements of a real system that evolves in time. The solutions of each of these optimization problems are then used to make decisions, which may be followed by changing some parameters of the physical system, thereby resulting in a feedback loop between the computing and the physical system. Real-time optimization is not the same as fast optimization, due to the fact that the computation is affected by an uncertain system that evolves in time. The suitability of a design should therefore not be judged from the optimality of a single optimization problem, but based on the evolution of the entire cyber-physical system. The algorithms and hardware used for solving a single optimization problem in the office might therefore be far from ideal when solving a sequence of real-time optimization problems. Instead of there being a single, optimal design, one has to trade-off a number of objectives, including performance, robustness, energy usage, size and cost. We therefore provide here a tutorial introduction to some of the questions and implementation issues that arise in real-time optimization applications. We will concentrate on some of the decisions that have to be made when designing the computing architecture and algorithm and argue that the choice of one informs the other

    Efficient digital implementation of a multi-precision square-root algorithm

    Get PDF
    In high performance computing systems and signal processing, there is a basic set of mathematical functions that are essential. While addition, subtraction and multiplication are well understood, there is less literature on square-rooting, which is a particularly time- and resource-consuming function. Traditional non-restoring algorithms produce a mantissa half the length ofthe input mantissa, causing a loss of precision. This study presents a method for increasing the accuracy of this algorithm. It is shown to work for all IEEE-754R standard floating-point numbers. Error analysis shows a 57-fold (for half-precision) and 134e6-fold improvement (for double-precision) in the normalised error, equivalent to at most 1 Units of Least Precision. Resource and performance optimised variants are analysed and their throughput analysed. On an Intel Stratix V device, performance optimised implementations achieve a throughput of 717 MFLOPs. Resource optimised implementations on a low-cost device require only 127 Adaptive Logic Modules and 232 registers, with a throughput of 8.56 MFLOPs. All implementations are DSPblock and memory free, saving valuable resources. The maximum throughput of the presented design is 15.5 times greater than that proposed by Pimentel et al. and two orders of magnitude greater than typical multiply-accumulate methods

    Design and Implementation of an Universal Lattice Decoder on FPGA

    Get PDF
    In wireless communication, MIMO (multiple input multiple output) is one of the promising technologies which improves the range and performance of transmission without increasing the bandwidth, while providing high rates. High speed hardware MIMO decoders are one of the keys to apply this technology in applications. In order to support the high data rates, the underlying hardware must have significant processing capabilities. FPGA improves the speed of signal processing using parallelism and reconfigurability advantages. The objective of this thesis is to develop an efficient hardware architectural model for the universal lattice decoder and prototype it on FPGA. The original algorithm is modified to ensure the high data rate via taking the advantage of FPGA features. The simulation results of software, hardware are verified and the BER performance of both the algorithms is estimated. The system prototype of the decoder with 4-transmit and 4-receive antennas using a 4-PAM (Pulse amplitude modulation) supports 6.32 Mbit/s data rate for parallelpipeline implementation on FPGA platform, which is about two orders of magnitude faster than its DSP implementation

    Parallel Image Gradient Extraction Core For FPGA-based Smart Cameras

    Get PDF
    International audienceOne of the biggest efforts in designing pervasive Smart Camera Networks (SCNs) is the implementation of complex and computationally intensive computer vision algorithms on resource constrained embedded devices. For low-level processing FPGA devices are excellent candidates because they support massive and fine grain data parallelism with high data throughput. However, if FPGAs offers a way to meet the stringent constraints of real-time execution, their exploitation often require significant algorithmic reformulations. In this paper, we propose a reformulation of a kernel-based gradient computation module specially suited to FPGA implementations. This resulting algorithm operates on-the-fly, without the need of video buffers and delivers a constant throughput. It has been tested and used as the first stage of an application performing extraction of Histograms of Oriented Gradients (HOG). Evaluation shows that its performance and low memory requirement perfectly matches low cost and memory constrained embedded devices
    corecore