
    Pipelining Of Double Precision Floating Point Division And Square Root Operations On Field-programmable Gate Arrays

    Many space applications, such as vision-based systems, synthetic aperture radar, and radar altimetry, rely increasingly on high data rate DSP algorithms. These algorithms use double precision floating point arithmetic operations. While most DSP applications can be executed on DSP processors, the numerical requirements of these new space applications far surpass the numerical capabilities of many current DSP processors. Since the tradition in DSP processing has been to use fixed point number representation, only recently have DSP processors begun to incorporate floating point arithmetic units, and most of these units handle only single precision floating point addition/subtraction, multiplication, and occasionally division. While DSP processors are slowly evolving to meet the numerical requirements of newer space applications, FPGA densities have rapidly increased to parallel and surpass even the gate densities of many DSP processors and commodity CPUs. This makes them attractive platforms to implement compute-intensive DSP computations. Despite this clear advantage on the side of FPGAs, few attempts have been made to examine how wide precision floating point arithmetic, particularly division and square root operations, can perform on FPGAs to support these compute-intensive DSP applications. In this context, this thesis presents sequential and pipelined designs of IEEE-754 compliant double precision floating point division and square root operations based on low radix digit recurrence algorithms. FPGA implementations of these algorithms have the advantage of being easily testable. In particular, the pipelined designs are synthesized based on careful partial and full unrolling of the iterations in the digit recurrence algorithms.
Overall, the implementations of the sequential and pipelined designs are common-denominator implementations that do not use any performance-enhancing embedded components such as multipliers and block memory. Because these implementations exclusively exploit the fine-grain reconfigurable resources of Virtex FPGAs, they are easily portable to other FPGAs with similar reconfigurable fabrics without major modifications. The pipelined designs of these two operations are evaluated in terms of area, throughput, and dynamic power consumption as a function of pipeline depth. Pipelining experiments reveal that the area overhead tends to remain constant regardless of the degree of pipelining applied to the design, while the throughput increases with pipeline depth. In addition, these experiments reveal that pipelining reduces power considerably in shallow pipelines, whereas pipelining these designs further does not necessarily lead to significant power reduction. When partitioned into deeper pipelines, these designs can reach throughputs close to the 100 MFLOPS mark while consuming a modest 1% to 8% of the reconfigurable fabric within a Virtex-II XC2VX000 (e.g., XC2V1000 or XC2V6000) FPGA.
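
As a rough illustration of what a low radix digit recurrence looks like, the sketch below performs radix-2 restoring division on integer significands, producing one quotient bit per iteration. It is not the thesis's design: the function name `restoring_divide` and the `frac_bits` parameter are hypothetical, and exponent handling, normalization, and IEEE-754 rounding are left out. Each loop iteration is the kind of step that could be unrolled into its own pipeline stage, or grouped with neighbours for partial unrolling.

```python
def restoring_divide(x, d, frac_bits):
    """Radix-2 restoring division: returns (q, r) with
    q = floor(x * 2**frac_bits / d) and r the final remainder.

    Illustrative only: one quotient bit is retired per loop iteration.
    """
    if x < 0 or d <= 0:
        raise ValueError("expects x >= 0 and d > 0")
    n = x << frac_bits              # dividend scaled to get fractional bits
    q = 0                           # quotient built MSB first
    r = 0                           # partial remainder
    for i in reversed(range(n.bit_length())):
        r = (r << 1) | ((n >> i) & 1)   # shift in the next dividend bit
        trial = r - d                   # trial subtraction
        if trial >= 0:
            r = trial                   # keep the difference, quotient bit 1
            q = (q << 1) | 1
        else:
            q <<= 1                     # "restore" by discarding the trial
    return q, r


# Significands 1.5 and 1.0 as 53-bit integers with 52 fractional bits.
q, r = restoring_divide(3 << 51, 1 << 52, 52)
print(q / (1 << 52), r)
```

In hardware the trial subtraction and the selection between `r` and `trial` form one stage of logic, so the pipeline depth can be varied by choosing how many iterations sit between registers.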

    A System for Compressive Sensing Signal Reconstruction

    An architecture for hardware realization of a system for sparse signal reconstruction is presented. A threshold-based reconstruction method is considered and further modified in this paper to reduce system complexity and ease hardware realization. Instead of using the partial random Fourier transform matrix, the minimization problem is reformulated using only the triangular R matrix from the QR decomposition. The triangular R matrix can be efficiently implemented in hardware without calculating the orthogonal Q matrix. A flexible and scalable realization of matrix R is proposed, such that the size of R changes with the number of available samples and the sparsity level. Comment: 6 pages
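
The idea of obtaining R without ever forming Q can be sketched with Givens rotations, which zero the entries below the diagonal by rotating pairs of rows in place. This is only an illustration under stated assumptions: the paper's own realization is not shown in the abstract, the names `r_from_givens` and `gram` are hypothetical, and the sketch uses a small real-valued matrix rather than a partial Fourier matrix. It relies on the identity that, for A = QR with orthonormal Q, the Gram matrix of A equals the Gram matrix of R.

```python
import math


def r_from_givens(a):
    """Return the upper-triangular factor R of A = QR using Givens
    rotations applied in place, so Q is never stored."""
    r = [row[:] for row in a]
    m, n = len(r), len(r[0])
    for col in range(n):
        # Zero column `col` from the bottom row up to just below the diagonal.
        for row in range(m - 1, col, -1):
            x, y = r[row - 1][col], r[row][col]
            h = math.hypot(x, y)
            if h == 0.0:
                continue
            c, s = x / h, y / h
            r[row - 1][col], r[row][col] = h, 0.0
            for k in range(col + 1, n):
                u, v = r[row - 1][k], r[row][k]
                r[row - 1][k] = c * u + s * v
                r[row][k] = -s * u + c * v
    return r[:min(m, n)]            # rows below the triangle are all zero


def gram(mat):
    """Compute M^T M for a list-of-rows matrix."""
    rows, cols = len(mat), len(mat[0])
    return [[sum(mat[k][i] * mat[k][j] for k in range(rows))
             for j in range(cols)] for i in range(cols)]


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = r_from_givens(a)
print(gram(a))
print(gram(r))   # agrees with gram(a) up to rounding
```

Because each rotation touches only two rows, the same structure maps naturally onto a scalable array of identical cells, which is one common reason Givens-based QR is chosen for hardware.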

    Strategies for FPGA Implementation of Non-Restoring Square Root Algorithm

    This paper presents three strategies for implementing the non-restoring square root algorithm on an FPGA. The first strategy introduces a new basic building block, the controlled subtract-multiplex (CSM), and uses gate-level abstraction. Its main principle is similar to the conventional non-restoring algorithm, but it uses only the subtract operation with append 01; the add operation with append 11 is not used. The second strategy expresses the first strategy at the register transfer level (RTL) of abstraction. The third strategy presents a modified implementation of the conventional non-restoring algorithm, also at RTL abstraction. All three strategies are implemented in VHDL and adopt a fully pipelined architecture. The strategies have been implemented successfully in FPGA hardware, and each is efficient in its use of hardware resources. In general, the third strategy is superior. DOI: http://dx.doi.org/10.11591/ijece.v4i4.600
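
For orientation, the conventional non-restoring recurrence that the abstract contrasts against can be sketched as follows. This is a software model of the textbook algorithm, not the paper's VHDL; the function name `nonrestoring_isqrt` and the `bits` parameter are hypothetical. When the partial remainder is non-negative the step subtracts the root with 01 appended, and when it is negative the step adds the root with 11 appended, so no restoring step is needed inside the loop.

```python
def nonrestoring_isqrt(d, bits):
    """Integer square root by the conventional non-restoring recurrence.

    d    : radicand, 0 <= d < 2**bits
    bits : radicand width, must be even (two radicand bits per iteration)
    Returns (q, r) with q = floor(sqrt(d)) and r = d - q*q.
    """
    if bits % 2 or not 0 <= d < (1 << bits):
        raise ValueError("bits must be even and d must fit in bits")
    q = 0   # partial root
    r = 0   # signed partial remainder
    for i in reversed(range(bits // 2)):
        pair = (d >> (2 * i)) & 3             # next two radicand bits
        if r >= 0:
            r = (r << 2) + pair - ((q << 2) | 1)   # subtract, append 01
        else:
            r = (r << 2) + pair + ((q << 2) | 3)   # add, append 11
        q = (q << 1) | (1 if r >= 0 else 0)        # sign selects the root bit
    if r < 0:
        r += (q << 1) | 1                     # single final remainder fix-up
    return q, r


print(nonrestoring_isqrt(35, 6))
```

The root bits are already correct when the loop ends; only the remainder needs the one final correction, which is why the loop body stays a single add or subtract.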

    AN OPTIMIZED SQUARE ROOT ALGORITHM FOR IMPLEMENTATION IN FPGA HARDWARE

    This paper presents an optimized digit-by-digit calculation method for computing square roots in hardware, proposed as a simple algorithm for implementation in a field programmable gate array (FPGA). The main principle of the proposed method is two-bit shifting combined with subtract-multiplex operations, which gives a simpler implementation and faster calculation. The proposed algorithm has been used to implement FPGA-based unsigned 32-bit and 64-bit binary square roots successfully. The results show that the proposed method uses hardware resources more efficiently than the other methods compared. In addition, the strategy can easily be extended to larger numbers.
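
A digit-by-digit square root built from two-bit shifts and a subtract-then-select step can be modelled in a few lines. The sketch below is an assumption about the general shape of such a method, not a transcription of the paper's circuit; `subtract_mux_isqrt` and its `bits` parameter are hypothetical names. Each iteration shifts two radicand bits into the remainder, tries one subtraction, and a multiplexer-like choice keeps either the difference or the unchanged remainder.

```python
def subtract_mux_isqrt(d, bits=32):
    """Unsigned integer square root using only shift, subtract, and select.

    Returns (q, r) with q = floor(sqrt(d)) and r = d - q*q.
    """
    if bits % 2 or not 0 <= d < (1 << bits):
        raise ValueError("bits must be even and d must fit in bits")
    q = 0   # partial root
    r = 0   # partial remainder, never negative in this variant
    for i in reversed(range(bits // 2)):
        r = (r << 2) | ((d >> (2 * i)) & 3)   # two-bit shift of the radicand
        trial = r - ((q << 2) | 1)            # subtract root with 01 appended
        if trial >= 0:
            r = trial                         # mux selects the difference
            q = (q << 1) | 1
        else:
            q <<= 1                           # mux keeps the old remainder
    return q, r


print(subtract_mux_isqrt(1_000_000))              # 32-bit case
print(subtract_mux_isqrt((1 << 64) - 1, bits=64)) # 64-bit case
```

Widening the operand only changes the iteration count, which is consistent with the abstract's remark that the approach extends to larger numbers.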

    Design and Implementation of an Universal Lattice Decoder on FPGA

    In wireless communication, MIMO (multiple input multiple output) is a promising technology that improves the range and performance of transmission without increasing the bandwidth, while providing high data rates. High speed hardware MIMO decoders are one of the keys to applying this technology in practice. To support high data rates, the underlying hardware must have significant processing capabilities. FPGAs improve the speed of signal processing by exploiting parallelism and reconfigurability. The objective of this thesis is to develop an efficient hardware architectural model for the universal lattice decoder and prototype it on an FPGA. The original algorithm is modified to ensure a high data rate by taking advantage of FPGA features. The simulation results of the software and hardware implementations are verified, and the BER performance of both algorithms is estimated. The system prototype of the decoder with 4 transmit and 4 receive antennas using 4-PAM (pulse amplitude modulation) supports a 6.32 Mbit/s data rate for the parallel-pipeline implementation on the FPGA platform, which is about two orders of magnitude faster than its DSP implementation.

    Pipelined Implementation of a Fixed-Point Square Root Core Using Non-Restoring and Restoring Algorithm

    The arithmetic square root is one of the most complex but nevertheless widely used operations in modern computing. A primary reason for the complexity is the irrational nature of the square root of non-perfect squares and the iterative behavior required for square root computation. A typical RISC implementation of square root computation can take anywhere from 200 to 300 cycles. If the operation is used heavily, the run-time cost could justify a direct hardware implementation that achieves the same result in as little as 20 clock cycles. Additionally, the implementation is pipelined to achieve even greater throughput compared to an instruction-based implementation. The paper thus presents an efficient, pipelined implementation of a square root calculation core which implements a non-restoring algorithm for determining the square root. The iteration count of the algorithm depends on the maximum size of the input and the desired resolution. A specific case of a 16-bit integer square root calculator with output resolution 0.001 is considered, which requires a total of 18 iterations of the algorithm. In the implementation, each iteration is pipelined as a stage, resulting in an 18-stage pipelined square root computation core. The proposed algorithm uses standard arithmetic operations such as addition, subtraction, and shift, together with basic control statements, to determine the output of each stage. The core is verified using a SystemVerilog test-bench. The test-bench generates unconstrained random input stimulus and determines the expected value from the core device under test (DUT) by evaluating a Simulink-generated model for the same stimulus. Functional coverage, implemented in the test-bench, determines the reliability of the system and consequently the duration of the test execution.
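
The relationship between iteration count, latency, and throughput can be illustrated with a small cycle-level model. The sketch assumes one plausible reading of the 18 iterations: 8 root bits for the integer part of a 16-bit input plus 10 fractional bits, since 2^-10 is just under 0.001. That split, and the names `stage` and `run_pipeline`, are assumptions rather than details taken from the paper. Each stage performs one non-restoring iteration, and a register after every stage lets a new input enter on every clock.

```python
INT_BITS = 16                        # input width (from the abstract)
FRAC_BITS = 10                       # assumed: 2**-10 < 0.001
STAGES = INT_BITS // 2 + FRAC_BITS   # 8 + 10 = 18 iterations


def stage(k, state):
    """Stage k: one non-restoring iteration on (radicand, root, remainder)."""
    d, q, r = state
    pair = (d >> (2 * (STAGES - 1 - k))) & 3
    if r >= 0:
        r = (r << 2) + pair - ((q << 2) | 1)
    else:
        r = (r << 2) + pair + ((q << 2) | 3)
    q = (q << 1) | (1 if r >= 0 else 0)
    return d, q, r


def run_pipeline(inputs):
    """Feed one input per clock; return roots scaled by 2**FRAC_BITS."""
    regs = [None] * STAGES           # regs[k] holds the output of stage k
    outputs = []
    for x in list(inputs) + [None] * STAGES:    # extra clocks flush the pipe
        if regs[-1] is not None:
            outputs.append(regs[-1][1])         # result leaves the last stage
        for k in range(STAGES - 1, 0, -1):      # advance back to front
            regs[k] = stage(k, regs[k - 1]) if regs[k - 1] is not None else None
        regs[0] = stage(0, (x << (2 * FRAC_BITS), 0, 0)) if x is not None else None
    return outputs


for x, q in zip([2, 4, 65535], run_pipeline([2, 4, 65535])):
    print(x, q / (1 << FRAC_BITS))
```

In this model the first result appears 18 clocks after its input enters, and subsequent results follow one per clock, which is the throughput benefit the abstract attributes to pipelining.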

    RESOURCE EFFICIENT DESIGN OF QUANTUM CIRCUITS FOR CRYPTANALYSIS AND SCIENTIFIC COMPUTING APPLICATIONS

    Quantum computers offer the potential to extend our abilities to tackle computational problems in fields such as number theory, encryption, search and scientific computation. Up to a superpolynomial speedup has been reported for quantum algorithms in these areas. Motivated by the promise of faster computations, the development of quantum machines has caught the attention of both academic and industry researchers. Quantum machines are now at sizes where implementations of quantum algorithms or their components are becoming possible. In order to implement quantum algorithms on quantum machines, resource efficient circuits and functional blocks must be designed. In this work, we propose quantum circuits for Galois and integer arithmetic. These quantum circuits are necessary building blocks to realize quantum algorithms. The design of resource efficient quantum circuits requires that the designer take into account the gate cost, quantum bit (qubit) cost, depth and garbage outputs of a quantum circuit. Existing quantum machines do not have many qubits, meaning that circuits with high qubit cost cannot be implemented. In addition, quantum circuits are more prone to errors, and garbage output removal adds to overall cost. As more gates are used, a quantum circuit sees an increased rate of failure. Failures and error rates can be countered by using quantum error correcting codes and fault tolerant implementations of universal gate sets (such as Clifford+T gates). However, Clifford+T gates are costly to implement, with the T gate being significantly more costly than the Clifford gates. As a result, designers working with Clifford+T gates seek to minimize the number of T gates (T-count) and the depth of T gates (T-depth). In this work, we propose quantum circuits for Galois and integer arithmetic with lower T-count, T-depth and qubit cost than existing work.
This work presents novel quantum circuits for squaring and exponentiation over binary extension fields (Galois fields of the form GF(2^m)). The proposed circuits are shown to have lower depth, qubit and gate cost than existing work. We also present quantum circuits for the core operations of multiplication and division which enjoy lower T-count, T-depth and qubit costs compared to existing work. This work also illustrates a T-count and qubit cost efficient design for the square root. This work concludes with an illustration of how the arithmetic circuits can be combined into a functional block to implement quantum image processing algorithms.

    Neuromorphic deep convolutional neural network learning systems for FPGA in real time

    Deep Learning algorithms have become one of the best approaches for pattern recognition in several fields, including computer vision, speech recognition, natural language processing, and audio recognition, among others. In computer vision, convolutional neural networks stand out because of their relatively simple supervised training and their efficiency in extracting features from a scene. Several convolutional neural network accelerators now exist that run these networks in real time. However, the number of operations and the power consumption of these implementations can be reduced by using a different processing paradigm, such as neuromorphic engineering. The field of neuromorphic engineering studies the behavior of biological neural processing systems in order to design analog, digital, or mixed-signal systems that solve problems in ways inspired by how the human brain performs complex tasks, replicating the behavior and properties of biological neurons. Neuromorphic engineering tries to explain how the brain is capable of learning and performing complex tasks with high efficiency under the paradigm of spike-based computation. This thesis explores both frame-based and spike-based processing paradigms for the development of hardware architectures for visual pattern recognition based on convolutional neural networks. Two FPGA implementations of frame-based convolutional neural network accelerator architectures, using OpenCL and SoC technologies, are presented. These are followed by a novel neuromorphic convolution processor for the spike-based processing paradigm, which implements the behaviour of the leaky integrate-and-fire neuron model. Furthermore, it reads the data in rows and can compute multiple layers on the same chip. Finally, a novel FPGA implementation of the Hierarchy of Time Surfaces algorithm and a new memory model for spike-based systems are proposed.

    20-GFLOPS QR processor on a Xilinx Virtex-E FPGA
