1,104 research outputs found

    Architectural choices for the Columbia 0.8 Teraflops machine

    We discuss the hardware design choices made in our 16K-node 0.8 Teraflops supercomputer project, a machine architecture optimized for full QCD calculations. We address the efficiency of the conjugate gradient algorithm in terms of the balance of floating-point operations, memory handling and utilization, and communication overhead. We also discuss the technological innovations and software tools that facilitate hardware design, and the opportunities these give to the academic community. Comment: Contribution to Lattice 94. 3 pages. LaTeX source followed by compressed, uuencoded PostScript file of the complete paper.

    RTL fast convolution using the Mersenne number transform

    VHDL is a versatile high-level language for the specification and simulation of hardware components. Here a functional VHDL model is presented for performing fast convolution based on Mersenne's number theoretic transform. To filter a rather long input sequence x(n), we can decompose it into a number of short segments, each of which can be processed individually. The output y(n) then becomes a combination of partial convolutions, by the superposition principle for linear operators. Each partial convolution can be computed using the Discrete Fourier Transform (DFT), implemented with a Fast Fourier Transform (FFT) algorithm. This DFT approach is the most popular. In this paper we use the Mersenne Number Transform (MNT) as an alternative to the DFT in the framework of a register transfer level (RTL) implementation of the filter operation. Even though the MNT does not have a fast algorithm, it can be seen that RTL is the natural level of abstraction for the implementation of the MNT. This work is conceived as part of an academic exercise in the use of VHDL for modeling a DSP algorithm all the way from the mathematical specification to the circuit implementation. Track: Distributed and parallel processing. Signal processing. Red de Universidades con Carreras en Informática (RedUNCI)
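
The segment-and-combine scheme in this abstract can be sketched in software. The following is a minimal Python illustration, not the paper's VHDL model: it assumes the Mersenne exponent p = 13 (modulus 2^13 - 1 = 8191, transform length 13, root 2), non-negative integer samples whose convolution results stay below the modulus, and hypothetical function names (`mnt`, `cyclic_convolve`, `overlap_add`). The transform is evaluated directly in O(N^2), consistent with the remark that the MNT has no fast algorithm.

```python
P = 13                # assumed Mersenne exponent; 2**13 - 1 = 8191 is prime
M = (1 << P) - 1      # modulus of the transform
N = P                 # transform length: 2 has multiplicative order P modulo M


def mnt(x, inverse=False):
    """Direct O(N^2) Mersenne number transform of a length-N sequence."""
    # 2 * 2**(P-1) = 2**P = 1 (mod M), so 2**(P-1) is the inverse of the root 2.
    root = pow(2, P - 1, M) if inverse else 2
    out = [sum(x[n] * pow(root, n * k, M) for n in range(N)) % M
           for k in range(N)]
    if inverse:
        n_inv = pow(N, M - 2, M)  # Fermat inverse of N, valid because M is prime
        out = [(v * n_inv) % M for v in out]
    return out


def cyclic_convolve(a, b):
    """Length-N cyclic convolution via pointwise products in the MNT domain."""
    spectrum = [(u * v) % M for u, v in zip(mnt(a), mnt(b))]
    return mnt(spectrum, inverse=True)


def overlap_add(x, h, seg_len):
    """Filter x with h by summing partial convolutions of short segments."""
    if seg_len + len(h) - 1 > N:
        raise ValueError("segment plus filter must fit in one transform")
    y = [0] * (len(x) + len(h) - 1)
    h_pad = h + [0] * (N - len(h))
    for start in range(0, len(x), seg_len):
        seg = x[start:start + seg_len]
        part = cyclic_convolve(seg + [0] * (N - len(seg)), h_pad)
        # Superposition: add this partial result at the segment's offset.
        for i in range(len(seg) + len(h) - 1):
            y[start + i] += part[i]
    return y
```

Because all arithmetic is modulo 2^p - 1, the result is exact only while every output sample is smaller than the modulus; a larger exponent would be needed for a wider dynamic range.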

    A low cost reconfigurable soft processor for multimedia applications: design synthesis and programming model

    This paper presents an FPGA implementation of a low-cost 8-bit reconfigurable processor core for media processing applications. The core is optimized to provide all basic arithmetic and logic functions required by media processing and other domains, and to be easily integrable into a 2D array. The paper investigates the feasibility of the core as a potential soft processing architecture for FPGA platforms. The core was synthesized on the entire Virtex FPGA family to evaluate its overall performance, scalability and portability. A special feature of the proposed architecture is its simple programming model, which allows low-level programming. Throughput results for popular benchmarks coded using the programming model and a cycle-accurate simulator are presented.

    Simulink modeling and design of an efficient hardware-constrained FPGA-based PMSM speed controller

    The aim of this paper is to present a holistic approach to modeling and FPGA implementation of a permanent magnet synchronous motor (PMSM) speed controller. The whole system is modeled in the Matlab Simulink environment. The controller is then translated to discrete time and remodeled using System Generator blocks, which are directly synthesizable into FPGA hardware. The algorithm is further refined and factorized to take hardware constraints into account, so that it fits into a low-cost FPGA without significantly increasing the execution time. The resulting controller is then integrated with sensor interfaces and analysis tools and implemented in an FPGA device. Experimental results validate the controller and verify the design.

    FPGA Implementation of an Adaptive Noise Canceller for Robust Speech Enhancement Interfaces

    This paper describes the design and implementation results of an adaptive Noise Canceller useful for the construction of Robust Speech Enhancement Interfaces. The algorithm used performs very well in real-time applications. Its main disadvantage is that it requires several division operations, which have a high computational cost. In addition, the accuracy of the algorithm is critical in fixed-point representation because of the wide range between the upper and lower bounds of the variables involved. To solve this problem, the accuracy is studied and, according to the results obtained, a specific word-length is adopted for each variable. The algorithm has been implemented for Altera and Xilinx FPGAs using high-level synthesis tools. The results for a fixed format of 40 bits for all the variables and for a specific word-length for each variable are analyzed and discussed.

    Pipelining Of Double Precision Floating Point Division And Square Root Operations On Field-programmable Gate Arrays

    Many space applications, such as vision-based systems, synthetic aperture radar, and radar altimetry, rely increasingly on high data rate DSP algorithms. These algorithms use double precision floating point arithmetic operations. While most DSP applications can be executed on DSP processors, the numerical requirements of these new space applications far surpass the numerical capabilities of many current DSP processors. Since the tradition in DSP processing has been to use fixed point number representation, only recently have DSP processors begun to incorporate floating point arithmetic units, and most of these units handle only single precision floating point addition/subtraction, multiplication, and occasionally division. While DSP processors are slowly evolving to meet the numerical requirements of newer space applications, FPGA densities have rapidly increased to parallel and surpass even the gate densities of many DSP processors and commodity CPUs. This makes them attractive platforms for implementing compute-intensive DSP computations. Despite this clear advantage on the side of FPGAs, few attempts have been made to examine how wide precision floating point arithmetic, particularly division and square root operations, can perform on FPGAs to support these compute-intensive DSP applications. In this context, this thesis presents the sequential and pipelined designs of IEEE-754 compliant double precision floating point division and square root operations based on low radix digit recurrence algorithms. FPGA implementations of these algorithms have the advantage of being easily testable. In particular, the pipelined designs are synthesized based on careful partial and full unrolling of the iterations in the digit recurrence algorithms.
    Overall, the implementations of the sequential and pipelined designs are common-denominator implementations which do not use any performance-enhancing embedded components such as multipliers and block memory. As these implementations exploit exclusively the fine-grain reconfigurable resources of Virtex FPGAs, they are easily portable to other FPGAs with similar reconfigurable fabrics without any major modifications. The pipelined designs of these two operations are evaluated in terms of area, throughput, and dynamic power consumption as a function of pipeline depth. Pipelining experiments reveal that the area overhead tends to remain constant regardless of the degree of pipelining applied to the design, while the throughput increases with pipeline depth. In addition, these experiments reveal that pipelining reduces power considerably in shallow pipelines. Pipelining these designs further does not necessarily lead to significant power reduction. By partitioning these designs into deeper pipelines, they can reach throughputs close to the 100 MFLOPS mark while consuming a modest 1% to 8% of the reconfigurable fabric within a Virtex-II XC2VX000 (e.g., XC2V1000 or XC2V6000) FPGA.
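
To make the idea of a low radix digit recurrence concrete, here is a small Python sketch of radix-2 restoring division and square root on unsigned integers. It is an illustration under stated assumptions, not the thesis design: the abstract does not say which recurrence variant is used, the code does not handle IEEE-754 exponents, rounding, or special values, and the function names are hypothetical. Each loop iteration produces one result bit, which is the step a pipelined hardware design would unroll into stages.

```python
def restoring_divide(dividend, divisor, n_bits):
    """Radix-2 restoring division of an n_bits-wide unsigned dividend."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    quotient, remainder = 0, 0
    for i in range(n_bits - 1, -1, -1):
        # Shift in the next dividend bit, then try to subtract the divisor.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        trial = remainder - divisor
        if trial >= 0:
            remainder = trial                 # subtraction accepted: bit 1
            quotient = (quotient << 1) | 1
        else:
            quotient <<= 1                    # partial remainder restored: bit 0
    return quotient, remainder


def digit_recurrence_sqrt(value, n_pairs):
    """Radix-2 integer square root of a value that is 2 * n_pairs bits wide."""
    root, remainder = 0, 0
    for i in range(n_pairs - 1, -1, -1):
        # Bring down the next pair of radicand bits.
        remainder = (remainder << 2) | ((value >> (2 * i)) & 3)
        trial = (root << 2) | 1               # equals (2*root + 1)**2 - (2*root)**2
        if remainder >= trial:
            remainder -= trial
            root = (root << 1) | 1
        else:
            root <<= 1
    return root, remainder
```

For example, `restoring_divide(100, 7, 8)` yields quotient 14 and remainder 2, and `digit_recurrence_sqrt(10, 2)` yields root 3 and remainder 1.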