13 research outputs found

    Enhancing Precision in Cloud Computing: Implementation of a Novel Floating-Point Format on FPGA

    Get PDF
    In this paper, we propose an Internet-based cloud computing service that provides computing, storage and networking services to multiple users. Computing capacity runs out quickly in cloud computing services as data sizes increase. To fill this shortage of computation capacity, we propose adopting variable precision by implementing the unum (universal number) format. Unum is a number format different from the IEEE Standard for Floating-Point Arithmetic (IEEE 754) floats. Compared with IEEE 754 floats, the outstanding features of unum are the elimination of rounding errors, high information-per-bit and variable precision. As a candidate replacement for IEEE 754 floats, unum can improve the precision of computation while decreasing the bit width required for high-precision numbers. However, owing to its technical complexity, unum had previously been implemented only as a software model; to validate its performance on chip, we implement this arithmetic on an FPGA for the very first time. We also implement a unum-based 16-point FFT on the FPGA. We validate the design and compare its bit width in computation with that of IEEE 754 floats. The experimental results show that unum arithmetic remains correct even in some extreme arithmetic cases in which IEEE 754 floats cannot work properly; furthermore, unum needs a much smaller bit width than IEEE 754 floats at the same precision.
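
    The rounding-error contrast claimed above is easy to reproduce in software. The sketch below is illustrative only (exact rational arithmetic stands in for unum's rounding-free behavior; it is not the authors' FPGA design): IEEE 754 doubles silently accumulate error on a value with no exact binary representation, while the exact representation does not.

        from fractions import Fraction

        # IEEE 754 doubles: 0.1 has no exact binary representation,
        # so repeated addition silently accumulates rounding error.
        acc_float = sum(0.1 for _ in range(10))
        print(acc_float == 1.0)   # False: acc_float is 0.9999999999999999

        # An exact representation (here a rational, standing in for
        # unum's rounding-free arithmetic) keeps the result exact.
        acc_exact = sum(Fraction(1, 10) for _ in range(10))
        print(acc_exact == 1)     # True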

    Power-Efficient Time-Domain Dispersion Compensation Using Optimized FIR Filter Implementation

    Get PDF
    We investigate fixed-point aspects and time-domain ASIC implementations of chromatic dispersion (CD) compensation. An optimized implementation yields a significant reduction in power dissipation for short links, with a further reduction if pulse shaping is taken into account.
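
    The filter being optimized can be sketched from the widely used closed-form time-domain tap design for CD compensation (following Savory's formulation; an assumed baseline for illustration, not the paper's optimized fixed-point implementation):

        import numpy as np

        C = 299792458.0  # speed of light, m/s

        def cd_fir_taps(D_ps_nm_km, wavelength_nm, length_km, rate_GBd):
            """Closed-form FIR taps for time-domain CD compensation
            (baseline design; the paper optimizes beyond this)."""
            D = D_ps_nm_km * 1e-6            # ps/(nm*km) -> s/m^2
            lam = wavelength_nm * 1e-9       # m
            z = length_km * 1e3              # m
            T = 1.0 / (rate_GBd * 1e9)       # sample period, s
            half = int(np.floor(abs(D) * lam**2 * z / (2 * C * T**2)))
            k = np.arange(-half, half + 1)
            return (np.sqrt(1j * C * T**2 / (D * lam**2 * z))
                    * np.exp(-1j * np.pi * C * T**2 * k**2 / (D * lam**2 * z)))

        taps = cd_fir_taps(17.0, 1550.0, 40.0, 56.0)
        print(len(taps))  # tap count, and hence power, grows linearly with link length

    Fixed-point aspects then enter as the word length to which these complex taps and the data path are quantized.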

    Run-Time Reconfigurable FFT Engine

    Get PDF
    This paper develops a system-level architecture for implementing a cost-efficient, FPGA-based real-time FFT engine. The approach considers both hardware cost (in terms of FPGA resource requirements) and performance (in terms of throughput). These two dimensions are optimized using run-time reconfiguration, double buffering and hardware virtualization to reuse the available processing components. The system employs sixteen reconfigurable parallel FFT cores, each a 16-complex-point parallel FFT processor running within the continuous real-time FFT engine. The architecture supports a transform length of 256 complex points as a demonstrator of the design idea, uses fixed-point arithmetic and has been developed using a radix-4 architecture. The parallel Booth technique is chosen for realizing the complex multiplier (required in the basic butterfly operation), since it saves considerable hardware compared with other techniques. Simulation results obtained with the VHDL modeling language and ModelSim software show that the full design can be implemented on a single FPGA platform requiring about 50,000 slices.
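
    The functional behavior of the radix-4 butterfly at the core of such an engine can be sketched as follows (floating point for clarity; the paper's design uses fixed-point arithmetic and a parallel Booth complex multiplier):

        import numpy as np

        def radix4_butterfly(x0, x1, x2, x3, w1=1.0, w2=1.0, w3=1.0):
            """Radix-4 DIF butterfly: eight complex additions plus three
            complex multiplications by the twiddle factors w1, w2, w3."""
            a, b = x0 + x2, x1 + x3
            c, d = x0 - x2, -1j * (x1 - x3)
            return a + b, (c + d) * w1, (a - b) * w2, (c - d) * w3

        # With unit twiddles a single butterfly is a 4-point DFT:
        x = np.array([1 + 2j, 3 - 1j, -2 + 0.5j, 0.25j])
        assert np.allclose(radix4_butterfly(*x), np.fft.fft(x))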

    System-on-chip Computing and Interconnection Architectures for Telecommunications and Signal Processing

    Get PDF
    This dissertation proposes novel architectures and design techniques targeting SoC building blocks for telecommunications and signal processing applications.

    Hardware implementation of Low-Density Parity-Check (LDPC) decoders is approached at both the algorithmic and the architectural level. LDPC codes are a promising coding scheme for future communication standards owing to their outstanding error correction performance. This work proposes a methodology for analyzing the effects of finite-precision arithmetic on error correction performance and hardware complexity, and the methodology is employed throughout to co-design the decoder. First, a low-complexity check node based on the P-output decoding principle is designed and characterized on a CMOS standard-cells library. Results demonstrate an implementation loss below 0.2 dB down to a BER of 10^-8 and a saving in complexity of up to 59% with respect to other works in the recent literature. High-throughput and low-latency issues are addressed with modified single-phase decoding schedules. A new "memory-aware" schedule is proposed, requiring as little as 20% of the memory of traditional two-phase flooding decoding; additionally, throughput is doubled and logic complexity is reduced by 12%. These advantages are traded off against error correction performance, making the solution attractive only for long codes, such as those adopted in the DVB-S2 standard. The "layered decoding" principle is extended to codes not specifically conceived for this technique. The proposed architectures exhibit complexity savings on the order of 40% in both area and power consumption, while the implementation loss is smaller than 0.05 dB.

    Most modern communication standards employ Orthogonal Frequency Division Multiplexing (OFDM) as part of their physical layer. The core of OFDM is the Fast Fourier Transform and its inverse, in charge of symbol (de)modulation. Requirements on throughput and energy efficiency call for FFT hardware implementation, while the ubiquity of the FFT suggests the design of parametric, re-configurable and re-usable IP hardware macrocells. In this context, this thesis describes an FFT/IFFT core compiler particularly suited for the implementation of OFDM communication systems. The tool employs an accuracy-driven configuration engine which automatically profiles the internal arithmetic and generates a core with minimum operand bit-widths and thus minimum circuit complexity. The engine performs a closed-loop optimization over three different internal arithmetic models (fixed-point, block floating-point and convergent block floating-point), using the numerical accuracy budget given by the user as a reference point. The flexibility and re-usability of the proposed macrocell are illustrated through several case studies which encompass all current state-of-the-art OFDM communication standards (WLAN, WMAN, xDSL, DVB-T/H, DAB and UWB). Implementation results are presented for two deep sub-micron standard-cells libraries (65 and 90 nm) and commercially available FPGA devices. Compared with other FFT core compilers, the proposed environment produces macrocells with lower circuit complexity and the same system-level performance (throughput, transform size and numerical accuracy).

    The final part of this dissertation focuses on the Network-on-Chip (NoC) design paradigm, whose goal is building scalable communication infrastructures connecting hundreds of cores. A low-complexity link architecture for mesochronous on-chip communication is discussed. The link enables looser skew constraints in clock tree synthesis, frequency speed-up, power consumption reduction and faster back-end turnarounds. The proposed architecture reaches a maximum clock frequency of 1 GHz on a 65 nm low-leakage CMOS standard-cells library. In a complex test case with a full-blown NoC infrastructure, the link overhead is only 3% of chip area and 0.5% of leakage power consumption. Finally, a new methodology, named metacoding, is proposed. Metacoding generates correct-by-construction, technology-independent RTL codebases for NoC building blocks. The RTL coding phase is abstracted and modeled with an Object Oriented framework, integrated within a commercial tool for IP packaging (Synopsys CoreTools suite). Compared with traditional coding styles based on pre-processor directives, metacoding produces 65% smaller codebases and reduces the number of configurations to verify by up to three orders of magnitude.
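
    Of the three internal arithmetic models mentioned above, block floating-point is the least standard; the idea is that a whole block of samples shares a single exponent. A minimal sketch (illustrative only, not the compiler's accuracy-driven engine):

        import numpy as np

        def bfp_quantize(block, mantissa_bits=12):
            """Block floating-point: pick one shared exponent so the
            largest sample fills the mantissa range, then round every
            sample to a fixed-point mantissa of that many bits."""
            peak = np.max(np.abs(block))
            if peak == 0.0:
                return block.copy(), 0
            exp = int(np.ceil(np.log2(peak)))       # shared block exponent
            scale = 2.0 ** (mantissa_bits - 1 - exp)
            mant = np.round(block * scale)          # fixed-point mantissas
            return mant / scale, exp

        x = np.random.randn(64)
        xq, e = bfp_quantize(x)
        print(e, np.max(np.abs(x - xq)))  # error bounded by half an LSB at the shared exponent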

    Implementing FFT-based digital channelized receivers on FPGA platforms

    Get PDF
    This paper presents an in-depth study of the implementation and characterization of fast Fourier transform (FFT) pipelined architectures suitable for broadband digital channelized receivers. When implementing the FFT algorithm on field-programmable gate array (FPGA) platforms, the primary goal is to maximize throughput and minimize area. Feedback and feedforward architectures have been analyzed with regard to the key design parameters: radix, bit width, number of points and stage scaling. Moreover, a simplification of the FFT algorithm, the monobit FFT, has been implemented in order to achieve faster real-time performance in broadband digital receivers. The influence of the hardware implementation on the performance of digital channelized receivers has been analyzed in depth, revealing interesting implementation trade-offs which should be taken into account when designing this kind of signal processing system on FPGA platforms.
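
    The monobit simplification reduces every input sample to its sign before the transform, so multiplications by the input data degenerate to additions. A rough functional sketch (drawn from the general monobit-receiver idea, not from this paper's specific implementation):

        import numpy as np

        def monobit_fft(x):
            """Monobit FFT: keep only the sign of the I and Q rails.
            Amplitude information is lost, but strong tones still
            dominate the spectrum, which suffices for channelization."""
            q = np.sign(x.real) + 1j * np.sign(x.imag)  # 1-bit per rail
            return np.fft.fft(q)

        n = 256
        t = np.arange(n)
        x = np.exp(2j * np.pi * 37 * t / n) \
            + 0.1 * (np.random.randn(n) + 1j * np.random.randn(n))
        print(np.argmax(np.abs(monobit_fft(x))))  # still peaks at bin 37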

    Generating and Searching Families of FFT Algorithms

    Full text link
    A fundamental question of longstanding theoretical interest is to prove the lowest exact count of real additions and multiplications required to compute a power-of-two discrete Fourier transform (DFT). For 35 years the split-radix algorithm held the record by requiring just 4n log2 n - 6n + 8 arithmetic operations on real numbers for a size-n DFT, and was widely believed to be the best possible. Recent work by Van Buskirk et al. demonstrated improvements to the split-radix operation count by using multiplier coefficients, or "twiddle factors", that are not n-th roots of unity for a size-n DFT. This paper presents a Boolean Satisfiability-based proof of the lowest operation count for certain classes of DFT algorithms. First, we present a novel way to choose new yet valid twiddle factors for the nodes in flowgraphs generated by common power-of-two fast Fourier transform (FFT) algorithms. With this new technique, we can generate a large family of FFTs realizable by a fixed flowgraph. This solution space of FFTs is cast as a Boolean Satisfiability problem, and a modern Satisfiability Modulo Theories solver is applied to search for FFTs requiring the fewest arithmetic operations. Surprisingly, we find that there are FFTs requiring fewer operations than the split-radix even when all twiddle factors are n-th roots of unity. (Comment: preprint submitted on March 28, 2011, to the Journal on Satisfiability, Boolean Modeling and Computation.)
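
    For concreteness, the long-standing split-radix bound (with the logarithm taken base 2, as is standard for power-of-two transforms) can be tabulated directly:

        import math

        # Split-radix count of real additions and multiplications for a
        # size-n DFT, n a power of two: 4*n*log2(n) - 6*n + 8.
        for k in range(2, 8):
            n = 2 ** k
            ops = 4 * n * int(math.log2(n)) - 6 * n + 8
            print(f"n = {n:4d}: {ops:6d} real operations")

    For n = 64 the formula gives 1160 real operations.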

    Studio e progettazione VLSI di un acceleratore hardware per la ricostruzione di ambienti sonori (VLSI Study and Design of a Hardware Accelerator for the Reconstruction of Sound Environments)

    Get PDF
    Recent developments in measurement systems for the acoustic characterization of environments allow an increasingly accurate description of the acoustic properties of spaces by means of the impulse response. Building a database of this kind of information plays an extremely important role both for the preservation of cultural heritage and for film and music audio production. Indeed, the use of the impulse response allows the sonic reconstruction of the acoustic context, both during recording and during playback, starting from pseudo-anechoic signals. Processing these data is computationally very demanding because of the length of the impulse responses, which can reach durations on the order of tens of seconds. This thesis addresses the study and design of a hardware accelerator for the convolution of a pseudo-anechoic audio signal with the impulse response of a sound environment. The goal is the realization of a dedicated platform for professional audio applications. After reviewing the state of the art, a high-level analysis of the system was carried out to determine its outline specifications. The system was then partitioned into its hardware and software components, and the prototyping platform was chosen. The hardware convolution subsystem was described in the VHDL language and functionally verified. The same code was synthesized and verified on FPGA devices, together with the interfaces specifically developed for communication with the software subsystem.
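
    The operation being accelerated, convolution with impulse responses tens of seconds long, is usually handled in software with FFT-based block convolution. A minimal overlap-add sketch (illustrating the computation, not the thesis' VHDL architecture):

        import numpy as np

        def overlap_add_convolve(x, h, block=4096):
            """FFT-based overlap-add convolution of a long signal x with
            an impulse response h, processing x in fixed-size blocks."""
            n_fft = 1
            while n_fft < block + len(h) - 1:   # room for linear convolution
                n_fft *= 2
            H = np.fft.rfft(h, n_fft)           # transform h once, reuse per block
            y = np.zeros(len(x) + len(h) - 1)
            for start in range(0, len(x), block):
                seg = x[start:start + block]
                yseg = np.fft.irfft(np.fft.rfft(seg, n_fft) * H, n_fft)
                end = min(start + n_fft, len(y))
                y[start:end] += yseg[:end - start]  # overlap-add the tails
            return y

        x = np.random.randn(48000)  # 1 s of audio at 48 kHz
        h = np.random.randn(8192) * np.exp(-np.arange(8192) / 2000.0)  # toy response
        print(np.allclose(overlap_add_convolve(x, h), np.convolve(x, h)))  # True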

    Universal FFT core generator

    Get PDF
    This thesis presents a special-purpose processor for multi-dimensional discrete Fourier transform (DFT) computation. The processor, called a Universal fast Fourier transform (FFT) processor, is capable of computing an arbitrary multi-dimensional DFT with a fixed number of input points. The processor can be configured to compute DFTs of different dimensions simply by rearranging the input data and using a different set of constants, called generalized twiddle factors, throughout the computation. The basic computational process remains the same independent of dimension. The processor does not rely on one-dimensional FFT computation as does the standard approach using the row-column algorithm, but instead utilizes an algorithm called the Dimensionless FFT, whose computational data flow is the same as that of the FFT algorithm due to Pease. Since the computation performed by the Universal FFT processor is the same as the Pease FFT, an implementation can be obtained by modifying a one-dimensional FFT processor based on the Pease algorithm. The implementation presented in this thesis is obtained in this way from the core generator by Nordin, Milder, Hoe, and Püschel, which generates a variety of cores with various space and time tradeoffs for implementation on field-programmable gate arrays (FPGAs). The performance obtained is comparable to vendor-supplied cores but offers much greater flexibility of design choices with different area and throughput tradeoffs. The resulting Universal FFT core generator can be used to instantiate multi-dimensional FPGA cores with approximately the same performance and area obtained in the one-dimensional case by Nordin, Milder, Hoe, and Püschel.
    M.S., Computer Science -- Drexel University, 200
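
    For contrast, the standard row-column approach that the Universal FFT processor avoids reduces a multi-dimensional DFT to one-dimensional FFTs along each axis; a short sketch of that baseline (not of the Dimensionless FFT itself):

        import numpy as np

        def row_column_dft2(x):
            """Row-column algorithm: a 2-D DFT computed as 1-D FFTs over
            the rows and then the columns. The Dimensionless FFT replaces
            this with a single, dimension-independent data flow."""
            y = np.fft.fft(x, axis=1)      # 1-D DFT of every row
            return np.fft.fft(y, axis=0)   # then of every column

        x = np.random.randn(16, 16) + 1j * np.random.randn(16, 16)
        print(np.allclose(row_column_dft2(x), np.fft.fft2(x)))  # True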