260 research outputs found

    Towards an optimised VLSI design algorithm for the constant matrix multiplication problem

    Get PDF
    The efficient design of multiplierless implementations of constant matrix multipliers is challenged by the huge solution search spaces even for small scale problems. Previous approaches tend to use hill-climbing algorithms risking sub-optimal results. The proposed algorithm avoids this by exploring parallel solutions. The computational complexity is tackled by modelling the problem in a format amenable to genetic programming and hardware acceleration. Results show an improvement on state of the art algorithms with future potential for even greater savings

    Multiplierless CSD techniques for high performance FPGA implementation of digital filters.

    Get PDF
    I leverage FastCSD to develop a new, high performance iterative multiplierless structure based on a novel real-time CSD recoding, so that more zero partial products are introduced. Up to 66.7% zero partial products occur compared to 50% in the traditional modified Booth's recoding. Also, this structure reduces the non-zero partial products to a minimum. As a result, the number of arithmetic operations in the carry-save structure is reduced. Thus, an overall speed-up, as well as low-power consumption can be achieved. Furthermore, because the proposed structure involves real time CSD recoding and does not require a fixed value for the multiplier input to be known a priori, the proposed multiplier can be applied to implement digital filters with non-fixed filter coefficients, such as adaptive filters.My work is based on a dramatic new technique for converting between 2's complement and CSD number systems, and results in high-performance structures that are particularly effective for implementing adaptive systems in reconfigurable logic.My research focus is on two key ideas for improving DSP performance: (1) Develop new high performance, efficient shift-add techniques ("multiplierless") to implement the multiply-add operations without the need for a traditional multiplier structure. (2) There is a growing trend toward design prototyping and even production in FPGAs as opposed to dedicated DSP processors or ASICs; leverage this trend synergistically with the new multiplierless structures to improve performance.Implementation of digital signal processing (DSP) algorithms in hardware, such as field programmable gate arrays (FPGAs), requires a large number of multipliers. Fast, low area multiply-adds have become critical in modern commercial and military DSP applications. In many contemporary real-time DSP and multimedia applications, system performance is severely impacted by the limitations of currently available speed, energy efficiency, and area requirement of an onboard silicon multiplier.I also introduce a new multi-input Canonical Signed Digit (CSD) multiplier unit, which requires fewer shift/add/subtract operations and reduced CSD number conversion overhead compared to existing techniques. This results in reduced power consumption and area requirements in the hardware implementation of DSP algorithms. Furthermore, because all the products are produced simultaneously, the multiplication speed and thus the throughput are improved. The multi-input multiplier unit is applied to implement digital filters with non-fixed filter coefficients, such as adaptive filters. The implementation cost of these digital filters can be further reduced by limiting the wordlength of the input signal with little or no sacrifice to the filter performance, which is confirmed by my simulation results. The proposed multiplier unit can also be applied to other DSP algorithms, such as digital filter banks or matrix and vector multiplications.Finally, the tradeoff between filter order and coefficient length in the design and implementation of high-performance filters in Field Programmable Gate Arrays (FPGAs) is discussed. Non-minimum order FIR filters are designed for implementation using Canonical Signed Digit (CSD) multiplierless implementation techniques. By increasing the filter order, the length of the coefficients can be decreased without reducing the filter performance. Thus, an overall hardware savings can be achieved.Adaptive system implementations require real-time conversion of coefficients to Canonical Signed Digit (CSD) or similar representations to benefit from multiplierless techniques for implementing filters. Multiplierless approaches are used to reduce the hardware and increase the throughput. This dissertation introduces the first non-iterative hardware algorithm to convert 2's complement numbers to their CSD representations (FastCSD) using a fixed number of shift and logic operations. As a result, the power consumption and area requirements required for hardware implementation of DSP algorithms in which the coefficients are not known a priori can be greatly reduced. Because all CSD digits are produced simultaneously, the conversion speed and thus the throughput are improved when compared to overlap-and-scan techniques such as Booth's recoding

    Reconfigurable Adaptive Multiple Transform Hardware Solutions for Versatile Video Coding

    Get PDF
    Computer aided design is nowadays a must to quickly provide optimized circuits, to cope with stringent time to market constraints, and to be able to guarantee colliding constrained requirements. Design automation is exploited, whenever possible, to speed up the design process and relieve the developers from error prone customization, optimization and tuning phases. In this work we study the possibility of adopting automated algorithms for the optimization of reconfigurable multiple constant multiplication circuits. In particular, an exploration of novel reconfigurable Adaptive Multiple Transform circuital solutions adoptable in video coding applications has been conducted. These solutions have also been compared with the unique similar work at the state of the art, revealing to be beneficial under certain constraints. Moreover, the proposed approach has been generalized with some guidelines helpful to designers facing similar problems

    Towards the Multiple Constant Multiplication at Minimal Hardware Cost

    Full text link
    Multiple Constant Multiplication (MCM) over integers is a frequent operation arising in embedded systems that require highly optimized hardware. An efficient way is to replace costly generic multiplication by bit-shifts and additions, i.e. a multiplierless circuit. In this work, we improve the state-of-the-art optimal approach for MCM, based on Integer Linear Programming (ILP). We introduce a new lower-level hardware cost, based on counting the number of one-bit adders and demonstrate that it is strongly correlated with the LUT count. This new model for the multiplierless MCM circuits permitted us to consider intermediate truncations that permit to significantly save resources when a full output precision is not required. We incorporate the error propagation rules into our ILP model to guarantee a user-given error bound on the MCM results. The proposed ILP models for multiple flavors of MCM are implemented as an open-source tool and, combined with the FloPoCo code generator, provide a complete coefficient-to-VHDL flow. We evaluate our models in extensive experiments, and propose an in-depth analysis of the impact that design metrics have on actually synthesized hardware.Comment: 10 pages, 3 tables, 6 figures, journal submissio

    A new Low-Power recoding algorithm for multiplierless single/multiple constant multiplication.

    No full text
    International audienceOptimizing the number of additions in constant coefficient multiplication is conjectured to be a NP-hard problem. In this paper, we report a new heuristic requiring an average of 29.10 % and 10.61 % less additions than the standard canonical signed digit representation (CSD) and the double base number system (DBNS), respectively, for 64-bit coefficients. The maximum number of additions per coefficient is bounded by (N/4)+2, and the time-complexity of the recoding is linearly proportional to N, where N is the bit-size of the constant. These performances are achieved using a new redundant version of radix-28 recoding

    Evolutionary design of digital VLSI hardware

    Get PDF

    Efficient Correlator for OFDM Synchronization on FPGA Implementation

    Get PDF
    Orthogonal frequency division multiplexing (OFDM) is a viable technology for high-speed data transmission by virtue of its spectral efficiency and robustness to multi-path fading. These advantages can be achieved only with good synchronization both in time and frequency. The existing system consist of pipeline structure of correlator using DSP48E1 slices but it consumes a large amount of area and power and also it introduces a delay. Hence for overcoming these problems we proposed a new model. The proposed model is designed using a custom designed hardware instead of using DSP slices. It reduces area consumption, delay, power consumption and also it can be used in any FPGA architecture. DOI: 10.17762/ijritcc2321-8169.15070

    Programmable architectures for the automated design of digital FIR filters using evolvable hardware

    Get PDF

    Design of multiplierless correlators for timing synchronization in IEEE 802.11a wireless LANs

    Get PDF
    Timing synchronization for IEEE 802.11a WLANs requires using a correlator to correlate the received signal with a known waveform. Straightforward implementation of this correlator results in the need to perform 320 million complex multiplications per second. This significant requirement can be eliminated by using multiplierless correlators. In this paper, multiplierless correlators are designed based on constraining the real and imaginary parts of correlator coefficients to be sums of powers of two. Sets of coefficients that yield good synchronization performance for simple A WGN channels are first identified; then their goodness for indoor communication environments is verified by simulation for multipath fading channels. Several multiplierless correlators are found. Comparison among these correlators identifies a good one that requires to perform only 26 addition/subtraction operations per correlator output while a similar synchronization performance can be maintained.published_or_final_versio
    • …
    corecore