6 research outputs found

    Reflections on 10 years of FloPoCo

    Get PDF
    International audienceThe FloPoCo open-source arithmetic core generator project started modestly in 2008 [1], with a few parametric floating point cores. It has since then evolved to become a framework for research on hardware arithmetic cores at large, including among others: LNS arithmetic [2], random number generators [3], elementary functions [4]–[9], specialized operators such as constant multiplication and division [10]–[13], various FPGA-specific optimization techniques [14]–[16], and more recently signal-processing transforms and filters [17], [18] (more references can be found on the project’s web site: http://flopoco.gforge.inria.fr/)

    A Novel VLSI Design On CSKA Of Binary Tree Adder With Compaq Area And High Throughput

    Get PDF
    Addition is one of the most basic operations performed in all computing units, including microprocessors and digital signal processors. It is also a basic unit utilized in various complicated algorithms of multiplication and division. Efficient implementation of an adder circuit usually revolves around reducing the cost to propagate the carry between successive bit positions. Multi-operand adders are important arithmetic design blocks especially in the addition of partial products of hardware multipliers. The multi-operand adders (MOAs) are widely used in the modern low-power and high-speed portable very-large-scale integration systems for image and signal processing applications such as digital filters, transforms, convolution neural network architecture. Hence, a new high-speed and area efficient adder architecture is proposed using pre-compute bitwise addition followed by carry prefix computation logic to perform the three-operand binary addition that consumes substantially less area, low power and drastically reduces the adder delay. Further, this project is enhanced by using Modified carry bypass adder to further reduce more density and latency constraints. Modified carry skip adder introduces simple and low complex carry skip logic to reduce parameters constraints. In this proposal work, designed binary tree adder (BTA) is analyzed to find the possibilities for area minimization. Based on the analysis, critical path of carry is taken into the new logic implementation and the corresponding design of CSKP are proposed for the BTA with AOI, OAI

    LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations

    Full text link
    We propose two tiers of modifications to FPGA logic cell architecture to deliver a variety of performance and utilization benefits with only minor area overheads. In the irst tier, we augment existing commercial logic cell datapaths with a 6-input XOR gate in order to improve the expressiveness of each element, while maintaining backward compatibility. This new architecture is vendor-agnostic, and we refer to it as LUXOR. We also consider a secondary tier of vendor-speciic modifications to both Xilinx and Intel FPGAs, which we refer to as X-LUXOR+ and I-LUXOR+ respectively. We demonstrate that compressor tree synthesis using generalized parallel counters (GPCs) is further improved with the proposed modifications. Using both the Intel adaptive logic module and the Xilinx slice at the 65nm technology node for a comparative study, it is shown that the silicon area overhead is less than 0.5% for LUXOR and 5-6% for LUXOR+, while the delay increments are 1-6% and 3-9% respectively. We demonstrate that LUXOR can deliver an average reduction of 13-19% in logic utilization on micro-benchmarks from a variety of domains.BNN benchmarks benefit the most with an average reduction of 37-47% in logic utilization, which is due to the highly-efficient mapping of the XnorPopcount operation on our proposed LUXOR+ logic cells.Comment: In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20), February 23-25, 2020, Seaside, CA, US

    Karatsuba with Rectangular Multipliers for FPGAs

    Get PDF
    Best paper awardInternational audienceThis work presents an extension of Karatsuba's method to efficiently use rectangular multipliers as a base for larger multipliers. The rectangular multipliers that motivate this work are the embedded 18×25-bit signed multipliers found in the DSP blocks of recent Xilinx FPGAs: The traditional Karatsuba approach must under-use them as square 18×18 ones. This work shows that rectangular multipliers can be efficiently exploited in a modified Karatsuba method if their input word sizes have a large greatest common divider. In the Xilinx FPGA case, this can be obtained by using the embedded multipliers as 16×24 unsigned and as 17×25 signed ones. The obtained architectures are implemented with due detail to architectural features such as the pre-adders and post-adders available in Xilinx DSP blocks. They are synthesized and compared with traditional Karatsuba, but also with (non-Karatsuba) state-of-the-art tiling techniques that make use of the full rectangular multipliers. The proposed technique improves resource consumption and performance for multipliers of numbers larger than 64 bits

    Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing

    Full text link
    Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lends itself well to high-performance implementations. Many matrix multiplication-dependent applications can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing, previously utilized the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an arithmetic architecture that better utilizes 6-LUTs. The improved BISMO achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a Xilinx UltraScale+ MPSoC.Comment: Invited paper at ACM TRETS as extension of FPL'18 paper arXiv:1806.0886

    Resource Optimal Truncated Multipliers for FPGAs

    Get PDF
    International audienceThis proposal presents the resource optimal design of truncated multipliers targeting field programmable gate arrays (FPGAs). In contrast to application specific integrated circuits (ASICs), the design for FPGAs has some distinct design challenges due to many possibilities of computing the partial products using logic-based or DSP-based sub-multipliers. To tackle this, we extend a previously proposed tiling methodology which translates the multiplier design into a geometrical problem: the target multiplier is represented by a board that has to be covered by tiles representing the sub-multipliers. The tiling with the least resources can be found with integer linear programming (ILP). Our extension considers the error of possibly unoccupied positions of the board and determines the tiling with the least resources that respects the maximal allowed error bound. This error bound is chosen such that a faithfully rounded truncated multiplier is obtained. Compared to previous designs that use a fixed number of guard bits or optimize at the level of the dot diagrams, this allows a much better use of sub-multipliers resulting in significant area savings without sacrificing the timing
    corecore