34 research outputs found

    Low Latency Prefix Accumulation Driven Compound MAC Unit for Efficient FIR Filter Implementation

    This article presents a hierarchical MAC unit built on a single compound adder, with assertion-based error correction for speculation variations in the prefix addition, for FIR filter design. VLSI implementations of approximation in prefix adders show significant reductions in delay and complexity, but at the cost of added latency when speculation fails during carry propagation, which is the main reason speculation is not used in parallel-prefix adders for DSP applications. The speculative adder, which is based on the Han-Carlson parallel-prefix adder structure, achieves a better reduction in latency. The article introduces a structured and efficient shift-add technique and explores latency reduction by incorporating approximation in addition. The latency reduction and performance merits of the proposed MAC unit are shown through synthesis on FPGA hardware. Results show that the proposed method outpaces previously proposed MAC designs that use multiplication methods for attaining high speed.
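
As a rough illustration of what a parallel-prefix adder computes, the sketch below models the generate/propagate prefix network in software. It uses a Kogge-Stone style network, which is a related but simpler topology than the Han-Carlson structure named in the abstract, and it is exact (no speculation or error correction). The function name `prefix_add` and the default width are assumptions made for this example, not taken from the article.

```python
def prefix_add(a: int, b: int, width: int = 16) -> int:
    """Add two unsigned integers modulo 2**width with a Kogge-Stone style prefix network.

    Illustrative software model only; names and width are hypothetical.
    """
    mask = (1 << width) - 1
    g = a & b & mask           # generate: bit position creates a carry
    p = (a ^ b) & mask         # propagate: bit position passes a carry along
    half_sum = p               # per-bit sum before carries are applied

    dist = 1
    while dist < width:
        # Combine each position with the group ending `dist` bits below it.
        g = (g | (p & (g << dist))) & mask
        p = (p & (p << dist)) & mask
        dist <<= 1

    # After the network, bit i of g is the carry out of bits 0..i,
    # so the carry into bit i+1 is g shifted left by one.
    carries = (g << 1) & mask
    return half_sum ^ carries


if __name__ == "__main__":
    print(prefix_add(1234, 4321))
```

The loop runs about log2(width) times, which is the source of the short critical path that makes prefix adders attractive in hardware.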

    Optimistic Parallelization of Floating-Point Accumulation

    Floating-point arithmetic is notoriously non-associative due to the limited precision representation which demands intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16 PE design, we demonstrate an average speedup of 6× with randomly generated data and 3-7× with summations extracted from Conjugate Gradient benchmarks
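
The non-associativity mentioned in this abstract, and the observation that sums are still "mostly" associative, can be seen in a few lines of Python. This is only an illustration of the underlying numerical effect, not the optimistic hardware scheme the paper describes; the function names and the test data are assumptions made for the example.

```python
import math
import random


def sequential_sum(values):
    """Accumulate left to right, as a single dependent chain of additions."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_tree_sum(values):
    """Sum in balanced-tree order, as a parallel reduction might."""
    vals = list(values)
    if not vals:
        return 0.0
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])  # odd element carries over to the next level
        vals = nxt
    return vals[0]


# Regrouping the same three operands changes the rounded result.
left = (0.1 + 0.2) + 0.3    # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)   # 0.6

if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.uniform(0.0, 1.0) for _ in range(1000)]
    print(left == right)
    print(sequential_sum(data), pairwise_tree_sum(data), math.fsum(data))
```

The two summation orders typically agree to many digits but are not guaranteed to be bit-identical, which is why a reordered (parallel) sum needs some correction step if the sequential result must be reproduced exactly.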

    VLSI Circuits for Approximate Computing

    Approximate Computing has recently emerged as a promising solution to enhance circuit performance by relaxing the requirement for exact calculations. Multimedia and Machine Learning are typical examples of error-resilient, albeit compute-intensive, applications. In this dissertation, the design and optimization of approximate fundamental VLSI digital blocks is investigated. Chapter one discusses the theoretical motivations of Approximate Computing from the VLSI perspective. Chapter two reports my research activity on approximate adders, covering both traditional non-error-tolerant applications and error-resilient applications. Chapter three investigates precision-scalable units. Real-time precision scalability allows the precision level of the unit to be adapted to the precision requirements of the application. In this context, my research activities regarding approximate Multiply-and-Accumulate and memory units are described. Chapter four discusses a precision-scalable approximate convolver for computer vision applications, composed of the approximate Multiply-and-Accumulate and memory units presented in chapter three.

    Design of Approximate Circuits by Fabrication of False Timing Paths: The Carry Cut-Back Adder

    This paper introduces a novel method for designing approximate circuits by fabricating and exploiting false timing paths, i.e. critical paths that cannot be logically activated. This allows timing constraints to be strongly relaxed while guaranteeing minimal and controlled behavioral change. The technique is applied to an approximate adder architecture, called the Carry Cut-Back Adder (CCBA), in which high-significance stages can cut the carry propagation chain at lower-significance positions. This lightweight approach prevents the logic activation of the carry chain, improving performance and energy efficiency while guaranteeing low worst-case errors. A design methodology is presented along with implementation, error optimization and design-space minimization. The CCBA is shown to be capable of extremely high accuracy while displaying significant circuit savings. For a worst-case precision of 99.999%, energy savings of up to 36% are demonstrated compared to exact adders. Finally, an industry-oriented comparison of 32-bit approximate and truncated adders is carried out for mean and worst-case relative errors. The CCBA outperforms both state-of-the-art and truncated adders for high-accuracy and low-power circuits, confirming the interest of the proposed concept for building highly efficient approximate or precision-scalable hardware accelerators.
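
To make the idea of cutting a carry chain concrete, here is a minimal software sketch of a generic windowed-carry speculative adder: each sum bit sees only the carry produced by the `window` bits directly below it. This is not the CCBA itself, whose cut is driven by high-significance stages; the function names, the 32-bit width and the window sizes are all assumptions chosen for illustration.

```python
import random


def exact_add(a: int, b: int, width: int = 32) -> int:
    """Reference addition modulo 2**width."""
    return (a + b) & ((1 << width) - 1)


def speculative_add(a: int, b: int, width: int = 32, window: int = 8) -> int:
    """Approximate addition in which no carry travels more than `window` bits.

    Hypothetical model: the carry into bit i is computed from bits
    i-window .. i-1 only, assuming a carry-in of 0 to that window.
    """
    result = 0
    for i in range(width):
        lo = max(0, i - window)
        span = i - lo
        mask = (1 << span) - 1
        a_win = (a >> lo) & mask
        b_win = (b >> lo) & mask
        carry_in = (a_win + b_win) >> span      # 0 or 1
        bit = ((a >> i) ^ (b >> i) ^ carry_in) & 1
        result |= bit << i
    return result


def error_rate(window: int, width: int = 32, trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of random operand pairs for which the speculative sum is wrong."""
    rng = random.Random(seed)
    limit = 1 << width
    wrong = 0
    for _ in range(trials):
        a, b = rng.randrange(limit), rng.randrange(limit)
        if speculative_add(a, b, width, window) != exact_add(a, b, width):
            wrong += 1
    return wrong / trials


if __name__ == "__main__":
    for w in (4, 8, 16, 32):
        print(w, error_rate(w, trials=2000))
```

With `window` equal to the full width the model is exact; shrinking the window shortens the longest carry path at the price of occasional wrong results, which is the accuracy-versus-timing tradeoff the abstract refers to.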

    Trading off Energy versus Accuracy in Modern Computing Systems: From Digital Circuit Design to Programming Techniques

    The slowdown of Moore's law, which has been the driving force of the electronics industry over the last five decades, is causing serious problems for the improvement of Integrated Circuits (ICs). Technology scaling is becoming more and more complex and fabrication costs are growing exponentially. Furthermore, the energy gains associated with technology scaling are slowing down. Meanwhile, the expected boom of Internet of Things (IoT) devices requires ultra-low-power ICs able to operate for several years without any user intervention, and energy-efficient computing systems on the server side to process all the gathered data. Approximate computing has emerged as an alternative way to improve the energy efficiency of both high-performance and low-power computing systems by tolerating small and occasional errors. This energy-accuracy tradeoff can be applied to a wide range of over-engineered applications, particularly those involving human senses, such as video and image processing. This thesis first presents an approximate circuit design technique called Gate-Level Pruning (GLP), which consists of selectively removing logic gates from any conventional circuit in order to reduce energy consumption, critical path delay, and silicon area. A Computer Aided Design (CAD) tool has been developed, integrated into the standard digital flow, and evaluated on several arithmetic circuits, achieving up to 78% energy-delay-area savings. It is then shown how this methodology can be applied to a more complex system made of multiple arithmetic blocks and memory: the Discrete Cosine Transform (DCT), which is a key building block for image and video processing applications. Then, the speculative adder technique is presented. It consists of cutting carry chains to significantly relax the circuit's timing constraints, and therefore drastically reduce energy consumption, area and delay. It is shown that this technique leads to errors of a different nature than those produced by gate-level pruning. It is therefore worth combining GLP and speculative adders to obtain even higher savings. This has been verified on IEEE-754 floating-point units integrated in a 65nm process within a low-power multi-core processor. Silicon measurements show up to 27% power, 36% area and 53% power-area savings. The second part of this thesis introduces software techniques to achieve similar energy-accuracy tradeoffs on commercially available processors. By switching from double-precision to single-precision floating-point data types and by exploiting the vectorization capabilities of modern processors, energy consumption can be reduced by a factor of 2 on a Newton method for solving nonlinear equations. To further investigate the origins of these savings, an energy model based on Energy Per Instruction (EPI) has been built. It turns out that less than 6% of the total energy is consumed by arithmetic operations and that savings are achieved mainly by reducing the amount of data transferred between registers, cache and main memory. One way to reduce those power-hungry data movements is to use application-specific hardware accelerators. Unfortunately, a commercial processor cannot embed accelerators for all possible applications. To that end, hardware accelerators are implemented on a Field Programmable Gate Array (FPGA) interconnected with a general-purpose processor to further reduce the energy consumption.

    High-level synthesis of VLSI circuits


    A new floating-point adder FPGA-based implementation using RN-coding of numbers

    A well-known problem in computer science is numerical data representation, which directly affects the design of adder circuits and is one reason why different formats exist: IEEE Std. 754, Half-Unit-Biased (HUB), and Round-to-Nearest (RN). RN has the advantage that rounding to nearest is equivalent to truncating the word. It avoids double rounding errors and intermediate rounding steps, with an exact conversion between formats, making it applicable to general problems. However, there is a lack of research on the hardware implementation of the RN representation. In this work, we propose hardware architectures for binary and floating-point adders, analyzing the performance of the latter in terms of error and resource consumption in FPGAs. To accomplish this, we have developed a one-bit RN-based adder that allows modular designs and takes efficient signal propagation into account, yielding new architectures for both binary and single-precision floating-point adders. The results open new perspectives for further applications.

    Optimal digital system design in deep submicron technology

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006, by Seongmoo Heo. Includes bibliographical references (p. 165-174). The optimization of a digital system in deep submicron technology should be done with two basic principles: energy waste reduction and energy-delay tradeoff. Increased energy resources obtained through energy waste reduction are utilized through energy-delay tradeoffs. The previous practice of obliviously pursuing performance has led to a rapid increase in energy consumption. While energy waste due to unnecessary switching could be reduced with small increases in logic complexity, leakage energy waste still remains a major design challenge. We find that fine-grain dynamic leakage reduction (FG-DLR), turning off small subblocks for short idle intervals, is the key to successful leakage energy saving. We introduce an FG-DLR circuit technique, Leakage Biasing, which uses leakage currents themselves to bias the circuit into the minimum leakage state, and apply it to primary SRAM arrays for bitline leakage reduction (Leakage-Biased Bitlines) and to domino logic (Leakage-Biased Domino). We also introduce another FG-DLR circuit technique, Dynamic Resizing, which dynamically downsizes transistors on idle paths while maintaining the performance along active critical paths, and apply it to static CMOS circuits. We show that significant energy reduction can be achieved at the same computation throughput and communication bandwidth by pipelining logic gates and wires. We find that energy saved by pipelining datapaths is eventually limited by latch energy overhead, leading to a power-optimal pipelining. Structuring global wires into on-chip networks provides a better environment for pipelining and leakage energy saving. We show that the energy-efficiency increase through replacement with dynamically packet-routed networks is bounded by router energy overhead. Finally, we provide a way of relaxing the peak power constraint. We evaluate the use of Activity Migration (AM) for hot spot removal. AM spreads heat by transporting computation to a different location on the die. We show that AM can be used either to increase the power that can be dissipated by a given package, or to lower the operating temperature and hence the operating energy.