190 research outputs found

    Measuring Improvement when Using HUB Formats to Implement Floating-Point Systems under Round-to-Nearest

    Get PDF
    MEC bajo TIN2013-42253-PThis paper analyzes the benefits of using HUB formats to implement floating-point arithmetic under round-tonearest mode from a quantitative point of view. Using HUB formats to represent numbers allows the removal of the rounding logic of arithmetic units, including sticky-bit computation. This is shown for floating-point adders, multipliers, and converters. Experimental analysis demonstrates that HUB formats and the corresponding arithmetic units maintain the same accuracy as conventional ones. On the other hand, the implementation of these units, based on basic architectures, shows that HUB formats simultaneously improve area, speed, and power consumption. Specifically, based on data obtained from the synthesis, a HUB single-precision adder is about 14% faster but consumes 38% less area and 26% less power than the conventional adder. Similarly, a HUB single-precision multiplier is 17% faster, uses 22% less area, and consumes slightly less power than conventional multiplier. At the same speed, the adder and multiplier achieve area and power reductions of up to 50% and 40%, respectively

    Implementations of high performance architecture for IEEE 754 compliant floating-point adders

    Get PDF
    This thesis presents a direct iteration and implementation on a high per-formance architecture for IEEE 754 floating-point addition. This thesis improves on the previous architecture's implementation in a variety of sub-operations required for IEEE 754 floating-point addition, which are focused on directly improving critical path delay performance. A key element of this paper is the introduction of a flagged-prefix adder within the main carry-propagation path of an end-around-carry adder. It also provides detailed documentation for the design of IEEE 754 compliant floating-point adders. This is particularly emphasized for uncommon operations and control logic used throughout floating-point addition, including denormalized numbers and multi-precision logic. The full design for this architecture has support for binary16, binary32, and binary64 operations. The full extended range provided by denormalized IEEE 754 values is supported. It also has conversion support between IEEE 754 and two's complement integer values in either binary16, binary32, or binary64 precision. The performance comparisons shown are synthesis results in cmos32soi 32nm GF technology and ARM-based standard cells

    Readiness of Quantum Optimization Machines for Industrial Applications

    Full text link
    There have been multiple attempts to demonstrate that quantum annealing and, in particular, quantum annealing on quantum annealing machines, has the potential to outperform current classical optimization algorithms implemented on CMOS technologies. The benchmarking of these devices has been controversial. Initially, random spin-glass problems were used, however, these were quickly shown to be not well suited to detect any quantum speedup. Subsequently, benchmarking shifted to carefully crafted synthetic problems designed to highlight the quantum nature of the hardware while (often) ensuring that classical optimization techniques do not perform well on them. Even worse, to date a true sign of improved scaling with the number of problem variables remains elusive when compared to classical optimization techniques. Here, we analyze the readiness of quantum annealing machines for real-world application problems. These are typically not random and have an underlying structure that is hard to capture in synthetic benchmarks, thus posing unexpected challenges for optimization techniques, both classical and quantum alike. We present a comprehensive computational scaling analysis of fault diagnosis in digital circuits, considering architectures beyond D-wave quantum annealers. We find that the instances generated from real data in multiplier circuits are harder than other representative random spin-glass benchmarks with a comparable number of variables. Although our results show that transverse-field quantum annealing is outperformed by state-of-the-art classical optimization algorithms, these benchmark instances are hard and small in the size of the input, therefore representing the first industrial application ideally suited for testing near-term quantum annealers and other quantum algorithmic strategies for optimization problems.Comment: 22 pages, 12 figures. Content updated according to Phys. Rev. Applied versio

    High Performance Digital Circuit Techniques

    Get PDF
    Achieving high performance is one of the most difficult challenges in designing digital circuits. Flip-flops and adders are key blocks in most digital systems and must therefore be designed to yield highest performance. In this thesis, a new high performance serial adder is developed while power consumption is attained. Also, a statistical framework for the design of flip-flops is introduced that ensures that such sequential circuits meet timing yield under performance criteria. Firstly, a high performance serial adder is developed. The new adder is based on the idea of having a constant delay for the addition of two operands. While conventional adders exhibit logarithmic delay, the proposed adder works at a constant delay order. In addition, the new adder's hardware complexity is in a linear order with the word length, which consequently exhibits less area and power consumption as compared to conventional high performance adders. The thesis demonstrates the underlying algorithm used for the new adder and followed by simulation results. Secondly, this thesis presents a statistical framework for the design of flip-flops under process variations in order to maximize their timing yield. In nanometer CMOS technologies, process variations significantly impact the timing performance of sequential circuits which may eventually cause their malfunction. Therefore, developing a framework for designing such circuits is inevitable. Our framework generates the values of the nominal design parameters; i.e., the size of gates and transmission gates of flip-flop such that maximum timing yield is achieved for flip-flops. While previous works focused on improving the yield of flip-flops, less research was done to improve the timing yield in the presence of process variations

    ARITHMETIC LOGIC UNIT ARCHITECTURES WITH DYNAMICALLY DEFINED PRECISION

    Get PDF
    Modern central processing units (CPUs) employ arithmetic logic units (ALUs) that support statically defined precisions, often adhering to industry standards. Although CPU manufacturers highly optimize their ALUs, industry standard precisions embody accuracy and performance compromises for general purpose deployment. Hence, optimizing ALU precision holds great potential for improving speed and energy efficiency. Previous research on multiple precision ALUs focused on predefined, static precisions. Little previous work addressed ALU architectures with customized, dynamically defined precision. This dissertation presents approaches for developing dynamic precision ALU architectures for both fixed-point and floating-point to enable better performance, energy efficiency, and numeric accuracy. These new architectures enable dynamically defined precision, including support for vectorization. The new architectures also prevent performance and energy loss due to applying unnecessarily high precision on computations, which often happens with statically defined standard precisions. The new ALU architectures support different precisions through the use of configurable sub-blocks, with this dissertation including demonstration implementations for floating point adder, multiply, and fused multiply-add (FMA) circuits with 4-bit sub-blocks. For these circuits, the dynamic precision ALU speed is nearly the same as traditional ALU approaches, although the dynamic precision ALU is nearly twice as large

    HEVC์™€ JPEG ํ•˜๋“œ์›จ์–ด ๋ถ€ํ˜ธํ™”๊ธฐ๋ฅผ ์œ„ํ•œ DCT์˜ Approximate Calculation

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2015. 8. ์ดํ˜์žฌ.Discrete Cosine Transform (DCT) is widely used for various image and video compression applications because of its excellent energy compaction property. DCT is computationally intensive and the calculations are parallelizable. Therefore it is often implemented in hardware for speeding up the calculation. However due to large size of DCT or multiple modules of DCT required for some applications, the hardware area taken up by DCT in image or video encoders become significant. The DCT required in most applications doesnt need to be exact. Taking advantage of this fact, here a novel approach is provided to reduce the hardware area cost of the DCT module. The DCT hardware module consists of combinational logic and memory. Both the components are reduced and the complete implementation is described. The application being aimed at is for HEVC and JPEG, however the idea is applicable to any DCT hardware implementation. Finally the degradation caused to encoded image and video in terms of BDBR is discussed and the gate count results from the synthesis is provided.Chapter 1 Introduction 1 1.1 2D DCT Hardware Module . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Pipelining the process . . . . . . . . . . . . . . . . . . . . 5 1.2 Approximate DCT . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Chapter 2 Related Works 9 Chapter 3 The Moving Window Idea for Bit-Width Reduction 12 3.1 ML Recovery for Moving Window . . . . . . . . . . . . . . . . . 16 Chapter 4 Approximate DCT for HEVC 19 4.1 HEVC Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2 HEVC Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.3 DCT in HEVC Encoder . . . . . . . . . . . . . . . . . . . . . . . 21 4.4 Approximate DCT in HEVC . . . . . . . . . . . . . . . . . . . . 23 4.4.1 The three components of the DCT module . . . . . . . . 27 4.4.2 Optimizing Partial Butterfly Adder/Subtractors . . . . . 29 4.4.3 Optimizing the multiplication module . . . . . . . . . . . 30 4.4.3.1 Multiple Constant Multiplication (MCM) . . . . 32 4.4.3.2 Approximate MCM . . . . . . . . . . . . . . . . 32 4.4.4 Optimizing the transpose memory . . . . . . . . . . . . . 36 Chapter 5 Approximate DCT for JPEG 39 5.1 JPEG Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.2 Approximate DCT . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.3 Application of Moving Window to DCT transpose memory . . . 42 5.3.1 Ideal implementation . . . . . . . . . . . . . . . . . . . . . 43 5.3.2 Window position based on first row . . . . . . . . . . . . . 43 5.3.2.1 Cases of failure . . . . . . . . . . . . . . . . . . . 46 5.3.3 Position based on first column . . . . . . . . . . . . . . . 48 5.3.3.1 Cases of failure . . . . . . . . . . . . . . . . . . . 49 5.4 Hybrid implementation . . . . . . . . . . . . . . . . . . . . . . . . 50 Chapter 6 Experimental Results 54 6.1 HEVC Experiments and Results . . . . . . . . . . . . . . . . . . 55 6.2 JPEG Experiments and Results . . . . . . . . . . . . . . . . . . . 55 Chapter 7 Conclusion 64Maste

    Serial-data computation in VLSI

    Get PDF

    Profile-directed specialisation of custom floating-point hardware

    No full text
    We present a methodology for generating floating-point arithmetic hardware designs which are, for suitable applications, much reduced in size, while still retaining performance and IEEE-754 compliance. Our system uses three key parts: a profiling tool, a set of customisable floating-point units and a selection of system integration methods. We use a profiling tool for floating-point behaviour to identify arithmetic operations where fundamental elements of IEEE-754 floating-point may be compromised, without generating erroneous results in the common case. In the uncommon case, we use simple detection logic to determine when operands lie outside the range of capabilities of the optimised hardware. Out-of-range operations are handled by a separate, fully capable, floatingpoint implementation, either on-chip or by returning calculations to a host processor. We present methods of system integration to achieve this errorcorrection. Thus the system suffers no compromise in IEEE-754 compliance, even when the synthesised hardware would generate erroneous results. In particular, we identify from input operands the shift amounts required for input operand alignment and post-operation normalisation. For operations where these are small, we synthesise hardware with reduced-size barrel-shifters. We also propose optimisations to take advantage of other profile-exposed behaviours, including removing the hardware required to swap operands in a floating-point adder or subtractor, and reducing the exponent range to fit observed values. We present profiling results for a range of applications, including a selection of computational science programs, Spec FP 95 benchmarks and the FFMPEG media processing tool, indicating which would be amenable to our method. Selected applications which demonstrate potential for optimisation are then taken through to a hardware implementation. We show up to a 45% decrease in hardware size for a floating-point datapath, with a correctable error-rate of less then 3%, even with non-profiled datasets

    An instruction systolic array architecture for multiple neural network types

    Get PDF
    Modern electronic systems, especially sensor and imaging systems, are beginning to incorporate their own neural network subsystems. In order for these neural systems to learn in real-time they must be implemented using VLSI technology, with as much of the learning processes incorporated on-chip as is possible. The majority of current VLSI implementations literally implement a series of neural processing cells, which can be connected together in an arbitrary fashion. Many do not perform the entire neural learning process on-chip, instead relying on other external systems to carry out part of the computation requirements of the algorithm. The work presented here utilises two dimensional instruction systolic arrays in an attempt to define a general neural architecture which is closer to the biological basis of neural networks - it is the synapses themselves, rather than the neurons, that have dedicated processing units. A unified architecture is described which can be programmed at the microcode level in order to facilitate the processing of multiple neural network types. An essential part of neural network processing is the neuron activation function, which can range from a sequential algorithm to a discrete mathematical expression. The architecture presented can easily carry out the sequential functions, and introduces a fast method of mathematical approximation for the more complex functions. This can be evaluated on-chip, thus implementing the entire neural process within a single system. VHDL circuit descriptions for the chip have been generated, and the systolic processing algorithms and associated microcode instruction set for three different neural paradigms have been designed. A software simulator of the architecture has been written, giving results for several common applications in the field

    Low Power Memory/Memristor Devices and Systems

    Get PDF
    This reprint focusses on achieving low-power computation using memristive devices. The topic was designed as a convenient reference point: it contains a mix of techniques starting from the fundamental manufacturing of memristive devices all the way to applications such as physically unclonable functions, and also covers perspectives on, e.g., in-memory computing, which is inextricably linked with emerging memory devices such as memristors. Finally, the reprint contains a few articles representing how other communities (from typical CMOS design to photonics) are fighting on their own fronts in the quest towards low-power computation, as a comparison with the memristor literature. We hope that readers will enjoy discovering the articles within
    • โ€ฆ
    corecore