61 research outputs found

    Implementations of high performance architecture for IEEE 754 compliant floating-point adders

    Get PDF
    This thesis presents a direct iteration and implementation on a high per-formance architecture for IEEE 754 floating-point addition. This thesis improves on the previous architecture's implementation in a variety of sub-operations required for IEEE 754 floating-point addition, which are focused on directly improving critical path delay performance. A key element of this paper is the introduction of a flagged-prefix adder within the main carry-propagation path of an end-around-carry adder. It also provides detailed documentation for the design of IEEE 754 compliant floating-point adders. This is particularly emphasized for uncommon operations and control logic used throughout floating-point addition, including denormalized numbers and multi-precision logic. The full design for this architecture has support for binary16, binary32, and binary64 operations. The full extended range provided by denormalized IEEE 754 values is supported. It also has conversion support between IEEE 754 and two's complement integer values in either binary16, binary32, or binary64 precision. The performance comparisons shown are synthesis results in cmos32soi 32nm GF technology and ARM-based standard cells

    IEEE Compliant Double-Precision FPU and 64-bit ALU with Variable Latency Integer Divider

    Get PDF
    Together the arithmetic logic unit (ALU) and floating-point unit (FPU) perform all of the mathematical and logic operations of computer processors. Because they are used so prominently, they fall in the critical path of the central processing unit - often becoming the bottleneck, or limiting factor for performance. As such, the design of a high-speed ALU and FPU is vital to creating a processor capable of performing up to the demanding standards of today\u27s computer users. In this paper, both a 64-bit ALU and a 64-bit FPU are designed based on the reduced instruction set computer architecture. The ALU performs the four basic mathematical operations - addition, subtraction, multiplication and division - in both unsigned and two\u27s complement format, basic logic operations and shifting. The division algorithm is a novel approach, using a comparison multiples based SRT divider to create a variable latency integer divider. The floating-point unit performs the double-precision floating-point operations add, subtract, multiply and divide, in accordance with the IEEE 754 standard for number representation and rounding. The ALU and FPU were implemented in VHDL, simulated in ModelSim, and constrained and synthesized using Synopsys Design Compiler (2006.06). They were synthesized using TSMC 0.1 3nm CMOS technology. The timing, power and area synthesis results were recorded, and, where applicable, compared to those of the corresponding DesignWare components.The ALU synthesis reported an area of 122,215 gates, a power of 384 mW, and a delay of 2.89 ns - a frequency of 346 MHz. The FPU synthesis reported an area 84,440 gates, a delay of 2.82 ns and an operating frequency of 355 MHz. It has a maximum dynamic power of 153.9 mW

    Measuring Improvement when Using HUB Formats to Implement Floating-Point Systems under Round-to-Nearest

    Get PDF
    MEC bajo TIN2013-42253-PThis paper analyzes the benefits of using HUB formats to implement floating-point arithmetic under round-tonearest mode from a quantitative point of view. Using HUB formats to represent numbers allows the removal of the rounding logic of arithmetic units, including sticky-bit computation. This is shown for floating-point adders, multipliers, and converters. Experimental analysis demonstrates that HUB formats and the corresponding arithmetic units maintain the same accuracy as conventional ones. On the other hand, the implementation of these units, based on basic architectures, shows that HUB formats simultaneously improve area, speed, and power consumption. Specifically, based on data obtained from the synthesis, a HUB single-precision adder is about 14% faster but consumes 38% less area and 26% less power than the conventional adder. Similarly, a HUB single-precision multiplier is 17% faster, uses 22% less area, and consumes slightly less power than conventional multiplier. At the same speed, the adder and multiplier achieve area and power reductions of up to 50% and 40%, respectively

    Approximate logic circuits: Theory and applications

    Get PDF
    CMOS technology scaling, the process of shrinking transistor dimensions based on Moore's law, has been the thrust behind increasingly powerful integrated circuits for over half a century. As dimensions are scaled to few tens of nanometers, process and environmental variations can significantly alter transistor characteristics, thus degrading reliability and reducing performance gains in CMOS designs with technology scaling. Although design solutions proposed in recent years to improve reliability of CMOS designs are power-efficient, the performance penalty associated with these solutions further reduces performance gains with technology scaling, and hence these solutions are not well-suited for high-performance designs. This thesis proposes approximate logic circuits as a new logic synthesis paradigm for reliable, high-performance computing systems. Given a specification, an approximate logic circuit is functionally equivalent to the given specification for a "significant" portion of the input space, but has a smaller delay and power as compared to a circuit implementation of the original specification. This contributions of this thesis include (i) a general theory of approximation and efficient algorithms for automated synthesis of approximations for unrestricted random logic circuits, (ii) logic design solutions based on approximate circuits to improve reliability of designs with negligible performance penalty, and (iii) efficient decomposition algorithms based on approxiiii mate circuits to improve performance of designs during logic synthesis. This thesis concludes with other potential applications of approximate circuits and identifies. open problems in logic decomposition and approximate circuit synthesis

    A Structured Design Methodology for High Performance VLSI Arrays

    Get PDF
    abstract: The geometric growth in the integrated circuit technology due to transistor scaling also with system-on-chip design strategy, the complexity of the integrated circuit has increased manifold. Short time to market with high reliability and performance is one of the most competitive challenges. Both custom and ASIC design methodologies have evolved over the time to cope with this but the high manual labor in custom and statistic design in ASIC are still causes of concern. This work proposes a new circuit design strategy that focuses mostly on arrayed structures like TLB, RF, Cache, IPCAM etc. that reduces the manual effort to a great extent and also makes the design regular, repetitive still achieving high performance. The method proposes making the complete design custom schematic but using the standard cells. This requires adding some custom cells to the already exhaustive library to optimize the design for performance. Once schematic is finalized, the designer places these standard cells in a spreadsheet, placing closely the cells in the critical paths. A Perl script then generates Cadence Encounter compatible placement file. The design is then routed in Encounter. Since designer is the best judge of the circuit architecture, placement by the designer will allow achieve most optimal design. Several designs like IPCAM, issue logic, TLB, RF and Cache designs were carried out and the performance were compared against the fully custom and ASIC flow. The TLB, RF and Cache were the part of the HEMES microprocessor.Dissertation/ThesisPh.D. Electrical Engineering 201

    ARITHMETIC LOGIC UNIT ARCHITECTURES WITH DYNAMICALLY DEFINED PRECISION

    Get PDF
    Modern central processing units (CPUs) employ arithmetic logic units (ALUs) that support statically defined precisions, often adhering to industry standards. Although CPU manufacturers highly optimize their ALUs, industry standard precisions embody accuracy and performance compromises for general purpose deployment. Hence, optimizing ALU precision holds great potential for improving speed and energy efficiency. Previous research on multiple precision ALUs focused on predefined, static precisions. Little previous work addressed ALU architectures with customized, dynamically defined precision. This dissertation presents approaches for developing dynamic precision ALU architectures for both fixed-point and floating-point to enable better performance, energy efficiency, and numeric accuracy. These new architectures enable dynamically defined precision, including support for vectorization. The new architectures also prevent performance and energy loss due to applying unnecessarily high precision on computations, which often happens with statically defined standard precisions. The new ALU architectures support different precisions through the use of configurable sub-blocks, with this dissertation including demonstration implementations for floating point adder, multiply, and fused multiply-add (FMA) circuits with 4-bit sub-blocks. For these circuits, the dynamic precision ALU speed is nearly the same as traditional ALU approaches, although the dynamic precision ALU is nearly twice as large

    High Performance Reliable Variable Latency Carry Select Addition

    Get PDF
    This thesis describes the design and the optimization of a low overhead, high performance variable latency carry select adder. Previous researchers believed that the traditional adder has reached the theoretical speed bound. However, a considerable portion of hardware resources of the traditional adder is only used in the worst case. Based on this observation, variable latency adders have been proposed to improve on the theoretical limit, but such adders incur significant area overhead. By combining previous variable latency adders with carry select addition, this work describes a novel variable latency carry select adder. Applying carry select addition in the variable latency adder design significantly reduces the area overhead and increases its performance. This variable latency adder is faster and smaller than previous variable latency adders. Furthermore, this variable latency adder can be optimized to be faster and smaller than the fastest adder generated by the Synopsys DesignWare building block IP

    VLSI Circuits for Approximate Computing

    Get PDF
    Approximate Computing has recently emerged as a promising solution to enhance circuits performance by relaxing the requisite on exact calculations. Multimedia and Machine Learning constitute a typical example of error resilient, albeit compute-intensive, applications. In this dissertation, the design and optimization of approximate fundamental VLSI digital blocks is investigated. In chapter one the theoretical motivations of Approximate Computing, from the VLSI perspective, are discussed. In chapter two my research activity about approximate adders is reported. In this chapter approximate adders for both traditional non-error tolerant applications and error resilient applications are discussed. In chapter three precision-scalable units are investigated. Real-time precision scalability allows adapting the precision level of the unit with the precision requirements of the applications. In this context my research activities regarding approximate Multiply-and-Accumulate and memory units are described. In chapter four a precision-scalable approximate convolver for computer vision applications is discussed. This is composed of both the approximate Multiply-and-Accumulate and memory units, presented in the chapter three

    Low Power Memory/Memristor Devices and Systems

    Get PDF
    This reprint focusses on achieving low-power computation using memristive devices. The topic was designed as a convenient reference point: it contains a mix of techniques starting from the fundamental manufacturing of memristive devices all the way to applications such as physically unclonable functions, and also covers perspectives on, e.g., in-memory computing, which is inextricably linked with emerging memory devices such as memristors. Finally, the reprint contains a few articles representing how other communities (from typical CMOS design to photonics) are fighting on their own fronts in the quest towards low-power computation, as a comparison with the memristor literature. We hope that readers will enjoy discovering the articles within

    HEVC와 JPEG 하드웨어 부호화기를 위한 DCT의 Approximate Calculation

    Get PDF
    학위논문 (석사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2015. 8. 이혁재.Discrete Cosine Transform (DCT) is widely used for various image and video compression applications because of its excellent energy compaction property. DCT is computationally intensive and the calculations are parallelizable. Therefore it is often implemented in hardware for speeding up the calculation. However due to large size of DCT or multiple modules of DCT required for some applications, the hardware area taken up by DCT in image or video encoders become significant. The DCT required in most applications doesnt need to be exact. Taking advantage of this fact, here a novel approach is provided to reduce the hardware area cost of the DCT module. The DCT hardware module consists of combinational logic and memory. Both the components are reduced and the complete implementation is described. The application being aimed at is for HEVC and JPEG, however the idea is applicable to any DCT hardware implementation. Finally the degradation caused to encoded image and video in terms of BDBR is discussed and the gate count results from the synthesis is provided.Chapter 1 Introduction 1 1.1 2D DCT Hardware Module . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Pipelining the process . . . . . . . . . . . . . . . . . . . . 5 1.2 Approximate DCT . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Chapter 2 Related Works 9 Chapter 3 The Moving Window Idea for Bit-Width Reduction 12 3.1 ML Recovery for Moving Window . . . . . . . . . . . . . . . . . 16 Chapter 4 Approximate DCT for HEVC 19 4.1 HEVC Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2 HEVC Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.3 DCT in HEVC Encoder . . . . . . . . . . . . . . . . . . . . . . . 21 4.4 Approximate DCT in HEVC . . . . . . . . . . . . . . . . . . . . 23 4.4.1 The three components of the DCT module . . . . . . . . 27 4.4.2 Optimizing Partial Butterfly Adder/Subtractors . . . . . 29 4.4.3 Optimizing the multiplication module . . . . . . . . . . . 30 4.4.3.1 Multiple Constant Multiplication (MCM) . . . . 32 4.4.3.2 Approximate MCM . . . . . . . . . . . . . . . . 32 4.4.4 Optimizing the transpose memory . . . . . . . . . . . . . 36 Chapter 5 Approximate DCT for JPEG 39 5.1 JPEG Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.2 Approximate DCT . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.3 Application of Moving Window to DCT transpose memory . . . 42 5.3.1 Ideal implementation . . . . . . . . . . . . . . . . . . . . . 43 5.3.2 Window position based on first row . . . . . . . . . . . . . 43 5.3.2.1 Cases of failure . . . . . . . . . . . . . . . . . . . 46 5.3.3 Position based on first column . . . . . . . . . . . . . . . 48 5.3.3.1 Cases of failure . . . . . . . . . . . . . . . . . . . 49 5.4 Hybrid implementation . . . . . . . . . . . . . . . . . . . . . . . . 50 Chapter 6 Experimental Results 54 6.1 HEVC Experiments and Results . . . . . . . . . . . . . . . . . . 55 6.2 JPEG Experiments and Results . . . . . . . . . . . . . . . . . . . 55 Chapter 7 Conclusion 64Maste
    corecore