Search CORE

7 research outputs found

Recommended from our members

Advanced microarchitecture and circuit design techniques for on-chip memories in CMOS technology

Author: Hsu Steven K.
Publication venue: 'Oregon State University'
Publication date
Field of study

In modern on-chip memories, an increasing demand for higher performance, lower power, reduced area, and improved robustness creates a rising need for advanced microarchitecture and circuit design techniques. Particularly in large-signal multi-ported register files, these advanced design techniques include: (i) multi-banked arrays, (ii) multi-frequency arrays, (iii) multi-bit width gating, (iv) multi-latency cycle times, (v) multi-threshold devices, and (vi) multi-strength keepers. In modern microprocessors, register files are important ingredients, but the increasing number of register file read/write ports and entries can produce a bottleneck. This thesis discusses various new techniques, to address the challenges facing register file designers, and to satisfy microprocessor requirements. The scalability of register files is a concern in modern microprocessors. As microprocessors become wider to exploit instruction level parallelism, this increases the amount of read/write ports. In turn this results in quadratic growth in register file area, decreasing frequency and increasing the power consumption. Multi-banked and multi-frequency register files reduce area and power consumption by relieving the read/write port congestion. Multi-bit width register files reduce active power during read/write operations by gating the clock/wordline. Pipelined register files improve frequency by reducing logic depth, but require multiple cycles for read/write operations. Multi-latency register files contain variable access cycle times, which are dependent on the physical locality of the data. This improves overall microprocessor performance and recovers lost instructions per cycle. As instruction window size continues to expand in modern microprocessors, the resulting demand for additional register file entries requires increased use of wide-OR dynamic circuits. However, these circuits, primarily found in local/global bitlines, are susceptible to leakage noise. In a multi-threshold process, a self-reverse bias technique exploits the use of leaky low-VTH devices, reducing bitline leakage and improving robustness. This circuit topology improves bitline delay from reduced keeper contention. Downsized keepers improve bitline delay in low leakage conditions; stronger keepers improve bitline robustness in high leakage conditions. Utilizing this concept, register files with multi-strength keepers enable robust operation across a wide range of process, voltage, and temperature. These various design techniques show excellent promise in improving performance, power, area, and robustness of multi-ported register files in modern microprocessors

ScholarsArchive@OSU

Tertiary-Tree 12-GHz 32-bit Adder in 65nm Technology

Author: Agah Amir
Emami-Neyestanak Azita
Fakhraie S. Mehdi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2007
Field of study

This paper presents a new 32-bit adder structure with 12 GHz low-power operation in 65nm technology. The Fast Conditional Sparse-Tree Logic (FCSL) is based on modifying the initial Sparse-Tree architecture [1] to enhance its speed using tertiary trees and applying a carry-select scheme in some of the more significant bits. This design has been compared with the Sparse-Tree adder and the Low-Voltage Swing adder in terms of speed and power. It has been shown that speed can be improved using FCSL architecture while keeping the power at a comparable level

Crossref

Caltech Authors

A Constant Delay Logic Style - An Alternative Way of Logic Design

Author: Chuang Pierce I-Jen
Publication venue: 'University of Waterloo'
Publication date: 01/01/2010
Field of study

High performance, energy efficient logic style has always been a popular research topic in the field of very large scale integrated (VLSI) circuits because of the continuous demands of ever increasing circuit operating frequency. The invention of the dynamic logic in the 80s is one of the answers to this request as it allows designers to implement high performance circuit block, i.e., arithmetic logic unit (ALU), at an operating frequency that traditional static and pass transistor CMOS logic styles are difficult to achieve. However, the performance enhancement comes with several costs, including reduced noise margin,charge-sharing noise, and higher power dissipation due to higher data activity. Furthermore, dynamic logic has gradually lost its performance advantage over static logic due to the increased self-loading ratio in deep-submicron technology (65nm and below) because of the additional NMOS CLK footer transistor. Because of dynamic logic's limitations and diminished speed reward, a slowly rising need has emerged in the past decade to explore new logic style that goes beyond dynamic logic. In this thesis a constant delay (CD) logic style is proposed. The constant delay characteristic of this logic style regardless of the logic expression makes it suitable in implementing complicated logic expression such as addition. Moreover, CD logic exhibits a unique characteristic where the output is pre-evaluated before the inputs from the preceding stage is ready. This feature enables performance advantage over static and dynamic logic styles in a single cycle, multi-stage circuit block. Several design considerations including appropriate timing window width adjustment to reduce power consumption and maintain sufficient noise margin to ensure robust operations are discussed and analyzed. Using 65nm general purpose CMOS technology, the proposed logic demonstrates an average speed up of 94% and 56% over static and dynamic logic respectively in five different logic expressions. Post layout simulation results of 8-bit ripple carry adders conclude that CD-based design is 39% and 23% faster than the static and dynamic-based adders respectively. For ultra-high speed applications, CD-based design exhibits improved energy, power-delay product, and energy-delay product efficiency compared to static and dynamic counterparts

University of Waterloo's Institutional Repository

Circuit Techniques for Adaptive and Reliable High Performance Computing.

Author: Giridhar Bharan
Publication venue
Publication date
Field of study

Increasing power density with process scaling has caused stagnation in the clock speed of modern microprocessors. Accordingly, designers have adopted message passing and shared memory based multicore architectures in order to keep up with the rapidly rising demand for computing throughput. At the same time, applications are not entirely parallel and improving single-thread performance continues to remain critical. Additionally, reliability is also worsening with process scaling, and margining for failures due to process and environmental variations in modern technologies consumes an increasingly large portion of the power/performance envelope. In the wake of multicore computing, reliability of signal synchronization between the cores is also becoming increasingly critical. This forces designers to search for alternate efficient methods to improve compute performance while addressing reliability. Accordingly, this dissertation presents innovative circuit and architectural techniques for variation-tolerance, performance and reliability targeted at datapath logic, signal synchronization and memories. Firstly, a domino logic based design style for datapath logic is presented that uses Adaptive Robustness Tuning (ART) in addition to timing speculation to provide up to 71% performance gains over conventional domino logic in 32bx32b multiplier in 65nm CMOS. Margins are reduced until functionality errors are detected, that are used to guide the tuning. Secondly, for signal synchronization across clock domains, a new class of dynamic logic based synchronizers with single-cycle synchronization latency is presented, where pulses, rather than stable intermediate voltages cause metastability. Such pulses are amplified using skewed inverters to improve mean time between failures by ~1e6x over jamb latches and double flip-flops at 2GHz in 65nm CMOS. Thirdly, a reconfigurable sensing scheme for 6T SRAMs is presented that employs auto-zero calibration and pre-amplification to improve sensing reliability (by up to 1.2 standard deviations of NMOS threshold voltage in 28nm CMOS); this increased reliability is in turn traded for ~42% sensing speedup. Finally, a main memory architecture design methodology to address reliability and power in the context of Exascale computing systems is presented. Based on 3D-stacked DRAMs, the methodology co-optimizes DRAM access energy, refresh power and the increased cost of error resilience, to meet stringent power and reliability constraints.PhDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/107238/1/bharan_1.pd

Deep Blue Documents at the University of Michigan

High-Performance, Energy-Efficient CMOS Arithmetic Circuits

Author: Chuang Pierce I-Jen
Publication venue: 'University of Waterloo'
Publication date: 01/01/2014
Field of study

In a modern microprocessor, datapath/arithmetic circuits have always been an important building block in delivering high-performance, energy-efficient computing, because arithmetic operations such as addition and binary number comparison are two of the most commonly used computing instructions. Besides the manufacturing CMOS process, the two most critical design considerations for arithmetic circuits are the logic style and micro-architecture. In this thesis, a constant-delay (CD) logic style is proposed targeting full-custom high-speed applications. The constant delay characteristic of this logic style (regardless of the logic type) makes it suitable for implementing complicated logic expressions such as addition. CD logic exhibits a unique characteristic where the output is pre-evaluated before the inputs from the preceding stage are ready. This feature enables a performance advantage over static and dynamic domino logic styles in a single cycle, multi-stage circuit block. Several design considerations including timing window width adjustment and clock distribution are discussed. Using a 65-nm general-purpose CMOS technology, the proposed logic style demonstrates an average speedup of 94% and 56% over static and dynamic domino logic, respectively, in five different logic gates. Simulation results of 8-bit ripple carry adders conclude that CD logic is 39% and 23% faster than the static and dynamic-based adders, respectively. CD logic also demonstrates 39% speedup and 64% (22%) energy-delay product reduction from static logic at 100% (10%) data activity in 32-bit carry lookahead adders. To confirm CD logic's potential, a 148 ps, single-cycle 64-bit adder with CD logic implemented in the critical path is fabricated in a 65-nm, 1-V CMOS process. A new 64-bit Ling adder micro-architecture, which utilizes both inversion and absorption properties to minimize the number of CD logic and the number of logic stage in the critical path, is also proposed. At 1-V supply, this adder's measured worst-case power and leakage power are 135 mW and 0.22 mW, respectively. A single-cycle 64-bit binary comparator utilizing a radix-2 tree structure is also proposed. This comparator architecture is specifically designed for static logic to achieve both low-power and high-performance operation, especially in low input data activity environments. At 65-nm technology with 25% (10%) data activity, the proposed design demonstrates 2.3x (3.5x) and 3.7x (5.8x) power and energy-delay product efficiency, respectively. This comparator is also 2.7x faster at iso-energy (80 fJ) or 3.3x more energy-efficient at iso-delay (200 ps) than existing designs. An improved comparator, where CD logic is utilized in the critical path to achieve high performance without sacrificing the overall energy efficiency, is also realized in a 65-nm 1-V CMOS process. At 1-V supply, the proposed comparator's measured delay is 167 ps, and has an average power and a leakage power of 2.34 mW and 0.06 mW, respectively. At 0.3-pJ iso-energy or 250-ps iso-delay budget, the proposed comparator with CD logic is 20% faster or 17% more energy-efficient compared to a comparator implemented with just the static logic

University of Waterloo's Institutional Repository

Precise Timing of Digital Signals: Circuits and Applications

Author: Nummer Muhammad
Publication venue: 'University of Waterloo'
Publication date: 01/06/2007
Field of study

With the rapid advances in process technologies, the performance of state-of-the-art integrated circuits is improving steadily. The drive for higher performance is accompanied with increased emphasis on meeting timing constraints not only at the design phase but during device operation as well. Fortunately, technology advancements allow for even more precise control of the timing of digital signals, an advantage which can be used to provide solutions that can address some of the emerging timing issues. In this thesis, circuit and architectural techniques for the precise timing of digital signals are explored. These techniques are demonstrated in applications addressing timing issues in modern digital systems. A methodology for slow-speed timing characterization of high-speed pipelined datapaths is proposed. The technique uses a clock-timing circuit to create shifted versions of a slow-speed clock. These clocks control the data flow in the pipeline in the test mode. Test results show that the design provides an average timing resolution of 52.9ps in 0.18μm CMOS technology. Results also demonstrate the ability of the technique to track the performance of high-speed pipelines at a reduced clock frequency and to test the clock-timing circuit itself. In order to achieve higher resolutions than that of an inverter/buffer stage, a differential (vernier) delay line is commonly used. To allow for the design of differential delay lines with programmable delays, a digitally-controlled delay-element is proposed. The delay element is monotonic and achieves a high degree of transfer characteristics' (digital code vs. delay) linearity. Using the proposed delay element, a sub-1ps resolution is demonstrated experimentally in 0.18μm CMOS. The proposed delay element with a fixed delay step of 2ps is used to design a high-precision all-digital phase aligner. High-precision phase alignment has many applications in modern digital systems such as high-speed memory controllers, clock-deskew buffers, and delay and phase-locked loops. The design is based on a differential delay line and a variation tolerant phase detector using redundancy. Experimental results show that the phase aligner's range is from -264ps to +247ps which corresponds to an average delay step of approximately 2.43ps. For various input phase difference values, test results show that the difference is reduced to less than 2ps at the output of the phase aligner. On-chip time measurement is another application that requires precise timing. It has applications in modern automatic test equipment and on-chip characterization of jitter and skew. In order to achieve small conversion time, a flash time-to-digital converter is proposed. Mismatch between the various delay comparators limits the time measurement precision. This is demonstrated through an experiment in which a 6-bit, 2.5ps resolution flash time-to-digital converter provides an effective resolution of only 4-bits. The converter achieves a maximum conversion rate of 1.25GSa/s

University of Waterloo's Institutional Repository

Monolithic electronic-photonic integration in state-of-the-art CMOS processes

Author: Orcutt Jason S. (Jason Scott)
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2012
Field of study

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Cataloged from student submitted PDF version of thesis.Includes bibliographical references (p. 388-407).As silicon CMOS transistors have scaled, increasing the density and energy efficiency of computation on a single chip, the off-chip communication link to memory has emerged as the major bottleneck within modern processors. Photonic devices promise to break this bottleneck with superior bandwidth-density and energy-efficiency. Initial work by many research groups to adapt photonic device designs to a silicon-based material platform demonstrated suitable independent performance for such links. However, electronic-photonic integration attempts to date have been limited by the high cost and complexity associated with modifying CMOS platforms suitable for modern high-performance computing applications. In this work, we instead utilize existing state-of-the-art electronic CMOS processes to fabricate integrated photonics by: modifying designs to match the existing process; preparing a design-rule compliant layout within industry-standard CAD tools; and locally-removing the handle silicon substrate in the photonic region through post-processing. This effort has resulted in the fabrication of seven test chips from two major foundries in 28, 45, 65 and 90 nm CMOS processes. Of these efforts, a single die fabricated through a widely available 45nm SOI-CMOS mask-share foundry with integrated waveguides with 3.7 dB/cm propagation loss alongside unmodified electronics with less than 5 ps inverter stage delay serves as a proof-of-concept for this approach. Demonstrated photonic devices include high-extinction carrier-injection modulators, 8-channel wavelength division multiplexing filter banks and low-efficiency silicon germanium photodetectors. Simultaneous electronic-photonic functionality is verified by recording a 600 Mb/s eye diagram from a resonant modulator driven by integrated digital circuits. Initial work towards photonic device integration within the peripheral CMOS flow of a memory process that has resulted in polysilicon waveguide propagation losses of 6.4 dB/cm will also be presented.by Jason S. Orcutt.Ph.D

DSpace@MIT