26 research outputs found

    Self-Heating Aware Design of ICs in Deep Sub-Micron FDSOI and Bulk Technologies

    Get PDF
    Bulk CMOS technologies left the semiconductor market to the novel device geometries such as FDSOI and FinFET below 30 nm, mainly due to their insufficient electrical characteristics arising from different physical limitations. These innovative solutions enabled the ongoing device scaling to continue. However, the threshold voltage and the power supply values did not shrink with the device sizes, which caused an excessive amount of heat generation in very small dimensions. With the high thermal resistivity materials used in FDSOI and FinFET, the generated heat cannot leave the device easily, which is not the case in bulk. With all of these, modern geometries brought a major problem, which is the self-heating. Due to self-heating effects (SHE), the temperature of a device rises significantly compared to its surroundings. Having very large local temperature brings important reliability issues. Moreover, the electrical behaviour of a device also changes dramatically when its temperature is very large. These facts bring the need of considering SHE and the temperature of each device separately. Nevertheless, in many of today's CAD tools, a single global temperature is applied to all of the devices. Even if some advanced simulation options are used, estimating the temperature of a device is not a simple task as it depends on many parameters. The focus of this thesis is to show the significance of SHE in the design of ICs and provide self-heating aware design guidelines. In order to achieve this, different circuit implementations are studied by considering the SHE. The study consists of two main parts, which are the reliability of the high-speed digital circuits and the performance of analog blocks where noise is critical. Moreover, detailed device-level electro-thermal simulations are performed to explain the self-heating phenomena more in detail and to perform a comparison between bulk and FDSOI. The digital part of the self-heating study is performed on two very high-speed full-custom 64-bit Kogge-Stone adders in 40 nm and 28 nm technologies. Thermal simulations are performed on these blocks to compare SHE in bulk and FDSOI geometries. The comparison of two implementations also provides the increasing significance of SHE with scaling. Extensive heating analyses are performed to find the most critical devices that are the primary heat generators. Design guidelines and solutions are proposed to flatten the temperature profiles in precharged and static logic implementations and to decrease the probability of electromigration. The analog study of the work focuses on the thermal noise performance of LNAs and SHE on the flicker noise. Since thermal noise of a device linearly depends on the temperature, it is directly affected by SHE. To show the amount of SHE on the noise figure, three common gate cascode LNAs operating at 2 GHz with different device lengths are implemented in 28 nm FDSOI. The measurements show that the self-heating effects are clearly observed on the noise figure and the performance of the blocks deviate importantly from the simulations. Moreover, the self-heating effects are significantly more in short channel devices due to their large heat density. Similar experiments are also performed on different test structures in FDSOI at lower frequencies to observe SHE on flicker noise. The experiments show that flicker noise degrades at larger temperatures and more in short channel implementations

    Design of Readout Electronics for the DEEP Particle Detector

    Get PDF
    Along with electromagnetic radiation, the Sun also emits a constant stream of charged particles in the form of solar wind. When these particles enter Earth’s atmosphere through a process known as particle precipitation, they can through a series of chemical reactions produce N Ox and HOx gases. These gases are greenhouse gases and deplete the ozone in the mesosphere and upper stratosphere. It is important to quantify the rate of production of these gases to model the potential climate impact. Existing particle detectors in space are suboptimal because they cannot determine the energy flux and pitch angle distribution of precipitating particles. The primary scientific objective of the DEEP project is to design a particle detector instrument that is specifically designed for particle precipitation measurements. This thesis investigates different data acquisition schemes for handling the signal from a pixel detector. The chosen approach is measuring the width of a shaped pulse to quantify the energy of the particle. Known as Time-over-Threshold, a detector circuit board is designed featuring high-speed comparators as threshold discriminators and the NG-MEDIUM FPGA from NanoXplore to implement the data acquisition. Digitizing the comparator pulse width is done with a Time-to-Digital converter (TDC) implemented in the FPGA fabric. Since the difference in pulse width is small for different energies, a high conversion resolution is required. Two high-resolution TDCs are designed and compared, both of which feature a digital counter and a method of interpolating the counter clock period. The first interpolation method applies the use of a multitapped delay line implemented with hard carry chain resources, and the second method oversamples the input with several equally off-phase sampling clocks. A resolution of 302 ps and a differential non-linearity of 3.26 was achieved with the delay line TDC clocked at 100 MHz. An automatic statistical calibration scheme is included to determine the actual delays of the delay line, utilizing a second asynchronous clock to generate uniformly distributed hits. The asynchronous oversampler resolution is clock frequency dependent and provides a 4-fold improvement to the clock period. The differential nonlinearity approaches zero with close matching of the off-phase clocks and operating frequency. A complete firmware design for the data acquisition and rocket telemetry of the detector is proposed and demonstrated. A simulation of the firmware utilizing each TDC topology is conducted and the delay line TDC is demonstrated to be the most accurate at all operating frequencies and thus the recommended TDC for the DEEP data acquisition.Masteroppgave i fysikkPHYS399MAMN-PHY

    Timing-Error Tolerance Techniques for Low-Power DSP: Filters and Transforms

    Get PDF
    Low-power Digital Signal Processing (DSP) circuits are critical to commercial System-on-Chip design for battery powered devices. Dynamic Voltage Scaling (DVS) of digital circuits can reclaim worst-case supply voltage margins for delay variation, reducing power consumption. However, removing static margins without compromising robustness is tremendously challenging, especially in an era of escalating reliability concerns due to continued process scaling. The Razor DVS scheme addresses these concerns, by ensuring robustness using explicit timing-error detection and correction circuits. Nonetheless, the design of low-complexity and low-power error correction is often challenging. In this thesis, the Razor framework is applied to fixed-precision DSP filters and transforms. The inherent error tolerance of many DSP algorithms is exploited to achieve very low-overhead error correction. Novel error correction schemes for DSP datapaths are proposed, with very low-overhead circuit realisations. Two new approximate error correction approaches are proposed. The first is based on an adapted sum-of-products form that prevents errors in intermediate results reaching the output, while the second approach forces errors to occur only in less significant bits of each result by shaping the critical path distribution. A third approach is described that achieves exact error correction using time borrowing techniques on critical paths. Unlike previously published approaches, all three proposed are suitable for high clock frequency implementations, as demonstrated with fully placed and routed FIR, FFT and DCT implementations in 90nm and 32nm CMOS. Design issues and theoretical modelling are presented for each approach, along with SPICE simulation results demonstrating power savings of 21 – 29%. Finally, the design of a baseband transmitter in 32nm CMOS for the Spectrally Efficient FDM (SEFDM) system is presented. SEFDM systems offer bandwidth savings compared to Orthogonal FDM (OFDM), at the cost of increased complexity and power consumption, which is quantified with the first VLSI architecture

    Low Power Memory/Memristor Devices and Systems

    Get PDF
    This reprint focusses on achieving low-power computation using memristive devices. The topic was designed as a convenient reference point: it contains a mix of techniques starting from the fundamental manufacturing of memristive devices all the way to applications such as physically unclonable functions, and also covers perspectives on, e.g., in-memory computing, which is inextricably linked with emerging memory devices such as memristors. Finally, the reprint contains a few articles representing how other communities (from typical CMOS design to photonics) are fighting on their own fronts in the quest towards low-power computation, as a comparison with the memristor literature. We hope that readers will enjoy discovering the articles within

    Approximate logic circuits: Theory and applications

    Get PDF
    CMOS technology scaling, the process of shrinking transistor dimensions based on Moore's law, has been the thrust behind increasingly powerful integrated circuits for over half a century. As dimensions are scaled to few tens of nanometers, process and environmental variations can significantly alter transistor characteristics, thus degrading reliability and reducing performance gains in CMOS designs with technology scaling. Although design solutions proposed in recent years to improve reliability of CMOS designs are power-efficient, the performance penalty associated with these solutions further reduces performance gains with technology scaling, and hence these solutions are not well-suited for high-performance designs. This thesis proposes approximate logic circuits as a new logic synthesis paradigm for reliable, high-performance computing systems. Given a specification, an approximate logic circuit is functionally equivalent to the given specification for a "significant" portion of the input space, but has a smaller delay and power as compared to a circuit implementation of the original specification. This contributions of this thesis include (i) a general theory of approximation and efficient algorithms for automated synthesis of approximations for unrestricted random logic circuits, (ii) logic design solutions based on approximate circuits to improve reliability of designs with negligible performance penalty, and (iii) efficient decomposition algorithms based on approxiiii mate circuits to improve performance of designs during logic synthesis. This thesis concludes with other potential applications of approximate circuits and identifies. open problems in logic decomposition and approximate circuit synthesis

    Design of Soft Error Robust High Speed 64-bit Logarithmic Adder

    Get PDF
    Continuous scaling of the transistor size and reduction of the operating voltage have led to a significant performance improvement of integrated circuits. However, the vulnerability of the scaled circuits to transient data upsets or soft errors, which are caused by alpha particles and cosmic neutrons, has emerged as a major reliability concern. In this thesis, we have investigated the effects of soft errors in combinational circuits and proposed soft error detection techniques for high speed adders. In particular, we have proposed an area-efficient 64-bit soft error robust logarithmic adder (SRA). The adder employs the carry merge Sklansky adder architecture in which carries are generated every 4 bits. Since the particle-induced transient, which is often referred to as a single event transient (SET) typically lasts for 100~200 ps, the adder uses time redundancy by sampling the sum outputs twice. The sampling instances have been set at 110 ps apart. In contrast to the traditional time redundancy, which requires two clock cycles to generate a given output, the SRA generates an output in a single clock cycle. The sampled sum outputs are compared using a 64-bit XOR tree to detect any possible error. An energy efficient 4-input transmission gate based XOR logic is implemented to reduce the delay and the power in this case. The pseudo-static logic (PSL), which has the ability to recover from a particle induced transient, is used in the adder implementation. In comparison with the space redundant approach which requires hardware duplication for error detection, the SRA is 50% more area efficient. The proposed SRA is simulated for different operands with errors inserted at different nodes at the inputs, the carry merge tree, and the sum generation circuit. The simulation vectors are carefully chosen such that the SET is not masked by error masking mechanisms, which are inherently present in combinational circuits. Simulation results show that the proposed SRA is capable of detecting 77% of the errors. The undetected errors primarily result when the SET causes an even number of errors and when errors occur outside the sampling window

    Circuit Design, Architecture and CAD for RRAM-based FPGAs

    Get PDF
    Field Programmable Gate Arrays (FPGAs) have been indispensable components of embedded systems and datacenter infrastructures. However, energy efficiency of FPGAs has become a hard barrier preventing their expansion to more application contexts, due to two physical limitations: (1) The massive usage of routing multiplexers causes delay and power overheads as compared to ASICs. To reduce their power consumption, FPGAs have to operate at low supply voltage but sacrifice performance because the transistors drive degrade when working voltage decreases. (2) Using volatile memory technology forces FPGAs to lose configurations when powered off and to be reconfigured at each power on. Resistive Random Access Memories (RRAMs) have strong potentials in overcoming the physical limitations of conventional FPGAs. First of all, RRAMs grant FPGAs non-volatility, enabling FPGAs to be "Normally powered off, Instantly powered on". Second, by combining functionality of memory and pass-gate logic in one unique device, RRAMs can greatly reduce area and delay of routing elements. Third, when RRAMs are embedded into datpaths, the performance of circuits can be independent from their working voltage, beyond the limitations of CMOS circuits. However, researches and development of RRAM-based FPGAs are in their infancy. Most of area and performance predictions were achieved without solid circuit-level simulations and sophisticated Computer Aided Design (CAD) tools, causing the predicted improvements to be less convincing. In this thesis,we present high-performance and low-power RRAM-based FPGAs fromtransistorlevel circuit designs to architecture-level optimizations and CAD tools, using theoretical analysis, industrial electrical simulators and novel CAD tools. We believe that this is the first systematic study in the field, covering: From a circuit design perspective, we propose efficient RRAM-based programming circuits and routing multiplexers through both theoretical analysis and electrical simulations. The proposed 4T(ransitor)1R(RAM) programming structure demonstrates significant improvements in programming current, when compared to most popular 2T1R programming structure. 4T1R-based routingmultiplexer designs are proposed by considering various physical design parasitics, such as intrinsic capacitance of RRAMs and wells doping organization. The proposed 4T1R-based multiplexers outperformbest CMOS implementations significantly in area, delay and power at both nominal and near-Vt regime. From a CAD perspective, we develop a generic FPGA architecture exploration tool, FPGASPICE, modeling a full FPGA fabric with SPICE and Verilog netlists. FPGA-SPICE provides different levels of testbenches and techniques to split large SPICE netlists, in order to obtain better trade-off between simulation time and accuracy. FPGA-SPICE can capture area and power characteristics of SRAM-based and RRAM-based FPGAs more accurately than the currently best analyticalmodels. From an architecture perspective, we propose architecture-level optimizations for RRAMbased FPGAs and quantify their minimumrequirements for RRAM devices. Compared to the best SRAM-based FPGAs, an optimized RRAM-based FPGA architecture brings significant reduction in area, delay and power respectively. In particular, RRAM-based FPGAs operating in the near-Vt regime demonstrate a 5x power improvement without delay overhead as compared to optimized SRAM-based FPGA operating at nominal working voltage

    Novel ultra-energy-efficient reversible designs of sequential logic quantum-dot cellular automata flip-flop circuits

    Get PDF
    The version of record of this article, first published in [The Journal of Supercomputing], is available online at Publisher’s website: http://dx.doi.org/10.1007/s11227-023-05134-1Quantum-dot cellular automata (QCA) is a technological approach to implement digital circuits with exceptionally high integration density, high switching frequency, and low energy dissipation. QCA circuits are a potential solution to the energy dissipation issues created by shrinking microprocessors with ultra-high integration densities. Current QCA circuit designs are irreversible, yet reversible circuits are known to increase energy efficiency. Thus, the development of reversible QCA circuits will further reduce energy dissipation. This paper presents novel reversible and irreversible sequential QCA set/reset (SR), data (D), Jack Kilby (JK), and toggle (T) flip-flop designs based on the majority gate that utilizes the universal, standard, and efficient (USE) clocking scheme, which allows the implementation of feedback paths and easy routing for sequential QCA-based circuits. The simulation results confirm that the proposed reversible QCA USE sequential flip-flop circuits exhibit energy dissipation less than the Landauer energy limit. Irreversible QCA USE flip-flop designs, although having higher energy dissipation, sometimes have floorplan areas and delay times less than those of reversible designs; therefore, they are also explored. The trade-offs between the energy dissipation versus the area cost and delay time for the reversible and irreversible QCA circuits are examined comprehensively

    GigaHertz Symposium 2010

    Get PDF

    Energy-precision tradeoffs in the graphics pipeline

    Get PDF
    The energy consumption of a graphics processing unit (GPU) is an important factor in its design, whether for a server, desktop, or mobile device. Mobile products, such as smart phones, tablets, and laptop computers, rely on batteries to function; the less the demand for power is on these batteries, the longer they will last before needing to be recharged. GPUs used in servers and desktops, while not dependent on a battery for operation, are still limited by the efficiency of power supplies and heat dissipation techniques. In this dissertation, I propose to lower the energy consumption of GPUs by reducing the precision of floating-point arithmetic in the graphics pipeline and the data sent and stored on- and off-chip. The key idea behind this work is twofold: energy can be saved through a systematic and targeted reduction in the number of bits 1) computed and 2) communicated. Reducing the number of bits computed will necessarily reduce either the precision or range of a floating point number. I focus on saving energy by way of reducing precision, which can exploit the over-provisioning of bits in many stages of the graphics pipeline. Reducing the number of bits communicated takes several forms. First, I propose enhancements to existing compression schemes for off-chip buffers to save bandwidth. I also suggest a simple extension that exploits unused bits in reduced-precision data undergoing compression. Finally, I present techniques for saving energy in on-chip communication of reduced-precision data. By designing and simulating variable-precision arithmetic circuits with promising energy versus precision characteristics and tradeoffs, I have developed an energy model for GPUs. Using this model and my techniques, I have shown that significant savings (up to 70% in computation in the vertex and pixel shader stages) are possible by reducing the precision of the arithmetic. Further, my compression approaches have enabled improvements of 1.26x over past work, and a general-purpose compressor design has achieved bandwidth savings of 34%, 87%, and 65% for color, depth, and geometry data, respectively, which is competitive with past work. Lastly, an initial exploration in signal gating unused lines in on-chip buses has suggested savings of 13-48% for the tested applications' traffic from a multiprocessor's register file to its L1 cache
    corecore