300 research outputs found

    A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

    Get PDF
    Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each utilized optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing-down the research progress. This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing to assess the design choices for each building block separately. Reported optimizations range from up to 10'000x memory savings to 33x energy reductions, providing chip designers an overview of design choices for implementing efficient low power neural network accelerators

    Design of Low-Voltage Digital Building Blocks and ADCs for Energy-Efficient Systems

    Get PDF
    Increasing number of energy-limited applications continue to drive the demand for designing systems with high energy efficiency. This tutorial covers the main building blocks of a system implementation including digital logic, embedded memories, and analog-to-digital converters and describes the challenges and solutions to designing these blocks for low-voltage operation

    Clock Generator Circuits for Low-Power Heterogeneous Multiprocessor Systems-on-Chip

    Get PDF
    In this work concepts and circuits for local clock generation in low-power heterogeneous multiprocessor systems-on-chip (MPSoCs) are researched and developed. The targeted systems feature a globally asynchronous locally synchronous (GALS) clocking architecture and advanced power management functionality, as for example fine-grained ultra-fast dynamic voltage and frequency scaling (DVFS). To enable this functionality compact clock generators with low chip area, low power consumption, wide output frequency range and the capability for ultra-fast frequency changes are required. They are to be instantiated individually per core. For this purpose compact all digital phase-locked loop (ADPLL) frequency synthesizers are developed. The bang-bang ADPLL architecture is analyzed using a numerical system model and optimized for low jitter accumulation. A 65nm CMOS ADPLL is implemented, featuring a novel active current bias circuit which compensates the supply voltage and temperature sensitivity of the digitally controlled oscillator (DCO) for reduced digital tuning effort. Additionally, a 28nm ADPLL with a new ultra-fast lock-in scheme based on single-shot phase synchronization is proposed. The core clock is generated by an open-loop method using phase-switching between multi-phase DCO clocks at a fixed frequency. This allows instantaneous core frequency changes for ultra-fast DVFS without re-locking the closed loop ADPLL. The sensitivity of the open-loop clock generator with respect to phase mismatch is analyzed analytically and a compensation technique by cross-coupled inverter buffers is proposed. The clock generators show small area (0.0097mm2 (65nm), 0.00234mm2 (28nm)), low power consumption (2.7mW (65nm), 0.64mW (28nm)) and they provide core clock frequencies from 83MHz to 666MHz which can be changed instantaneously. The jitter performance is compliant to DDR2/DDR3 memory interface specifications. Additionally, high-speed clocks for novel serial on-chip data transceivers are generated. The ADPLL circuits have been verified successfully by 3 testchip implementations. They enable efficient realization of future low-power MPSoCs with advanced power management functionality in deep-submicron CMOS technologies.In dieser Arbeit werden Konzepte und Schaltungen zur lokalen Takterzeugung in heterogenen Multiprozessorsystemen (MPSoCs) mit geringer Verlustleistung erforscht und entwickelt. Diese Systeme besitzen eine global-asynchrone lokal-synchrone Architektur sowie FunktionalitĂ€t zum Power Management, wie z.B. das feingranulare, schnelle Skalieren von Spannung und Taktfrequenz (DVFS). Um diese FunktionalitĂ€t zu realisieren werden kompakte Taktgeneratoren benötigt, welche eine kleine ChipflĂ€che einnehmen, wenig Verlustleitung aufnehmen, einen weiten Bereich an Ausgangsfrequenzen erzeugen und diese sehr schnell Ă€ndern können. Sie sollen individuell pro Prozessorkern integriert werden. Dazu werden kompakte volldigitale Phasenregelkreise (ADPLLs) entwickelt, wobei eine bang-bang ADPLL Architektur numerisch modelliert und fĂŒr kleine Jitterakkumulation optimiert wird. Es wird eine 65nm CMOS ADPLL implementiert, welche eine neuartige Kompensationsschlatung fĂŒr den digital gesteuerten Oszillator (DCO) zur Verringerung der SensitivitĂ€t bezĂŒglich Versorgungsspannung und Temperatur beinhaltet. ZusĂ€tzlich wird eine 28nm CMOS ADPLL mit einer neuen Technik zum schnellen Einschwingen unter Nutzung eines Phasensynchronisierers realisiert. Der Prozessortakt wird durch ein neuartiges Phasenmultiplex- und Frequenzteilerverfahren erzeugt, welches es ermöglicht die Taktfrequenz sofort zu Ă€ndern um schnelles DVFS zu realisieren. Die SensitivitĂ€t dieses Frequenzgenerators bezĂŒglich Phasen-Mismatch wird theoretisch analysiert und durch Verwendung von kreuzgekoppelten TaktverstĂ€rkern kompensiert. Die hier entwickelten Taktgeneratoren haben eine kleine ChipflĂ€che (0.0097mm2 (65nm), 0.00234mm2 (28nm)) und Leistungsaufnahme (2.7mW (65nm), 0.64mW (28nm)). Sie stellen Frequenzen von 83MHz bis 666MHz bereit, welche sofort geĂ€ndert werden können. Die Schaltungen erfĂŒllen die Jitterspezifikationen von DDR2/DDR3 Speicherinterfaces. ZusĂ€tzliche können schnelle Takte fĂŒr neuartige serielle on-Chip Verbindungen erzeugt werden. Die ADPLL Schaltungen wurden erfolgreich in 3 Testchips erprobt. Sie ermöglichen die effiziente Realisierung von zukĂŒnftigen MPSoCs mit Power Management in modernsten CMOS Technologien

    Cross-Layer Optimization for Power-Efficient and Robust Digital Circuits and Systems

    Full text link
    With the increasing digital services demand, performance and power-efficiency become vital requirements for digital circuits and systems. However, the enabling CMOS technology scaling has been facing significant challenges of device uncertainties, such as process, voltage, and temperature variations. To ensure system reliability, worst-case corner assumptions are usually made in each design level. However, the over-pessimistic worst-case margin leads to unnecessary power waste and performance loss as high as 2.2x. Since optimizations are traditionally confined to each specific level, those safe margins can hardly be properly exploited. To tackle the challenge, it is therefore advised in this Ph.D. thesis to perform a cross-layer optimization for digital signal processing circuits and systems, to achieve a global balance of power consumption and output quality. To conclude, the traditional over-pessimistic worst-case approach leads to huge power waste. In contrast, the adaptive voltage scaling approach saves power (25% for the CORDIC application) by providing a just-needed supply voltage. The power saving is maximized (46% for CORDIC) when a more aggressive voltage over-scaling scheme is applied. These sparsely occurred circuit errors produced by aggressive voltage over-scaling are mitigated by higher level error resilient designs. For functions like FFT and CORDIC, smart error mitigation schemes were proposed to enhance reliability (soft-errors and timing-errors, respectively). Applications like Massive MIMO systems are robust against lower level errors, thanks to the intrinsically redundant antennas. This property makes it applicable to embrace digital hardware that trades quality for power savings.Comment: 190 page

    The HIPEAC vision for advanced computing in horizon 2020

    Get PDF

    Efficient and Linear CMOS Power Amplifier and Front-end Design for Broadband Fully-Integrated 28-GHz 5G Phased Arrays

    Get PDF
    Demand for data traffic on mobile networks is growing exponentially with time and on a global scale. The emerging fifth-generation (5G) wireless standard is being developed with millimeter-wave (mm-Wave) links as a key technological enabler to address this growth by a 2020 time frame. The wireless industry is currently racing to deploy mm-Wave mobile services, especially in the 28-GHz band. Previous widely-held perceptions of fundamental propagation limitations were overcome using phased arrays. Equally important for success of 5G is the development of low-power, broadband user equipment (UE) radios in commercial-grade technologies. This dissertation demonstrates design methodologies and circuit techniques to tackle the critical challenge of key phased array front-end circuits in low-cost complementary metal oxide semiconductor (CMOS) technology. Two power amplifier (PA) proof-of-concept prototypes are implemented in deeply scaled 28- nm and 40-nm CMOS processes, demonstrating state-of-the-art linearity and efficiency for extremely broadband communication signals. Subsequently, the 40 nm PA design is successfully embedded into a low-power fully-integrated transmit-receive front-end module. The 28 nm PA prototype in this dissertation is the first reported linear, bulk CMOS PA targeting low-power 5G mobile UE integrated phased array transceivers. An optimization methodology is presented to maximizing power added efficiency (PAE) in the PA output stage at a desired error vector magnitude (EVM) and range to address challenging 5G uplink requirements. Then, a source degeneration inductor in the optimized output stage is shown to further enable its embedding into a two-stage transformer-coupled PA. The inductor helps by broadening inter-stage impedance matching bandwidth, and helping to reduce distortion. Designed and fabricated in 1P7M 28 nm bulk CMOS and using a 1 V supply, the PA achieves +4.2 dBm/9% measured Pout/PAE at −25 dBc EVM for a 250 MHz-wide, 64-QAM orthogonal frequency division multiplexing (OFDM) signal with 9.6 dB peak-to-average power ratio (PAPR). The PA also achieves 35.5%/10% PAE for continuous wave signals at saturation/9.6dB back-off from saturation. To the best of the author’s knowledge, these are the highest measured PAE values among published K- and K a-band CMOS PAs to date. To drastically extend the communication bandwidth in 28 GHz-band UE devices, and to explore the potential of CMOS technology for more demanding access point (AP) devices, the second PA is demonstrated in a 40 nm process. This design supports a signal radio frequency bandwidth (RFBW) >3× the state-of-the-art without degrading output power (i.e. range), PAE (i.e. battery life), or EVM (i.e. amplifier fidelity). The three-stage PA uses higher-order, dual-resonance transformer matching networks with bandwidths optimized for wideband linearity. Digital gain control of 9 dB range is integrated for phased array operation. The gain control is a needed functionality, but it is largely absent from reported high-performance mm-Wave PAs in the literature. The PA is fabricated in a 1P6M 40 nm CMOS LP technology with 1.1 V supply, and achieves Pout/PAE of +6.7 dBm/11% for an 8×100 MHz carrier aggregation 64-QAM OFDM signal with 9.7 dB PAPR. This PA therefore is the first to demonstrate the viability of CMOS technology to address even the very challenging 5G AP/downlink signal bandwidth requirement. Finally, leveraging the developed PA design methodologies and circuits, a low power transmit-receive phased array front-end module is fully integrated in 40 nm technology. In transmit-mode, the front-end maintains the excellent performance of the 40 nm PA: achieving +5.5 dBm/9% for the same 8×100 MHz carrier aggregation signal above. In receive-mode, a 5.5 dB noise figure (NF) and a minimum third-order input intercept point (IIP₃) of −13 dBm are achieved. The performance of the implemented CMOS frontend is comparable to state-of-the-art publications and commercial products that were very recently developed in silicon germanium (SiGe) technologies for 5G communication

    Introducing 14-nm FinFET technology in Microwind

    Get PDF
    This paper describes the implementation of a high performance FinFET-based 14-nm CMOS Technology in Microwind. New concepts related to the design of FinFET and design for manufacturing are also described. The performances of a ring oscillator layout and a 6-transistor RAM memory layout are also analyzed.This paper describes the implementation of a high performance FinFET-based 14-nm CMOS Technology in Microwind. New concepts related to the design of FinFET and design for manufacturing are also described. The performances of a ring oscillator layout and a 6-transistor RAM memory layout are also analyzed

    A Survey on Low-Power Techniques with Emerging Technologies: From Devices to Systems

    Get PDF
    Nowadays, power consumption is one of the main limitations of electronic systems. In this context, novel and emerging devices provide us with new opportunities to keep the trend to low-power design. In this survey paper, we present a transversal survey on energy efficient techniques ranging from devices to architectures. The actual trends of device research, with fully-depleted planar devices, tri-gate geometries and gate-all-around structures, allows us to reach an increasingly higher level of performance while reducing the associated power. In addition, beyond the simple device properties enhancements, emerging devices also lead to innovations at circuit and architectural levels. In particular, devices whose properties can be tuned through additional terminals enable a fine and dynamic control of device threshold. They also enable designers to realize logic gates and to implement power-related techniques in a compact way unreachable to standard technologies. These innovations reduce the power consumption at the gate level and unlock new means of actuation in architectural solutions like adaptive voltage and frequency scaling

    Circuit Techniques for Low-Power and Secure Internet-of-Things Systems

    Full text link
    The coming of Internet of Things (IoT) is expected to connect the physical world to the cyber world through ubiquitous sensors, actuators and computers. The nature of these applications demand long battery life and strong data security. To connect billions of things in the world, the hardware platform for IoT systems must be optimized towards low power consumption, high energy efficiency and low cost. With these constraints, the security of IoT systems become a even more difficult problem compared to that of computer systems. A new holistic system design considering both hardware and software implementations is demanded to face these new challenges. In this work, highly robust and low-cost true random number generators (TRNGs) and physically unclonable functions (PUFs) are designed and implemented as security primitives for secret key management in IoT systems. They provide three critical functions for crypto systems including runtime secret key generation, secure key storage and lightweight device authentication. To achieve robustness and simplicity, the concept of frequency collapse in multi-mode oscillator is proposed, which can effectively amplify the desired random variable in CMOS devices (i.e. process variation or noise) and provide a runtime monitor of the output quality. A TRNG with self-tuning loop to achieve robust operation across -40 to 120 degree Celsius and 0.6 to 1V variations, a TRNG that can be fully synthesized with only standard cells and commercial placement and routing tools, and a PUF with runtime filtering to achieve robust authentication, are designed based upon this concept and verified in several CMOS technology nodes. In addition, a 2-transistor sub-threshold amplifier based "weak" PUF is also presented for chip identification and key storage. This PUF achieves state-of-the-art 1.65% native unstable bit, 1.5fJ per bit energy efficiency, and 3.16% flipping bits across -40 to 120 degree Celsius range at the same time, while occupying only 553 feature size square area in 180nm CMOS. Secondly, the potential security threats of hardware Trojan is investigated and a new Trojan attack using analog behavior of digital processors is proposed as the first stealthy and controllable fabrication-time hardware attack. Hardware Trojan is an emerging concern about globalization of semiconductor supply chain, which can result in catastrophic attacks that are extremely difficult to find and protect against. Hardware Trojans proposed in previous works are based on either design-time code injection to hardware description language or fabrication-time modification of processing steps. There have been defenses developed for both types of attacks. A third type of attack that combines the benefits of logical stealthy and controllability in design-time attacks and physical "invisibility" is proposed in this work that crosses the analog and digital domains. The attack eludes activation by a diverse set of benchmarks and evades known defenses. Lastly, in addition to security-related circuits, physical sensors are also studied as fundamental building blocks of IoT systems in this work. Temperature sensing is one of the most desired functions for a wide range of IoT applications. A sub-threshold oscillator based digital temperature sensor utilizing the exponential temperature dependence of sub-threshold current is proposed and implemented. In 180nm CMOS, it achieves 0.22/0.19K inaccuracy and 73mK noise-limited resolution with only 8865 square micrometer additional area and 75nW extra power consumption to an existing IoT system.PHDElectrical EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/138779/1/kaiyuan_1.pd
    • 

    corecore