60 research outputs found

    Cross-Layer Optimization for Power-Efficient and Robust Digital Circuits and Systems

    With the increasing demand for digital services, performance and power efficiency have become vital requirements for digital circuits and systems. However, the enabling CMOS technology scaling has been facing significant challenges from device uncertainties such as process, voltage, and temperature variations. To ensure system reliability, worst-case corner assumptions are usually made at each design level. This over-pessimistic worst-case margin, however, leads to unnecessary power waste and performance loss as high as 2.2x. Since optimizations are traditionally confined to each specific level, those safety margins can hardly be exploited properly. To tackle this challenge, this Ph.D. thesis advocates cross-layer optimization for digital signal processing circuits and systems, achieving a global balance between power consumption and output quality. In conclusion, the traditional over-pessimistic worst-case approach leads to large power waste. In contrast, the adaptive voltage scaling approach saves power (25% for the CORDIC application) by providing a just-needed supply voltage. The power saving is maximized (46% for CORDIC) when a more aggressive voltage over-scaling scheme is applied. The sparsely occurring circuit errors produced by aggressive voltage over-scaling are mitigated by higher-level error-resilient designs. For functions like FFT and CORDIC, smart error mitigation schemes were proposed to enhance reliability against soft errors and timing errors, respectively. Applications like massive MIMO systems are robust against lower-level errors thanks to their intrinsically redundant antennas, which makes them well suited to digital hardware that trades quality for power savings. Comment: 190 pages
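
    As a concrete illustration of the kind of iterative DSP kernel discussed in this abstract, the following minimal Python sketch models a fixed-point CORDIC rotation computing sine and cosine. The bit width, iteration count, and starting constant are illustrative assumptions rather than parameters taken from the thesis; the sketch only shows the class of computation whose occasional voltage-over-scaling errors can be tolerated or corrected at a higher level.

        import math

        def cordic_sin_cos(angle_rad, iterations=16, frac_bits=14):
            """Fixed-point CORDIC rotation returning (cos, sin) of angle_rad.

            Illustrative only: bit width and iteration count are assumptions,
            not values from the thesis.
            """
            one = 1 << frac_bits
            # Pre-computed arctangent table and CORDIC gain compensation.
            atan_table = [math.atan(2.0 ** -i) for i in range(iterations)]
            gain = 1.0
            for i in range(iterations):
                gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
            x = int(one / gain)          # start at 1/K so the result is unscaled
            y = 0
            z = angle_rad
            for i in range(iterations):
                d = 1 if z >= 0 else -1  # rotate towards the target angle
                x, y = x - d * (y >> i), y + d * (x >> i)
                z -= d * atan_table[i]
            return x / one, y / one      # (cos, sin)

        # Example: compare against the math library.
        c, s = cordic_sin_cos(0.5)
        print(c, math.cos(0.5), s, math.sin(0.5))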

    Voltage stacking for near/sub-threshold operation


    A robust ultra-low voltage CPU utilizing timing-error prevention

    To minimize the energy consumption of a digital circuit, logic can be operated at sub- or near-threshold voltage. Operation in this region is challenging due to device and environmental variations, and the resulting performance may not be adequate for all applications. This article presents two variants of a 32-bit RISC CPU targeted at near-threshold voltage. Both CPUs are placed on the same die and manufactured in a 28 nm CMOS process. They employ timing-error prevention with clock stretching to enable operation with minimal safety margins while maximizing performance and energy efficiency at a given operating point. Measurements show a minimum energy of 3.15 pJ/cycle at 400 mV, which corresponds to a 39% energy saving compared to operation based on static signoff timing.
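
    The benefit of timing-error prevention can be illustrated with a small behavioural model. The Python sketch below is a rough Monte Carlo comparison with assumed delay numbers, not the article's method or measurements: it contrasts a clock set for a static worst-case margin with an aggressive clock that is stretched only on the rare cycles where a critical path would otherwise miss its deadline.

        import random

        random.seed(0)

        # Assumed, illustrative parameters (not from the article): nominal
        # critical-path delay of 10 ns with 1 ns sigma, signoff at +4 sigma.
        NOMINAL_DELAY_NS = 10.0
        SIGMA_NS = 1.0
        CYCLES = 100_000

        worst_case_period = NOMINAL_DELAY_NS + 4 * SIGMA_NS   # static signoff clock
        adaptive_period = NOMINAL_DELAY_NS + 0.5 * SIGMA_NS   # aggressive clock + prevention

        static_time = CYCLES * worst_case_period

        adaptive_time = 0.0
        stretched = 0
        for _ in range(CYCLES):
            path_delay = random.gauss(NOMINAL_DELAY_NS, SIGMA_NS)
            if path_delay > adaptive_period:
                # A timing-error-prevention monitor flags this cycle early and
                # stretches the clock so the late data is still captured correctly.
                adaptive_time += path_delay
                stretched += 1
            else:
                adaptive_time += adaptive_period

        print(f"stretched cycles: {100 * stretched / CYCLES:.1f}%")
        print(f"runtime vs worst-case clock: {100 * adaptive_time / static_time:.1f}%")

    In silicon, the margin recovered this way is typically traded for a lower supply voltage rather than a faster clock, which is where an energy saving such as the reported 39% comes from.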

    A Survey of Recent Developments in Testability, Safety and Security of RISC-V Processors

    With the continued success of the open RISC-V architecture, practical deployment of RISC-V processors necessitates an in-depth consideration of their testability, safety, and security aspects. This survey provides an overview of recent developments in this quickly evolving field. We start by discussing the application of state-of-the-art functional and system-level test solutions to RISC-V processors. Then, we discuss the use of RISC-V processors for safety-related applications; to this end, we outline the essential techniques necessary to achieve safety in both the functional and the timing domain and review recent processor designs with safety features. Finally, we survey the different aspects of security with respect to RISC-V implementations and discuss the relationship between cryptographic protocols and primitives on the one hand and the RISC-V processor architecture and hardware implementation on the other. We also comment on the role of a RISC-V processor in system security and its resilience against side-channel attacks.

    A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

    Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each optimization technique used. This complicates the evaluation of optimizations for new accelerator designs and slows down research progress. This work provides a survey of neural network accelerator optimization approaches used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing designers to assess the design choices for each building block separately. Reported optimizations range from up to 10,000x memory savings to 33x energy reductions, providing chip designers with an overview of design choices for implementing efficient low-power neural network accelerators.
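
    As a purely illustrative example of how such construction-kit factors compose (the numbers below are assumptions, not figures from the survey), the memory footprint of a weight tensor shrinks multiplicatively when quantization and pruning are combined:

        # Hypothetical example: combining two optimizations from a construction kit.
        # The specific factors are assumptions for illustration, not survey results.
        weights = 2_000_000            # parameter count of an assumed edge model
        fp32_bytes = weights * 4       # 32-bit floating-point baseline

        quant_factor = 4               # FP32 -> INT8 quantization: 4x smaller
        sparsity = 0.75                # pruning removes 75% of the weights

        int8_bytes = fp32_bytes / quant_factor
        pruned_bytes = int8_bytes * (1 - sparsity)

        print(f"baseline : {fp32_bytes / 1e6:.1f} MB")
        print(f"quantized: {int8_bytes / 1e6:.1f} MB  (4x saving)")
        print(f"+ pruned : {pruned_bytes / 1e6:.1f} MB  ({fp32_bytes / pruned_bytes:.0f}x overall)")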

    Vega: A Ten-Core SoC for IoT Endnodes with DNN Acceleration and Cognitive Wake-Up from MRAM-Based State-Retentive Sleep Mode

    The Internet-of-Things (IoT) requires endnodes with ultra-low-power always-on capability for a long battery lifetime, as well as high performance, energy efficiency, and extreme flexibility to deal with complex and fast-evolving near-sensor analytics algorithms (NSAAs). We present Vega, an IoT endnode system on chip (SoC) capable of scaling from a 1.7-µW fully retentive cognitive sleep mode up to 32.2-GOPS (at 49.4 mW) peak performance on NSAAs, including mobile deep neural network (DNN) inference, exploiting 1.6 MB of state-retentive SRAM, and 4 MB of non-volatile magnetoresistive random access memory (MRAM). To meet the performance and flexibility requirements of NSAAs, the SoC features ten RISC-V cores: one core for SoC and IO management and a nine-core cluster supporting multi-precision single instruction multiple data (SIMD) integer and floating-point (FP) computation. Vega achieves the state-of-the-art (SoA)-leading efficiency of 615 GOPS/W on 8-bit INT computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware acceleration). On FP computation, it achieves the SoA-leading efficiency of 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively. Two programmable machine learning (ML) accelerators boost energy efficiency in cognitive sleep and active states.
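
    As a quick sanity check on how the quoted figures relate (simple arithmetic, not a claim from the paper), efficiency in GOPS/W is throughput divided by power:

        # Relation between the quoted numbers (arithmetic only, not from the paper).
        peak_gops = 32.2
        peak_power_w = 49.4e-3
        print(peak_gops / peak_power_w)   # ~652 GOPS/W at the peak-performance point
        # The 615 GOPS/W and 1.3 TOPS/W figures are the paper's own efficiency numbers,
        # presumably quoted at their respective best-efficiency operating points.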

    Siwa: A custom RISC-V based system on chip (SOC) for low power medical applications

    This work introduces the development of Siwa, a RISC-V RV32I 32-bit core intended as a flexible control platform for highly integrated implantable biomedical applications, implemented in a commercial 0.18 µm high-voltage (HV) CMOS technology. Simulations show that Siwa can outperform commercial microcontrollers commonly used in the medical industry as control units for implantable devices, with energy requirements below 50 pJ per clock cycle. Agencia Nacional de Investigación e Innovación.
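
    To put the sub-50 pJ/cycle figure in perspective, average power is energy per cycle times clock frequency; the clock rate in the sketch below is an assumed illustrative value, not one reported for Siwa.

        # Power implied by an energy-per-cycle budget (illustrative clock frequency).
        energy_per_cycle_j = 50e-12      # 50 pJ/cycle upper bound from the abstract
        clock_hz = 1e6                   # assumed 1 MHz control-task clock, not from the paper
        print(energy_per_cycle_j * clock_hz * 1e6, "uW")   # -> 50.0 uW average power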

    Study of Radiation Effects on 28nm UTBB FDSOI Technology

    With the evolution of modern Complementary Metal-Oxide-Semiconductor (CMOS) technology, transistor feature size has been scaled down to nanometers. The scaling has brought tremendous advantages to integrated circuits (ICs), such as higher speed, smaller circuit size, and lower operating voltage. However, it also creates reliability concerns. In particular, small device dimensions and low operating voltages have made nanoscale ICs highly sensitive to operational disturbances, such as signal coupling, supply and substrate noise, and single event effects (SEEs) caused by ionizing particles like cosmic neutrons and alpha particles. SEEs in ICs can introduce transient pulses in circuit nodes or data upsets in storage cells. In well-designed ICs, SEEs are most troublesome in a space environment or at high altitudes in the terrestrial environment. Techniques from the manufacturing-process level up to the system design level have been developed to mitigate radiation effects. Among them, silicon-on-insulator (SOI) technologies have proven to be an effective approach to reducing single-event effects in ICs. So far, the 28nm ultra-thin body and buried oxide (UTBB) Fully Depleted SOI (FDSOI) process by STMicroelectronics is one of the most advanced SOI technologies in commercial applications. Its resilience to radiation effects has not been fully explored and is of prevalent interest in the radiation effects community. Therefore, two test chips, namely ST1 and AR0, were designed and tested to study SEEs in logic circuits fabricated in this technology. The ST1 test chip was designed to evaluate SET pulse widths in logic gates. Three kinds of on-chip pulse-width measurement detectors, namely the Vernier detector, the Pulse Capture detector, and the Pulse Filter detector, were implemented in the ST1 chip. Moreover, a Circuit for Radiation Effects Self-Test (CREST) chain with combinational logic was designed to study both SET and SEU effects. The ST1 chip was tested with a heavy-ion irradiation beam at the Radiation Effects Facility (RADEF) in Finland. The experimental results showed that the cross-section of the 28nm UTBB-FDSOI technology is two orders of magnitude lower than that of its bulk competitors. Laser tests were also applied to this chip to investigate pulse distortion effects and the relationship between SETs, SEUs, and clock frequency. Total Ionizing Dose experiments were carried out at the University of Saskatchewan and the European Space Agency with Co-60 gamma-cell radiation sources. The test results showed that devices implemented in the 28nm UTBB-FDSOI technology can maintain their functionality up to 1 Mrad(Si). In the AR0 chip, we designed five ARM Cortex-M0 cores with different logic protection levels to investigate the performance of approximate logic protection methods. There are three custom-designed SRAM blocks in the test chip, which can also be used to measure the SEU rate. From the simulation results, we concluded that the approximate logic methodology can protect digital logic efficiently. This research comprehensively evaluates radiation effects in the 28nm UTBB-FDSOI technology, providing a baseline for later radiation-hardened system designs in this technology.
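
    The cross-section comparison above relies on the standard single-event cross-section definition, sigma = observed events / particle fluence (normalised per bit when divided by the number of sensitive cells). The sketch below shows that calculation with invented counts, purely to illustrate the metric; it is not data from the ST1 or AR0 experiments.

        # Standard per-bit SEE cross-section: sigma_bit = N_events / (fluence * N_bits).
        # All numbers below are invented for illustration, not ST1/AR0 measurements.
        n_events = 120                 # upsets observed during the run
        fluence_cm2 = 1e7              # ions per cm^2 delivered by the beam
        n_bits = 65536                 # bits in the monitored storage array

        sigma_device = n_events / fluence_cm2            # cm^2 per device
        sigma_bit = n_events / (fluence_cm2 * n_bits)    # cm^2 per bit

        print(f"device cross-section: {sigma_device:.2e} cm^2")
        print(f"per-bit cross-section: {sigma_bit:.2e} cm^2/bit")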

    Null convention logic circuits for asynchronous computer architecture

    For most of its history, computer architecture has been able to benefit from rapid scaling in semiconductor technology, resulting in continuous improvements to CPU design. During that period, synchronous logic has dominated because of its inherent ease of design and abundant tooling. However, with the scaling of semiconductor processes into deep sub-micron and then nano-scale dimensions, computer architecture is hitting a number of roadblocks, such as high power and increased process variability. Asynchronous techniques can potentially offer many advantages over conventional synchronous design, including average-case rather than worst-case performance, robustness in the face of process and operating-point variability, and the ready availability of high-performance, fine-grained pipeline architectures. Of the many alternative approaches to asynchronous design, Null Convention Logic (NCL) has the advantage that its quasi-delay-insensitive behavior makes it relatively easy to build complex circuits without the need for exhaustive timing analysis. This thesis examines the characteristics of an NCL-based asynchronous RISC-V CPU and analyses the problems of applying NCL to CPU design. While a number of university and industry groups have previously developed small 8-bit microprocessor architectures using NCL techniques, it is still unclear whether these offer any real advantages over conventional synchronous design. A key objective of this work has been to analyse the impact of larger word widths and more complex architectures on NCL CPU implementations. The research commenced by re-evaluating existing techniques for implementing NCL on programmable devices such as FPGAs. The little work previously undertaken on FPGA implementations of asynchronous logic has been inconclusive and seems to indicate that asynchronous systems cannot be easily implemented in these devices. However, most of this work related to an alternative technique called bundled data, which is not well suited to FPGA implementation because of the difficulty of controlling and matching delays in a 'bundle' of signals. This thesis, on the other hand, clearly shows that such applications are not only possible with NCL, but that there are distinct advantages in being able to prototype complex asynchronous systems in a field-programmable technology such as the FPGA. A large part of the value of NCL derives from its architectural-level behavior, inherent pipelining, and optimization opportunities such as the merging of register and combinational logic functions. In this work, a number of NCL multiplier architectures have been analyzed to reveal the performance trade-offs between various non-pipelined, 1D, and 2D organizations. Two-dimensional pipelining can easily be applied to regular architectures such as array multipliers in a way that is both high-performance and area-efficient. It was found that 2D pipelining for small networks such as multipliers is around 260% faster than the equivalent non-pipelined design. However, the design uses 265% more transistors, so the methodology is mainly of benefit where performance is strongly favored over area. A pipelined 32-bit × 32-bit signed Baugh-Wooley multiplier with Wallace-tree carry-save adders (CSA), which is representative of a real design used in CPUs and DSPs, was used to explore this concept further, as it is faster and has fewer pipeline stages than the normal array multiplier using ripple-carry adders (RCA). It was found that 1D pipelining with ripple-carry chains is an efficient implementation option but becomes less so for larger multipliers, due to the completion logic, whose delay depends largely on the number of bits involved in the completion network. The average-case performance of ripple-carry adders was explored using random input vectors; it offers little advantage for the smaller multiplier blocks, but this particular timing characteristic of asynchronous design styles becomes increasingly important as word size grows. Finally, this research has resulted in the development of the first 32-bit asynchronous RISC-V CPU core. Called the Redback RISC, the architecture is a structure of pipeline rings composed of computational oscillations linked by flow-completeness relationships. It has been written using NELL, a commercial description/synthesis tool that outputs standard Verilog. The Redback has been analysed and compared to two approximately equivalent industry-standard 32-bit synchronous RISC-V cores (PicoRV32 and Rocket) that have already been fabricated and used in industry. While the NCL implementation is larger than both commercial cores, it has similar performance and lower power than the PicoRV32. The implementation results were also compared against an existing NCL design tool flow (UNCLE), which showed how much the results of these implementation strategies differ. The Redback RISC achieved a similar level of throughput with 43% lower power and 34% lower energy than one of the synchronous cores under the same benchmark and test conditions, such as input supply voltage. However, it was shown that area is the biggest drawback for NCL CPU design: the core is roughly 2.5× larger than the synchronous designs. On the other hand, its area is still 2.9× smaller than previous designs using the UNCLE tools. The area penalty is largely due to the unavoidable translation into a dual-rail topology when using the standard NCL cell library.
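
    The dual-rail encoding responsible for the area penalty noted above can be illustrated with a small behavioural model. In NCL-style dual-rail signalling each bit uses two wires, with (0,0) meaning NULL, (1,0) a DATA 1, and (0,1) a DATA 0; a word is complete only when every bit has reached a DATA state. The Python sketch below models that encoding and a completion detector; it is a behavioural illustration of the convention, not the Redback RISC design or the NELL/UNCLE flows.

        # Behavioural model of NCL dual-rail encoding and completion detection.
        # Each bit is a (rail1, rail0) pair: (0,0)=NULL, (1,0)=DATA1, (0,1)=DATA0.
        NULL = (0, 0)

        def encode(value, width):
            """Encode an integer as a dual-rail DATA wavefront, LSB first."""
            return [((value >> i) & 1, 1 - ((value >> i) & 1)) for i in range(width)]

        def null_wavefront(width):
            """The NULL wavefront that must separate consecutive DATA wavefronts."""
            return [NULL] * width

        def is_complete(word):
            """Completion detection: every bit is DATA (exactly one rail asserted)."""
            return all(r1 ^ r0 for r1, r0 in word)

        def decode(word):
            assert is_complete(word), "word still contains NULL bits"
            return sum(r1 << i for i, (r1, _r0) in enumerate(word))

        data = encode(0xA5, 8)
        print(is_complete(data), hex(decode(data)))    # True 0xa5
        partial = data[:4] + null_wavefront(4)         # wavefront still propagating
        print(is_complete(partial))                    # False: handshake must wait

    Every logical bit therefore costs two wires plus the completion circuitry that detects when all bits have left NULL, which is consistent with the roughly 2.5× area overhead reported for the core above.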