Search CORE

210 research outputs found

A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

Author: Azarkhish Erfan
Benini Luca
Bonetti Andrea
Emery Stephane
Jokic Petar
Pons Marc
Publication venue
Publication date: 24/06/2021
Field of study

Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each utilized optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing-down the research progress. This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing to assess the design choices for each building block separately. Reported optimizations range from up to 10'000x memory savings to 33x energy reductions, providing chip designers an overview of design choices for implementing efficient low power neural network accelerators

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Recommended from our members

Active timing margin management to improve microprocessor power efficiency

Author: Zu Yazhou
Publication venue
Publication date: 11/04/2019
Field of study

Improving power/performance efficiency is critical for today’s micro- processors. From edge devices to datacenters, lower power or higher performance always produces better systems, measured by lower cost of ownership or longer battery time. This thesis studies improving microprocessor power/performance efficiency by optimizing the pipeline timing margin. In particular, this thesis focuses on improving the efficacy of Active Timing Margin, a young technology that dynamically adjusts the margin. Active timing margin trims down the pipeline timing margin with a control loop that adjusts voltage and frequency based on real-time chip environment monitoring. The key insight of this thesis is that in order to maximize active timing margin’s efficiency enhancement benefits, synergistic management from processor architecture design and system software scheduling are needed. To that end, this thesis covers the major consumers of pipeline timing margin, including temperature, voltage, and process variation. For temperature variation, the thesis proposes a table-lookup based active timing margin mechanism, and an associated temperature management scheme to minimize power consumption. For voltage variation, the thesis characterizes the limiting factors of adaptive clocking’s power saving and proposes application scheduling to maximize total system power reduction. For process variation, the thesis proposes core-level adaptive clocking reconfiguration to automatically expose inter-core variation and discusses workload scheduling and throttling management to control critical application performance. The author believes the optimization presented in this thesis can potentially benefit a variety of processor architectures as the conclusions are based on the solid measurement on state-of-the-art processors, and the research objective, active timing margin, already has wide applicability in the latest microprocessors by the time this thesis is written.Electrical and Computer Engineerin

Texas ScholarWorks

Synthesis of all-digital delay lines

Author: Cortadella Jordi
Moreno Vega Alberto
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other worksThe synthesis of delay lines (DLs) is a core task during the generation of matched delays, ring oscillator clocks or delay monitors. The main figure of merit of a DL is the fidelity to track variability. Unfortunately, complex systems have a great diversity of timing paths that exhibit different sensitivities to static and dynamic variations. Designing DLs that capture this diversity is an ardous task. This paper proposes an algorithmic approach for the synthesis of DLs that can be integrated in a conventional design flow. The algorithm uses heuristics to perform a combinatorial search in a vast space of solutions that combine different types of gates and wire lengths. The synthesized DLs are (1) all digital, i.e., built of conventional standard cells, (2) accurate in tracking variability and (3) configurable at runtime. Experimental results with a commercial standard cell library confirm the quality of the DLs that only exhibit delay mismatches of about 1% on average over all PVT corners.Peer ReviewedPostprint (author's final draft

Crossref

UPCommons. Portal del coneixement obert de la UPC

Circuits and Systems Advances in Near Threshold Computing

Author
Publication venue: 'MDPI AG'
Publication date: 11/01/2022
Field of study

Modern society is witnessing a sea change in ubiquitous computing, in which people have embraced computing systems as an indispensable part of day-to-day existence. Computation, storage, and communication abilities of smartphones, for example, have undergone monumental changes over the past decade. However, global emphasis on creating and sustaining green environments is leading to a rapid and ongoing proliferation of edge computing systems and applications. As a broad spectrum of healthcare, home, and transport applications shift to the edge of the network, near-threshold computing (NTC) is emerging as one of the promising low-power computing platforms. An NTC device sets its supply voltage close to its threshold voltage, dramatically reducing the energy consumption. Despite showing substantial promise in terms of energy efficiency, NTC is yet to see widescale commercial adoption. This is because circuits and systems operating with NTC suffer from several problems, including increased sensitivity to process variation, reliability problems, performance degradation, and security vulnerabilities, to name a few. To realize its potential, we need designs, techniques, and solutions to overcome these challenges associated with NTC circuits and systems. The readers of this book will be able to familiarize themselves with recent advances in electronics systems, focusing on near-threshold computing

Directory of Open Access Books (DOAB)

Predicting power scalability in a reconfigurable platform

Author: Beckett P
Publication venue: RMIT University
Publication date: 01/01/2007
Field of study

This thesis focuses on the evolution of digital hardware systems. A reconfigurable platform is proposed and analysed based on thin-body, fully-depleted silicon-on-insulator Schottky-barrier transistors with metal gates and silicide source/drain (TBFDSBSOI). These offer the potential for simplified processing that will allow them to reach ultimate nanoscale gate dimensions. Technology CAD was used to show that the threshold voltage in TBFDSBSOI devices will be controllable by gate potentials that scale down with the channel dimensions while remaining within appropriate gate reliability limits. SPICE simulations determined that the magnitude of the threshold shift predicted by TCAD software would be sufficient to control the logic configuration of a simple, regular array of these TBFDSBSOI transistors as well as to constrain its overall subthreshold power growth. Using these devices, a reconfigurable platform is proposed based on a regular 6-input, 6-output NOR LUT block in which the logic and configuration functions of the array are mapped onto separate gates of the double-gate device. A new analytic model of the relationship between power (P), area (A) and performance (T) has been developed based on a simple VLSI complexity metric of the form ATσ = constant. As σ defines the performance “return” gained as a result of an increase in area, it also represents a bound on the architectural options available in power-scalable digital systems. This analytic model was used to determine that simple computing functions mapped to the reconfigurable platform will exhibit continuous power-area-performance scaling behavior. A number of simple arithmetic circuits were mapped to the array and their delay and subthreshold leakage analysed over a representative range of supply and threshold voltages, thus determining a worse-case range for the device/circuit-level parameters of the model. Finally, an architectural simulation was built in VHDL-AMS. The frequency scaling described by σ, combined with the device/circuit-level parameters predicts the overall power and performance scaling of parallel architectures mapped to the array

RMIT Research Repository

Design of variability compensation architectures of digital circuits with adaptive body bias

Author: Sanmukh Anand Siddharudh
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2016
Field of study

The most critical concern in circuit is to achieve high level of performance with very tight power constraint. As the high performance circuits moved beyond 45nm technology one of the major issues is the parameter variation i.e. deviation in process, temperature and voltage (PVT) values from nominal specifications. A key process parameter subject to variation is the transistor threshold voltage (Vth) which impacts two important parameters: frequency and leakage power. Although the degradation can be compensated by the worstcase scenario based over-design approach, it induces remarkable power and performance overhead which is undesirable in tightly constrained designs. Dynamic voltage scaling (DVS) is a more power efficient approach, however its coarse granularity implies difficulty in handling fine grained variations. These factors have contributed to the growing interest in power aware robust circuit design. We propose a variability compensation architecture with adaptive body bias, for low power applications using 28nm FDSOI technology. The basic approach is based on a dynamic prediction and prevention of possible circuit timing errors. In our proposal we are using a Canary logic technique that enables the typical-case design. The body bias generation is based on a DLL type method which uses an external reference generator and voltage controlled delay line (VCDL) to generate the forward body bias (FBB) control signals. The adaptive technique is used for dynamic detection and correction of path failures in digital designs due to PVT variations. Instead of tuning the supply voltage, the key idea of the design approach is to tune the body bias voltage bymonitoring the error rate during operation. The FBB increases operating speed with an overhead in leakage power

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

An Ultra-Low Cost and Multicast-Enabled Asynchronous NoC for Neuromorphic Edge Computing

Author: Coffen Marcolin Demetra
Indiveri Giacomo
Krstic Milos
Ramini Simone
Su Zhe
Veronesi Alessandro
Publication venue
Publication date: 25/07/2024
Field of study

Biological brains are increasingly taken as a guide toward more efficient forms of computing. The latest frontier considers the use of spiking neural-network-based neuromorphic processors for near-sensor data processing, in order to fit the tight power and resource budgets of edge computing devices. However, a prevailing focus on brain-inspired computing and storage primitives in the design of neuromorphic systems is currently bringing a fundamental bottleneck to the forefront: chip-scale communications. While communication architectures (typically, a network-on-chip) are generally inspired by, or even borrowed from, general purpose computing, neuromorphic communications exhibit unique characteristics: they consist of the event-driven routing of small amounts of information to a large number of destinations within tight area and power budgets. This article aims at an inflection point in network-on-chip design for brain-inspired communications, revolving around the combination of cost-effective and robust asynchronous design, architecture specialization for short messaging and lightweight hardware support for tree-based multicast. When validated with functional spiking neural network traffic, the proposed NoC delivers energy savings ranging from 42% to 71% over a state-of-the-art NoC used in a real multi-core neuromorphic processor for edge computing applications.<br/

The University of Manchester - Institutional Repository

Recommended from our members

Network-on-Chip Synchronization

Author: Buckler Mark
Publication venue: ScholarWorks@UMass Amherst
Publication date: 07/11/2014
Field of study

Technology scaling has enabled the number of cores within a System on Chip (SoC) to increase significantly. Globally Asynchronous Locally Synchronous (GALS) systems using Dynamic Voltage and Frequency Scaling (DVFS) operate each of these cores on distinct and dynamic clock domains. The main communication method between these cores is increasingly more likely to be a Network-on-Chip (NoC). Typically, the interfaces between these clock domains experience multi-cycle synchronization latencies due to their use of “brute-force” synchronizers. This dissertation aims to improve the performance of NoCs and thereby SoCs as a whole by reducing this synchronization latency. First, a survey of NoC improvement techniques is presented. One such improvement technique: a multi-layer NoC, has been successfully simulated. Given how one of the most commonly used techniques is DVFS, a thorough analysis and simulation of brute-force synchronizer circuits in both current and future process technologies is presented. Unfortunately, a multi-cycle latency is unavoidable when using brute-force synchronizers, so predictive synchronizers which require only a single cycle of latency have been proposed. To demonstrate the impact of these predictive synchronizer circuits at a high level, multi-core system simulations incorporating these circuits have been completed. Multiple forms of GALS NoC configurations have been simulated, including multi-synchronous, NoC-synchronous, and single-synchronizer. Speedup on the SPLASH benchmark suite was measured to directly quantify the performance benefit of predictive synchronizers in a full system. Additionally, Mean Time Between Failures (MTBF) has been calculated for each NoC synchronizer configuration to determine the reliability benefit possible when using predictive synchronizers

ScholarWorks@UMass Amherst

Fault and Defect Tolerant Computer Architectures: Reliable Computing With Unreliable Devices

Author: Roelke George R., IV
Publication venue: AFIT Scholar
Publication date: 01/01/2006
Field of study

This research addresses design of a reliable computer from unreliable device technologies. A system architecture is developed for a fault and defect tolerant (FDT) computer. Trade-offs between different techniques are studied and yield and hardware cost models are developed. Fault and defect tolerant designs are created for the processor and the cache memory. Simulation results for the content-addressable memory (CAM)-based cache show 90% yield with device failure probabilities of 3 x 10(-6), three orders of magnitude better than non fault tolerant caches of the same size. The entire processor achieves 70% yield with device failure probabilities exceeding 10(-6). The required hardware redundancy is approximately 15 times that of a non-fault tolerant design. While larger than current FT designs, this architecture allows the use of devices much more likely to fail than silicon CMOS. As part of model development, an improved model is derived for NAND Multiplexing. The model is the first accurate model for small and medium amounts of redundancy. Previous models are extended to account for dependence between the inputs and produce more accurate results

AFTI Scholar (Air Force Institute of Technology)

CiteSeerX

Multiple-Independent-Gate Field-Effect Transistors for High Computational Density and Low Power Consumption

Author: Zhang Jian
Publication venue: Lausanne, EPFL
Publication date: 30/12/2015
Field of study

Transistors are the fundamental elements in Integrated Circuits (IC). The development of transistors significantly improves the circuit performance. Numerous technology innovations have been adopted to maintain the continuous scaling down of transistors. With all these innovations and efforts, the transistor size is approaching the natural limitations of materials in the near future. The circuits are expected to compute in a more efficient way. From this perspective, new device concepts are desirable to exploit additional functionality. On the other hand, with the continuously increased device density on the chips, reducing the power consumption has become a key concern in IC design. To overcome the limitations of Complementary Metal-Oxide-Semiconductor (CMOS) technology in computing efficiency and power reduction, this thesis introduces the multiple- independent-gate Field-Effect Transistors (FETs) with silicon nanowires and FinFET structures. The device not only has the capability of polarity control, but also provides dual-threshold- voltage and steep-subthreshold-slope operations for power reduction in circuit design. By independently modulating the Schottky junctions between metallic source/drain and semiconductor channel, the dual-threshold-voltage characteristics with controllable polarity are achieved in a single device. This property is demonstrated in both experiments and simulations. Thanks to the compact implementation of logic functions, circuit-level benchmarking shows promising performance with a configurable dual-threshold-voltage physical design, which is suitable for low-power applications. This thesis also experimentally demonstrates the steep-subthreshold-slope operation in the multiple-independent-gate FETs. Based on a positive feedback induced by weak impact ionization, the measured characteristics of the device achieve a steep subthreshold slope of 6 mV/dec over 5 decades of current. High Ion/Ioff ratio and low leakage current are also simultaneously obtained with a good reliability. Based on a physical analysis of the device operation, feasible improvements are suggested to further enhance the performance. A physics-based surface potential and drain current model is also derived for the polarity-controllable Silicon Nanowire FETs (SiNWFETs). By solving the carrier transport at Schottky junctions and in the channel, the core model captures the operation with independent gate control. It can serve as the core framework for developing a complete compact model by integrating advanced physical effects. To summarize, multiple-independent-gate SiNWFETs and FinFETs are extensively studied in terms of fabrication, modeling, and simulation. The proposed device concept expands the family of polarity-controllable FETs. In addition to the enhanced logic functionality, the polarity-controllable SiNWFETs and FinFETs with the dual-threshold-voltage and steep-subthreshold-slope operation can be promising candidates for future IC design towards low-power applications

Infoscience - École polytechnique fédérale de Lausanne