348 research outputs found
MoRS: An approximate fault modelling framework for reduced-voltage SRAMs
On-chip memory (usually based on Static RAMs-SRAMs) are crucial components for various computing devices including heterogeneous devices, e.g, GPUs, FPGAs, ASICs to achieve high performance. Modern workloads such as Deep Neural Networks (DNNs) running on these heterogeneous fabrics are highly dependent on the on-chip memory architecture for efficient acceleration. Hence, improving the energy-efficiency of such memories directly leads to an efficient system. One of the common methods to save energy is undervolting i.e., supply voltage underscaling below the nominal level. Such systems can be safely undervolted without incurring faults down to a certain voltage limit. This safe range is also called voltage guardband. However, reducing voltage below the guardband level without decreasing frequency causes timing-based faults. In this paper, we propose MoRS, a framework that generates the first approximate undervolting fault model using real faults extracted from experimental undervolting studies on SRAMs to build the model. We inject the faults generated by MoRS into the on-chip memory of the DNN accelerator to evaluate the resilience of the system under the test. MoRS has the advantage of simplicity without any need for high-time overhead experiments while being accurate enough in comparison to a fully randomly-generated fault injection approach. We evaluate our experiment in popular DNN workloads by mapping weights to SRAMs and measure the accuracy difference between the output of the MoRS and the real data. Our results show that the maximum difference between real fault data and the output fault model of MoRS is 6.21%, whereas the maximum difference between real data and random fault injection model is 23.2%. In terms of average proximity to the real data, the output of MoRS outperforms the random fault injection approach by 3.21x.This work is partially funded by Open Transprecision Computing (OPRECOM) project, Summer of Code 2020.Peer ReviewedPostprint (author's final draft
Always-On 674uW @ 4GOP/s Error Resilient Binary Neural Networks with Aggressive SRAM Voltage Scaling on a 22nm IoT End-Node
Binary Neural Networks (BNNs) have been shown to be robust to random
bit-level noise, making aggressive voltage scaling attractive as a power-saving
technique for both logic and SRAMs. In this work, we introduce the first fully
programmable IoT end-node system-on-chip (SoC) capable of executing
software-defined, hardware-accelerated BNNs at ultra-low voltage. Our SoC
exploits a hybrid memory scheme where error-vulnerable SRAMs are complemented
by reliable standard-cell memories to safely store critical data under
aggressive voltage scaling. On a prototype in 22nm FDX technology, we
demonstrate that both the logic and SRAM voltage can be dropped to 0.5Vwithout
any accuracy penalty on a BNN trained for the CIFAR-10 dataset, improving
energy efficiency by 2.2X w.r.t. nominal conditions. Furthermore, we show that
the supply voltage can be dropped to 0.42V (50% of nominal) while keeping more
than99% of the nominal accuracy (with a bit error rate ~1/1000). In this
operating point, our prototype performs 4Gop/s (15.4Inference/s on the CIFAR-10
dataset) by computing up to 13binary ops per pJ, achieving 22.8 Inference/s/mW
while keeping within a peak power envelope of 674uW - low enough to enable
always-on operation in ultra-low power smart cameras, long-lifetime
environmental sensors, and insect-sized pico-drones.Comment: Submitted to ISICAS2020 journal special issu
MoRS: An approximate fault modelling framework for reduced-voltage SRAMs
On-chip memory (usually based on Static RAMs-SRAMs) are crucial components for various computing devices including heterogeneous devices, e.g, GPUs, FPGAs, ASICs to achieve high performance. Modern workloads such as Deep Neural Networks (DNNs) running on these heterogeneous fabrics are highly dependent on the on-chip memory architecture for efficient acceleration. Hence, improving the energy-efficiency of such memories directly leads to an efficient system. One of the common methods to save energy is undervolting i.e., supply voltage underscaling below the nominal level. Such systems can be safely undervolted without incurring faults down to a certain voltage limit. This safe range is also called voltage guardband. However, reducing voltage below the guardband level without decreasing frequency causes timing-based faults. In this paper, we propose MoRS, a framework that generates the first approximate undervolting fault model using real faults extracted from experimental undervolting studies on SRAMs to build the model. We inject the faults generated by MoRS into the on-chip memory of the DNN accelerator to evaluate the resilience of the system under the test. MoRS has the advantage of simplicity without any need for high-time overhead experiments while being accurate enough in comparison to a fully randomly-generated fault injection approach. We evaluate our experiment in popular DNN workloads by mapping weights to SRAMs and measure the accuracy difference between the output of the MoRS and the real data. Our results show that the maximum difference between real fault data and the output fault model of MoRS is 6.21%, whereas the maximum difference between real data and random fault injection model is 23.2%. In terms of average proximity to the real data, the output of MoRS outperforms the random fault injection approach by 3.21x.This work is partially funded by Open Transprecision Computing (OPRECOM) project, Summer of Code 2020.Peer ReviewedPostprint (author's final draft
Computing the Similarity Estimate Using Approximate Memory
In many computing applications there is a need to compute the similarity of sets of elements. When the sets have many elements or comparison involves many sets, computing the similarity requires significant computational effort and storage capacity. As in most cases, a reasonably accurate estimate is sufficient, many algorithms for similarity estimation have been proposed during the last decades. Those algorithms compute signatures for the sets and use them to estimate similarity. However, as the number of sets that need to be compared grows, even these similarity estimation algorithms require significant memory with its associated power dissipation. This article for the first time considers the use of approximate memories for similarity estimation. A theoretical analysis and simulation results are provided; initially it is shown that similarity sketches can tolerate large bit error rates and thus, they can benefit from using approximate memories without substantially compromising the accuracy of the similarity estimate. An understanding of the effect of errors in the stored signatures on the similarity estimate is pursued. A scheme to mitigate the impact of errors is presented; the proposed scheme tolerates even larger bit error rates and does not need additional memory. For example, bit error rates of up to 10 4 have less than a 1% impact on the accuracy of the estimate when the memory is unprotected, and larger bit errors rates can be tolerated if the memory is parity protected. These findings can be used for voltage supply scaling and increasing the refresh time in SRAMs and DRAMs. Based on those initial results, an enhanced implementation is further proposed for unprotected memories that further extends the range of tolerated BERs and enables power savings of up to 61.31% for SRAMs. In conclusion, this article shows that the use of approximate memories in sketches for similarity estimation provides significant benefits with a negligible impact on accuracy.This work was supported by ACHILLES project PID2019-104207RB-I00 and Go2Edge network RED2018-102585-T funded by the Spanish Agencia Estatal de Investigación (AEI) 10.13039/501100011033 and by the Madrid Community research project TAPIR-CM under Grant P2018/TCS-4496. The research of S. Liu and F. Lombardi was supported by NSF under Grants CCF-1953961 and 1812467
TuRaN: True Random Number Generation Using Supply Voltage Underscaling in SRAMs
Prior works propose SRAM-based TRNGs that extract entropy from SRAM arrays.
SRAM arrays are widely used in a majority of specialized or general-purpose
chips that perform the computation to store data inside the chip. Thus,
SRAM-based TRNGs present a low-cost alternative to dedicated hardware TRNGs.
However, existing SRAM-based TRNGs suffer from 1) low TRNG throughput, 2) high
energy consumption, 3) high TRNG latency, and 4) the inability to generate true
random numbers continuously, which limits the application space of SRAM-based
TRNGs. Our goal in this paper is to design an SRAM-based TRNG that overcomes
these four key limitations and thus, extends the application space of
SRAM-based TRNGs. To this end, we propose TuRaN, a new high-throughput,
energy-efficient, and low-latency SRAM-based TRNG that can sustain continuous
operation. TuRaN leverages the key observation that accessing SRAM cells
results in random access failures when the supply voltage is reduced below the
manufacturer-recommended supply voltage. TuRaN generates random numbers at high
throughput by repeatedly accessing SRAM cells with reduced supply voltage and
post-processing the resulting random faults using the SHA-256 hash function. To
demonstrate the feasibility of TuRaN, we conduct SPICE simulations on different
process nodes and analyze the potential of access failure for use as an entropy
source. We verify and support our simulation results by conducting real-world
experiments on two commercial off-the-shelf FPGA boards. We evaluate the
quality of the random numbers generated by TuRaN using the widely-adopted NIST
standard randomness tests and observe that TuRaN passes all tests. TuRaN
generates true random numbers with (i) an average (maximum) throughput of
1.6Gbps (1.812Gbps), (ii) 0.11nJ/bit energy consumption, and (iii) 278.46us
latency
Recommended from our members
A RISC-V Vector Processor With Simultaneous-Switching Switched-Capacitor DC-DC Converters in 28 nm FDSOI
This work demonstrates a RISC-V vector microprocessor implemented in 28 nm FDSOI with fully integrated simultaneous-switching switched-capacitor DC-DC (SC DC-DC) converters and adaptive clocking that generates four on-chip voltages between 0.45 and 1 V using only 1.0 V core and 1.8 V IO voltage inputs. The converters achieve high efficiency at the system level by switching simultaneously to avoid charge-sharing losses and by using an adaptive clock to maximize performance for the resulting voltage ripple. Details about the implementation of the DC-DC switches, DC-DC controller, and adaptive clock are provided, and the sources of conversion loss are analyzed based on measured results. This system pushes the capabilities of dynamic voltage scaling by enabling fast transitions (20 ns), simple packaging (no off-chip passives), low area overhead (16%), high conversion efficiency (80%-86%), and high energy efficiency (26.2 DP GFLOPS/W) for mobile devices
Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications
The challenging deployment of compute-intensive applications from domains
such Artificial Intelligence (AI) and Digital Signal Processing (DSP), forces
the community of computing systems to explore new design approaches.
Approximate Computing appears as an emerging solution, allowing to tune the
quality of results in the design of a system in order to improve the energy
efficiency and/or performance. This radical paradigm shift has attracted
interest from both academia and industry, resulting in significant research on
approximation techniques and methodologies at different design layers (from
system down to integrated circuits). Motivated by the wide appeal of
Approximate Computing over the last 10 years, we conduct a two-part survey to
cover key aspects (e.g., terminology and applications) and review the
state-of-the art approximation techniques from all layers of the traditional
computing stack. In Part II of our survey, we classify and present the
technical details of application-specific and architectural approximation
techniques, which both target the design of resource-efficient
processors/accelerators & systems. Moreover, we present a detailed analysis of
the application spectrum of Approximate Computing and discuss open challenges
and future directions.Comment: Under Review at ACM Computing Survey
Study and development of innovative strategies for energy-efficient cross-layer design of digital VLSI systems based on Approximate Computing
The increasing demand on requirements for high performance and energy efficiency in modern digital systems has led to the research of new design approaches that are able to go beyond the established energy-performance tradeoff. Looking at scientific literature, the Approximate Computing paradigm has been particularly prolific. Many applications in the domain of signal processing, multimedia, computer vision, machine learning are known to be particularly resilient to errors occurring on their input data and during computation, producing outputs that, although degraded, are still largely acceptable from the point of view of quality. The Approximate Computing design paradigm leverages the characteristics of this group of applications to develop circuits, architectures, algorithms that, by relaxing design constraints, perform their computations in an approximate or inexact manner reducing energy consumption. This PhD research aims to explore the design of hardware/software architectures based on Approximate Computing techniques, filling the gap in literature regarding effective applicability and deriving a systematic methodology to characterize its benefits and tradeoffs. The main contributions of this work are: -the introduction of approximate memory management inside the Linux OS, allowing dynamic allocation and de-allocation of approximate memory at user level, as for normal exact memory; - the development of an emulation environment for platforms with approximate memory units, where faults are injected during the simulation based on models that reproduce the effects on memory cells of circuital and architectural techniques for approximate memories; -the implementation and analysis of the impact of approximate memory hardware on real applications: the H.264 video encoder, internally modified to allocate selected data buffers in approximate memory, and signal processing applications (digital filter) using approximate memory for input/output buffers and tap registers; -the development of a fully reconfigurable and combinatorial floating point unit, which can work with reduced precision formats
- …