A Construction Kit for Efficient Low Power Neural Network Accelerator Designs
Implementing embedded neural network processing at the edge requires
efficient hardware acceleration that couples high computational performance
with low power consumption. Driven by the rapid evolution of network
architectures and their algorithmic features, accelerator designs are
constantly updated and improved. To evaluate and compare hardware design
choices, designers can refer to a myriad of accelerator implementations in the
literature. Surveys provide an overview of these works but are often limited to
system-level and benchmark-specific performance metrics, making it difficult to
quantitatively compare the individual effect of each utilized optimization
technique. This complicates the evaluation of optimizations for new accelerator
designs, slowing down research progress. This work provides a survey of
neural network accelerator optimization approaches that have been used in
recent works and reports their individual effects on edge processing
performance. It presents the list of optimizations and their quantitative
effects as a construction kit, allowing designers to assess the design choices
for each building block separately. Reported optimizations range from up to
10,000x memory savings to 33x energy reductions, providing chip designers with
an overview of design choices for implementing efficient low power neural
network accelerators.
Energy Efficient Learning with Low Resolution Stochastic Domain Wall Synapse Based Deep Neural Networks
We demonstrate that extremely low resolution quantized (nominally 5-state)
synapses with large stochastic variations in Domain Wall (DW) position can be
both energy efficient and achieve reasonably high testing accuracies compared
to Deep Neural Networks (DNNs) of similar sizes using floating-point precision
synaptic weights. Specifically, voltage-controlled DW devices demonstrate
stochastic behavior, as modeled rigorously with micromagnetic simulations, and
can only encode a limited number of states; however, they can be extremely energy efficient
during both training and inference. We show that by implementing suitable
modifications to the learning algorithms, we can address the stochastic
behavior as well as mitigate the effect of their low resolution to achieve high
testing accuracies. In this study, we propose both in-situ and ex-situ training
algorithms, based on a modification of the algorithm proposed by Hubara et al.
[1], which works well with quantization of synaptic weights. We train several
5-layer DNNs on the MNIST dataset using 2-, 3-, and 5-state DW devices as synapses.
For in-situ training, a separate high precision memory unit is adopted to
preserve and accumulate the weight gradients, which are then quantized to
program the low precision DW devices. Moreover, a sizeable noise tolerance
margin is used during the training to address the intrinsic programming noise.
For ex-situ training, a precursor DNN is first trained based on the
characterized DW device model and a noise tolerance margin, which is similar to
the in-situ training. Remarkably, for in-situ inference the energy dissipation
to program the devices is only 13 pJ per inference, given that the training is
performed over the entire MNIST dataset for 10 epochs.
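The in-situ scheme described above, accumulating gradients in a separate high-precision copy of the weights and quantizing them to a handful of device states under a noise tolerance margin, can be sketched in a few lines. This is a minimal illustration under assumed parameters (five evenly spaced levels, Gaussian programming noise, hypothetical function names), not the authors' implementation:

```python
import numpy as np

def quantize_to_states(w: np.ndarray, n_states: int = 5, w_max: float = 1.0) -> np.ndarray:
    """Snap continuous weights onto n_states evenly spaced device levels in [-w_max, w_max]."""
    levels = np.linspace(-w_max, w_max, n_states)
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return levels[idx]

def in_situ_step(w_hp, w_dev, grad, lr=0.1, n_states=5, margin=0.05,
                 prog_noise=0.02, rng=None):
    """One in-situ update (sketch):
      1. accumulate the gradient in the high-precision copy w_hp,
      2. quantize it to the device's levels,
      3. reprogram only cells whose target level differs from the current
         device state by more than the noise-tolerance margin, adding
         stochastic programming noise to each write (assumed Gaussian here)."""
    rng = np.random.default_rng() if rng is None else rng
    w_hp = np.clip(w_hp - lr * grad, -1.0, 1.0)            # high-precision accumulation
    target = quantize_to_states(w_hp, n_states)             # desired device levels
    reprogram = np.abs(target - w_dev) > margin              # skip writes inside the margin
    noise = prog_noise * rng.standard_normal(w_dev.shape)    # stochastic DW programming noise
    w_dev = np.where(reprogram, target + noise, w_dev)
    return w_hp, w_dev
```

Applied layer by layer inside an otherwise standard SGD loop, the device copy w_dev is what the forward pass would read, while w_hp lives only in the separate high-precision memory unit the abstract mentions; cells whose quantized target stays within the margin are left untouched, which is one way the margin can absorb programming noise and limit device writes.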
New Logic-In-Memory Paradigms: An Architectural and Technological Perspective
Processing systems are in continuous evolution thanks to constant technological advancement and architectural progress. Over the years, computing systems have become more and more powerful, providing support for applications, such as Machine Learning, that require high computational power. However, the growing complexity of modern computing units and applications has had a strong impact on power consumption. In addition, the memory plays a key role in the overall power consumption of the system, especially when considering data-intensive applications. These applications, in fact, require a lot of data movement between the memory and the computing unit. The consequence is twofold: memory accesses are expensive in terms of energy, and a lot of time is wasted in accessing the memory, rather than processing, because of the performance gap that exists between memories and processing units. This gap is known as the memory wall or the von Neumann bottleneck and is due to the different rates of progress of complementary metal-oxide-semiconductor (CMOS) technology and memories. However, CMOS scaling is also reaching a limit beyond which further progress will not be possible. This work addresses all these problems from an architectural and technological point of view by: (1) proposing a novel Configurable Logic-in-Memory Architecture that exploits the in-memory computing paradigm to reduce the memory wall problem while also providing high performance thanks to its flexibility and parallelism; (2) exploring a non-CMOS technology as a possible candidate technology for the Logic-in-Memory paradigm.
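As a rough illustration of why such data-intensive workloads hit the memory wall, one can compare arithmetic work against memory traffic with a roofline-style operational-intensity estimate. This back-of-the-envelope sketch is not taken from the thesis itself:

```python
def operational_intensity(ops: float, bytes_moved: float) -> float:
    """Arithmetic operations performed per byte of memory traffic
    (the metric used by the roofline model)."""
    return ops / bytes_moved

# Dot product of two length-N float32 vectors streamed from memory:
# 2*N operations (one multiply and one add per element), 8*N bytes read.
N = 1_000_000
oi = operational_intensity(2 * N, 8 * N)
print(f"dot-product operational intensity: {oi} ops/byte")  # 0.25 ops/byte

# At such low intensity the processor spends most of its time (and energy)
# waiting on memory rather than computing; performing the logic inside or
# next to the memory array, as the Logic-in-Memory approach proposes,
# removes much of that data movement.
```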
Exploring New Computing Paradigms for Data-Intensive Applications
The abstract is in the attachment.
PIRM: Processing In Racetrack Memories
The growth in data needs of modern applications has created significant
challenges for modern systems, leading to a "memory wall." Spintronic Domain Wall
Memory (DWM), related to Spin-Transfer Torque Memory (STT-MRAM), provides
near-SRAM read/write performance, energy savings and nonvolatility, potential
for extremely high storage density, and does not have significant endurance
limitations. However, DWM's benefits cannot address data access latency and
throughput limitations of memory bus bandwidth. We propose PIRM, a DWM-based
in-memory computing solution that leverages the properties of DWM nanowires and
allows them to serve as polymorphic gates. While normally DWM is accessed by
applying spin-polarized currents orthogonal to the nanowire at access points to
read individual bits, transverse access along the DWM nanowire allows the
differentiation of the aggregate resistance of multiple bits in the nanowire,
akin to a multilevel cell. PIRM leverages this transverse reading to directly
provide bulk-bitwise logic of multiple adjacent operands in the nanowire,
simultaneously. Based on this in-memory logic, PIRM provides a technique to
conduct multi-operand addition and two operand multiplication using transverse
access. PIRM provides a 1.6x speedup compared to the leading DRAM PIM technique
for query applications that leverage bulk bitwise operations. Compared to the
leading PIM technique for DWM, PIRM improves performance by 6.9x, 2.3x and
energy by 5.5x, 3.4x for 8-bit addition and multiplication, respectively. For
arithmetic heavy benchmarks, PIRM reduces access latency by 2.1x, while
decreasing energy consumption by 25.2x for a reasonable 10% area overhead
versus non-PIM DWM.
Comment: This paper has been accepted to the IEEE/ACM Symposium on
Microarchitecture, October 2022, under the title "CORUSCANT: Fast Efficient
Processing-in-Racetrack Memories".
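A behavioural sketch of the transverse-access idea helps make the abstract concrete: if one transverse read of several adjacent bits resolves to the number of '1' domains (like sensing a multilevel cell), then bulk-bitwise logic and multi-operand addition fall out of simple post-processing of those counts. The helper names and the LSB-first operand layout below are assumptions for illustration, not PIRM's actual circuit or microarchitecture:

```python
from typing import Dict, List

def transverse_count(bits: List[int]) -> int:
    """Model of a transverse read: the aggregate resistance across adjacent
    domains in the nanowire is assumed to resolve to the number of '1' bits."""
    return sum(bits)

def bulk_bitwise(operands: List[List[int]]) -> Dict[str, List[int]]:
    """Bulk-bitwise AND / OR / majority across all operands, derived from a
    single transverse count per bit position."""
    n, width = len(operands), len(operands[0])
    out = {"and": [], "or": [], "maj": []}
    for pos in range(width):
        c = transverse_count([op[pos] for op in operands])  # one transverse access
        out["and"].append(1 if c == n else 0)
        out["or"].append(1 if c > 0 else 0)
        out["maj"].append(1 if 2 * c > n else 0)
    return out

def multi_operand_add(operands: List[List[int]]) -> int:
    """Multi-operand addition from per-position counts: the count at bit
    position i contributes count * 2**i to the sum (LSB-first layout)."""
    width = len(operands[0])
    return sum(transverse_count([op[i] for op in operands]) << i
               for i in range(width))

# Example: 5 + 3 + 6 with 4-bit LSB-first operands.
ops = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0]]
assert multi_operand_add(ops) == 14
assert bulk_bitwise(ops)["or"] == [1, 1, 1, 0]
```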