101 research outputs found
A Construction Kit for Efficient Low Power Neural Network Accelerator Designs
Implementing embedded neural network processing at the edge requires
efficient hardware acceleration that couples high computational performance
with low power consumption. Driven by the rapid evolution of network
architectures and their algorithmic features, accelerator designs are
constantly updated and improved. To evaluate and compare hardware design
choices, designers can refer to a myriad of accelerator implementations in the
literature. Surveys provide an overview of these works but are often limited to
system-level and benchmark-specific performance metrics, making it difficult to
quantitatively compare the individual effect of each utilized optimization
technique. This complicates the evaluation of optimizations for new accelerator
designs, slowing down research progress. This work provides a survey of
neural network accelerator optimization approaches that have been used in
recent works and reports their individual effects on edge processing
performance. It presents the list of optimizations and their quantitative
effects as a construction kit, allowing designers to assess the design choices for each
building block separately. Reported optimizations achieve up to 10,000x
memory savings and up to 33x energy reductions, giving chip designers an overview
of design choices for implementing efficient low-power neural network
accelerators.
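To make the construction-kit idea concrete, the sketch below shows how per-building-block gains could be composed into a rough design estimate. The technique names and factors are illustrative placeholders, not values reported by the survey, and the multiplicative composition assumes the optimizations are independent, which real designs rarely satisfy exactly:

```python
# Hypothetical sketch: composing per-technique gains from a survey-style
# "construction kit". All names and factors below are illustrative
# placeholders, not numbers taken from the survey.

# Each entry: optimization -> (memory saving factor, energy reduction factor)
construction_kit = {
    "8-bit quantization":   (4.0, 2.0),   # vs. a 32-bit float baseline (assumed)
    "weight pruning (90%)": (10.0, 3.0),  # assumed sparse-storage gains
    "loop tiling / reuse":  (1.0, 1.5),   # assumed data-movement savings
}

def estimate_design(chosen):
    """Multiply the factors of independently chosen building blocks."""
    mem, energy = 1.0, 1.0
    for name in chosen:
        m, e = construction_kit[name]
        mem *= m
        energy *= e
    return mem, energy

mem_x, energy_x = estimate_design(["8-bit quantization", "weight pruning (90%)"])
print(f"estimated {mem_x:.0f}x memory saving, {energy_x:.0f}x energy reduction")
```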
An ultra-low power in-memory computing cell for binarized neural networks
Deep Neural Networks (DNNs) are widely used in many artificial intelligence applications such as image classification and image recognition. Data movement in DNNs results in increased power consumption. The primary reason for this energy-expensive data movement is the conventional von Neumann architecture, in which the computing unit and memory are physically separated. To address this issue, in-memory computing schemes have been proposed in the literature. The fundamental principle behind in-memory computing is to perform vector computations closer to the memory. In-memory computing schemes based on CMOS technologies are of great practical importance due to the ease of mass production and commercialization. However, many of the proposed in-memory computing schemes suffer from power and performance degradation. Moreover, some of them reduce power consumption only to a small extent, and doing so requires sacrificing the overall signal-to-noise ratio (SNR). This thesis discusses an efficient In-Memory Computing (IMC) cell for Binarized Neural Networks (BNNs). The IMC cell was modelled using the simplest current-computing method and offers a practical solution to the energy-expensive data movement within BNNs. A 4-bit Digital-to-Analog Converter (DAC) is designed and simulated in a 130nm CMOS process, and the functionality of the IMC scheme for BNNs is demonstrated with it. The optimised 4-bit DAC shows that this is a powerful IMC method for BNNs. The results presented in this thesis show that this IMC approach is capable of accurately performing the dot-product operation between the input activations and the weights. Furthermore, the 4-bit DAC provides 4-bit weight precision, which offers an effective means of improving overall accuracy.
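For context, the dot product such a BNN cell evaluates reduces, for ±1-valued activations and weights, to an XNOR followed by a popcount. A minimal software sketch of that computation follows; the bit encoding and vector length are illustrative assumptions, not details from the thesis:

```python
# Minimal sketch of the binary dot product a BNN in-memory array evaluates.
# Encoding assumption: bit 1 represents +1 and bit 0 represents -1, so the
# elementwise product of {-1,+1} values becomes an XNOR of the packed bits.

N = 16  # vector length (illustrative)

def bnn_dot(activations: int, weights: int, n: int = N) -> int:
    """Dot product of two n-element {-1,+1} vectors packed as bitmasks."""
    xnor = ~(activations ^ weights) & ((1 << n) - 1)  # 1 where signs agree
    matches = bin(xnor).count("1")                    # popcount
    return 2 * matches - n                            # +1 per match, -1 per mismatch

# Example: identical vectors give the maximum dot product, n.
a = 0b1010_1100_0011_0101
print(bnn_dot(a, a))  # -> 16
```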
DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory
Processing-in-memory (PIM), as a novel computing paradigm, provides
significant performance benefits from the aspect of effective data movement
reduction. SRAM-based PIM has been demonstrated as one of the most promising
candidates due to its endurance and compatibility. However, the integration
density of SRAM-based PIM is much lower than other non-volatile memory-based
ones, due to its inherent 6T structure for storing a single bit. Within
comparable area constraints, SRAM-based PIM exhibits notably lower capacity.
Thus, aiming to unleash its capacity potential, we propose DDC-PIM, an
efficient algorithm/architecture co-design methodology that effectively doubles
the equivalent data capacity. At the algorithmic level, we propose a
filter-wise complementary correlation (FCC) algorithm to obtain a bitwise
complementary pair. At the architecture level, we exploit the intrinsic
cross-coupled structure of 6T SRAM to store the bitwise complementary pair in
their complementary states ($Q/\overline{Q}$), thereby maximizing the data
capacity of each SRAM cell. The dual-broadcast input structure and
reconfigurable unit support both depthwise and pointwise convolution, adhering
to the requirements of various neural networks. Evaluation results show that
DDC-PIM yields about speedup on MobileNetV2 and on
EfficientNet-B0 with negligible accuracy loss compared with PIM baseline
implementation. Compared with state-of-the-art SRAM-based PIM macros, DDC-PIM
achieves up to and improvement in weight density and
area efficiency, respectively.
Comment: 14 pages, to be published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD).
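As a rough illustration of the pairing idea behind FCC: the abstract does not spell out the algorithm, so the greedy heuristic below is an assumption, not the paper's method. Binarized filters can be matched by how closely one equals the bitwise complement of the other, since a perfectly complementary pair maps onto the Q and Q-bar nodes of the same 6T cells:

```python
# Illustrative sketch (not the paper's FCC algorithm): greedily pair
# binarized filters by how close one is to the bitwise complement of the
# other. A perfectly complementary pair can share 6T SRAM cells, with one
# filter on the Q nodes and its partner read for free from the Q-bar nodes.
import itertools
import numpy as np

def complementary_correlation(f1: np.ndarray, f2: np.ndarray) -> float:
    """Fraction of bit positions where f2 equals the complement of f1."""
    return float(np.mean(f2 == 1 - f1))

def pair_filters(filters):
    """Greedy matching by descending complementary correlation (assumed heuristic)."""
    scores = sorted(
        ((complementary_correlation(filters[i], filters[j]), i, j)
         for i, j in itertools.combinations(range(len(filters)), 2)),
        reverse=True,
    )
    used, pairs = set(), []
    for score, i, j in scores:
        if i not in used and j not in used:
            used.update((i, j))
            pairs.append((i, j, score))
    return pairs

rng = np.random.default_rng(0)
filters = [rng.integers(0, 2, size=32) for _ in range(4)]
filters.append(1 - filters[0])  # plant a perfectly complementary partner
for i, j, s in pair_filters(filters):
    print(f"filters {i} and {j}: complementary correlation {s:.2f}")
```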