FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture
Neural Network (NN) accelerators with emerging ReRAM (resistive random access
memory) technologies have been investigated as one of the promising solutions
to address the memory wall challenge, due to the unique capability of
processing-in-memory within ReRAM-crossbar-based processing elements
(PEs). However, the high efficiency and high density advantages of ReRAM have
not been fully utilized due to the huge communication demands among PEs and the
overhead of peripheral circuits.
In this paper, we propose a full system stack solution, composed of a
reconfigurable architecture design, Field Programmable Synapse Array (FPSA) and
its software system including neural synthesizer, temporal-to-spatial mapper,
and placement & routing. We rely heavily on the software system to keep the
hardware design compact and efficient. To satisfy the high-performance
communication demand, we optimize inter-PE communication with a reconfigurable
routing architecture and the placement & routing tool. To improve computational
density, we greatly simplify the PE circuit with a spiking scheme and then
adopt the neural synthesizer so that the high-density computation resources
can support different kinds of NN operations. In addition, we provide spiking memory blocks
(SMBs) and configurable logic blocks (CLBs) in hardware and leverage the
temporal-to-spatial mapper to utilize them to balance the storage and
computation requirements of NNs. Owing to the end-to-end software system, we can
efficiently deploy existing deep neural networks to FPSA. Evaluations show
that, compared to one of the state-of-the-art ReRAM-based NN accelerators, PRIME,
the computational density of FPSA improves by 31x; for representative NNs, its
inference performance can achieve up to 1000x speedup.
Comment: Accepted by ASPLOS 2019.
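As context for the processing-in-memory capability this abstract builds on,
the following is a minimal, illustrative NumPy sketch (not FPSA's actual
circuit) of how a ReRAM crossbar evaluates a matrix-vector product in a
single analog read:

```python
# Conceptual sketch only: an idealized ReRAM crossbar computing an
# analog matrix-vector product (the core processing-in-memory primitive).
import numpy as np

def crossbar_mvm(conductance, voltages):
    """Each column current is the sum of voltage * conductance over its
    cells (Ohm's and Kirchhoff's laws): an analog multiply-accumulate."""
    # conductance: (rows, cols) programmed cell conductances, in siemens
    # voltages:    (rows,) wordline read voltages, in volts
    return voltages @ conductance  # (cols,) bitline currents, in amperes

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 3))  # a 4x3 weight block as conductances
v = rng.uniform(0.0, 0.2, size=4)         # input activations as read voltages
print(crossbar_mvm(G, v))                 # one MVM per read cycle
```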
A Construction Kit for Efficient Low Power Neural Network Accelerator Designs
Implementing embedded neural network processing at the edge requires
efficient hardware acceleration that couples high computational performance
with low power consumption. Driven by the rapid evolution of network
architectures and their algorithmic features, accelerator designs are
constantly updated and improved. To evaluate and compare hardware design
choices, designers can refer to a myriad of accelerator implementations in the
literature. Surveys provide an overview of these works but are often limited to
system-level and benchmark-specific performance metrics, making it difficult to
quantitatively compare the individual effect of each utilized optimization
technique. This complicates the evaluation of optimizations for new accelerator
designs, slowing down research progress. This work provides a survey of
neural network accelerator optimization approaches that have been used in
recent works and reports their individual effects on edge processing
performance. It presents the optimizations and their quantitative
effects as a construction kit, allowing designers to assess the design choices
for each building block separately. Reported optimizations range from up to
10,000x memory savings to 33x energy reductions, providing chip designers with
an overview of design choices for implementing efficient low-power neural
network accelerators.
Neuro-memristive Circuits for Edge Computing: A review
The volume, veracity, variability, and velocity of data produced from the
ever-increasing network of sensors connected to the Internet pose challenges for
power management, scalability, and sustainability of cloud computing
infrastructure. Increasing the data processing capability of edge computing
devices at lower power requirements can reduce several overheads for cloud
computing solutions. This paper provides a review of neuromorphic
CMOS-memristive architectures that can be integrated into edge computing
devices. We discuss why neuromorphic architectures are useful for edge
devices and present the advantages, drawbacks, and open problems in the field
of neuro-memristive circuits for edge computing.
A Survey of Graph Pre-processing Methods: From Algorithmic to Hardware Perspectives
Graph-related applications have experienced significant growth in academia
and industry, driven by the powerful representation capabilities of graphs.
However, efficiently executing these applications faces various challenges,
such as load imbalance, random memory access, etc. To address these challenges,
researchers have proposed various acceleration systems, including software
frameworks and hardware accelerators, all of which incorporate graph
pre-processing (GPP). GPP serves as a preparatory step before the formal
execution of applications, involving techniques such as sampling and reordering.
However, GPP execution often remains overlooked, as the primary focus is
directed towards enhancing graph applications themselves. This oversight is
concerning, especially considering the explosive growth of real-world graph
data, where GPP becomes essential and even dominates system running overhead.
Furthermore, GPP methods exhibit significant variations across devices and
applications due to high customization. Unfortunately, no comprehensive work
systematically summarizes GPP. To address this gap and foster a better
understanding of GPP, we present a comprehensive survey dedicated to this area.
We propose a double-level taxonomy of GPP, considering both algorithmic and
hardware perspectives. By listing relevant works, we illustrate our
taxonomy and conduct a thorough analysis and summary of diverse GPP techniques.
Lastly, we discuss challenges in GPP and potential future directions.
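As a concrete example of one GPP technique named above, the following is a
hedged sketch (hypothetical, not drawn from any particular surveyed system)
of degree-based vertex reordering, which relabels high-degree vertices with
nearby IDs to improve memory locality:

```python
# Degree-based vertex reordering: a simple graph pre-processing example.
from collections import defaultdict

def degree_reorder(edges, num_vertices):
    """Return an old->new vertex ID mapping, highest-degree vertices first."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    order = sorted(range(num_vertices), key=lambda v: -degree[v])
    return {old: new for new, old in enumerate(order)}

edges = [(0, 3), (1, 3), (2, 3), (0, 1)]
mapping = degree_reorder(edges, num_vertices=4)
relabeled = [(mapping[u], mapping[v]) for u, v in edges]
print(mapping, relabeled)  # vertex 3 (highest degree) is relabeled to 0
```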
MemTorch: An Open-source Simulation Framework for Memristive Deep Learning Systems
Memristive devices have shown great promise to facilitate the acceleration
and improve the power efficiency of Deep Learning (DL) systems. Crossbar
architectures constructed using memristive devices can be used to efficiently
implement various in-memory computing operations, such as Multiply-Accumulate
(MAC) and unrolled-convolutions, which are used extensively in Deep Neural
Networks (DNNs) and Convolutional Neural Networks (CNNs). Currently, there is a
lack of a modernized, open-source, and general high-level simulation platform
that can fully integrate any behavioral or experimental memristive device model
and its putative non-idealities into crossbar architectures within DL systems.
This paper presents such a framework, entitled MemTorch, which adopts a
modernized software engineering methodology and integrates directly with the
well-known PyTorch Machine Learning (ML) library. We fully detail the public
release of MemTorch and its release management, and use it to perform novel
simulations of memristive DL systems, which are trained and benchmarked using
the CIFAR-10 dataset. Moreover, we present a case study, in which MemTorch is
used to simulate a near-sensor in-memory computing system for seizure detection
using Pt/Hf/Ti Resistive Random Access Memory (ReRAM) devices. Our open source
MemTorch framework can be used and expanded upon by circuit and system
designers to conveniently perform customized large-scale memristive DL
simulations taking into account various unavoidable device non-idealities, as a
preliminary step before circuit-level realization.
Comment: Submitted to IEEE Transactions on Neural Networks and Learning Systems.
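To illustrate the kind of simulation such a framework performs, the following
is a plain-PyTorch sketch (deliberately not the MemTorch API) that perturbs a
trained model's weights with a toy device-variation model before evaluation:

```python
# Not the MemTorch API: a conceptual stand-in for simulating memristive
# device-to-device variation by perturbing trained weights in place.
import torch
import torch.nn as nn

def inject_device_variation(model, sigma=0.05):
    """Multiply each weight by a lognormal factor (a simple variation model);
    modifies the model in place and returns it."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(torch.exp(torch.randn_like(p) * sigma))
    return model

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
inject_device_variation(model, sigma=0.05)
x = torch.randn(1, 8)
print(model(x))  # outputs now reflect the injected non-ideality
```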
Towards Efficient In-memory Computing Hardware for Quantized Neural Networks: State-of-the-art, Open Challenges and Perspectives
The amount of data processed in the cloud, the development of
Internet-of-Things (IoT) applications, and growing data privacy concerns force
the transition from cloud-based to edge-based processing. Limited energy and
computational resources at the edge push the transition from traditional von
Neumann architectures to In-memory Computing (IMC), especially for machine
learning and neural network applications. Network compression techniques are
applied to implement a neural network on limited hardware resources.
Quantization is one of the most efficient network compression techniques,
reducing the memory footprint, latency, and energy consumption. This
paper provides a comprehensive review of IMC-based Quantized Neural Networks
(QNN) and links software-based quantization approaches to IMC hardware
implementation. Moreover, open challenges, QNN design requirements,
recommendations, and perspectives along with an IMC-based QNN hardware roadmap
are provided.
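Since the review centers on quantization, a minimal sketch of uniform weight
quantization may help make the link to IMC hardware concrete: b-bit weights
map onto a small set of programmable conductance levels.

```python
# Uniform quantization of a weight tensor to 2**bits levels.
import numpy as np

def quantize_uniform(w, bits):
    """Snap w to 2**bits uniformly spaced levels spanning its range."""
    levels = 2 ** bits
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (levels - 1)
    codes = np.round((w - w_min) / scale)  # integer codes 0..levels-1
    return codes * scale + w_min           # dequantized weights

w = np.random.randn(4, 4).astype(np.float32)
w4 = quantize_uniform(w, bits=4)           # 16 levels, e.g. 16 conductances
print(np.abs(w - w4).max())                # worst-case rounding error
```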
Data processing and information classification: an in-memory approach
To live in the information society means to be surrounded by billions of
electronic devices full of sensors that constantly acquire data. This enormous
amount of data must be processed and classified. A commonly adopted solution
is to send these data to server farms to be processed remotely. The drawback
is a huge battery drain due to the large amount of information that must be
exchanged. To mitigate this problem, data must be processed locally, near the
sensor itself, but this requires substantial computational capability. While
microprocessors, even mobile ones, nowadays have enough computational power,
their performance is severely limited by the memory wall problem: memories are
too slow, so microprocessors cannot fetch data from them quickly enough,
greatly limiting performance. A solution is the Processing-In-Memory (PIM)
approach, in which memories are designed to process data internally,
eliminating the memory wall problem. In this work we present an example of
such a system, using the bitmap indexing algorithm as a case study. This
algorithm is used to classify data coming from many sources in parallel. We
propose a hardware accelerator designed around the Processing-In-Memory
approach that is capable of implementing this algorithm and that can also be
reconfigured to perform other tasks or to work as a standard memory. The
architecture has been synthesized using CMOS technology. The results we have
obtained highlight that not only is it possible to process and classify huge
amounts of data locally, but also that this can be achieved with very low
power consumption.
Andrighetti, M.; Turvani, G.; Santoro, G.; Vacca, M.; Marchesin, A.; Ottati, F.; Roch, M. R.; Graziano, M.; Zamboni, M.
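As a software-level illustration of the bitmap indexing algorithm used as the
case study (a sketch of the algorithm, not of the accelerator itself), each
category keeps a bitmap with one bit per record, and classification queries
reduce to the bitwise operations a PIM array can evaluate in parallel:

```python
# Bitmap indexing in miniature: one bit vector per distinct value.
records = ["red", "blue", "red", "green", "blue", "red"]

# Build one bitmap (an int used as a bit vector) per category.
bitmaps = {}
for i, value in enumerate(records):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

# Queries become pure bitwise logic over the bitmaps.
red_or_blue = bitmaps["red"] | bitmaps["blue"]
print(bin(red_or_blue))             # 0b110111: records 0, 1, 2, 4, 5 match
print(bin(red_or_blue).count("1"))  # number of matching records: 5
```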
Accelerating Generic Graph Neural Networks via Architecture, Compiler, Partition Method Co-Design
Graph neural networks (GNNs) have shown significant accuracy improvements in
a variety of graph learning domains, sparking considerable research interest.
To translate these accuracy improvements into practical applications, it is
essential to develop high-performance and efficient hardware acceleration for
GNN models. However, designing GNN accelerators faces two fundamental
challenges: the high bandwidth requirement of GNN models and the diversity of
GNN models. Previous works have addressed the first challenge by using more
expensive memory interfaces to achieve higher bandwidth. For the second
challenge, existing works either support specific GNN models or have generic
designs with poor hardware utilization.
In this work, we tackle both challenges simultaneously. First, we identify a
new type of partition-level operator fusion, which we utilize to internally
reduce the high bandwidth requirement of GNNs. Next, we introduce
partition-level multi-threading to schedule the concurrent processing of graph
partitions, utilizing different hardware resources. To further reduce the extra
on-chip memory required by multi-threading, we propose fine-grained graph
partitioning to generate denser graph partitions. Importantly, these three
methods make no assumptions about the targeted GNN models, addressing the
challenge of model variety. We implement these methods in a framework called
SwitchBlade, consisting of a compiler, a graph partitioner, and a hardware
accelerator. Our evaluation demonstrates that SwitchBlade achieves significant
average speedup and energy savings compared to the
NVIDIA V100 GPU. Additionally, SwitchBlade delivers performance comparable to
state-of-the-art specialized accelerators.
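To make the partitioning idea concrete, the following is a hedged sketch (not
SwitchBlade's actual partitioner) of fine-grained graph partitioning: the edge
list is split into small destination-vertex ranges so that each partition's
working set fits on chip and partitions can be scheduled on different
hardware units:

```python
# Fine-grained edge partitioning by destination-vertex range (illustrative).
def partition_edges(edges, num_vertices, part_size):
    """Group edges into buckets of destination-ID ranges of width part_size."""
    num_parts = (num_vertices + part_size - 1) // part_size
    parts = [[] for _ in range(num_parts)]
    for u, v in edges:
        parts[v // part_size].append((u, v))
    return parts

edges = [(0, 1), (2, 1), (3, 5), (0, 4), (1, 5)]
for i, p in enumerate(partition_edges(edges, num_vertices=6, part_size=2)):
    print(f"partition {i}: {p}")  # each bucket touches <= 2 destinations
```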
Computing-In-Memory Neural Network Accelerators for Safety-Critical Systems: Can Small Device Variations Be Disastrous?
Computing-in-Memory (CiM) architectures based on emerging non-volatile memory
(NVM) devices have demonstrated great potential for deep neural network (DNN)
acceleration thanks to their high energy efficiency. However, NVM devices
suffer from various non-idealities, especially device-to-device variations due
to fabrication defects and cycle-to-cycle variations due to the stochastic
behavior of devices. As such, the DNN weights actually mapped to NVM devices
could deviate significantly from the expected values, leading to large
performance degradation. To address this issue, most existing works focus on
maximizing average performance under device variations. This objective would
work well for general-purpose scenarios. But for safety-critical applications,
the worst-case performance must also be considered. Unfortunately, this has
been rarely explored in the literature. In this work, we formulate the problem
of determining the worst-case performance of CiM DNN accelerators under the
impact of device variations. We further propose a method to effectively find
the specific combination of device variation in the high-dimensional space that
leads to the worst-case performance. We find that even with very small device
variations, the accuracy of a DNN can drop drastically, causing concerns when
deploying CiM accelerators in safety-critical applications. Finally, we show
that, surprisingly, none of the existing methods used to enhance average DNN
performance in CiM accelerators is very effective when extended to enhance the
worst-case performance, and further research is needed to address this
problem.
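The worst-case formulation lends itself to a small illustration. The sketch
below uses generic projected gradient ascent (not necessarily the authors'
method) to search for the bounded weight perturbation, standing in for device
variation, that maximizes the loss:

```python
# Projected gradient ascent toward a worst-case bounded weight perturbation.
import torch
import torch.nn as nn

model = nn.Linear(8, 2)
x, y = torch.randn(16, 8), torch.randint(0, 2, (16,))
eps = 0.01                                  # per-weight variation bound
delta = torch.zeros_like(model.weight, requires_grad=True)

for _ in range(50):
    out = x @ (model.weight + delta).T + model.bias  # perturbed forward pass
    loss = nn.functional.cross_entropy(out, y)
    loss.backward()
    with torch.no_grad():
        delta += 0.005 * delta.grad.sign()  # step uphill on the loss
        delta.clamp_(-eps, eps)             # project back into the bound
        delta.grad.zero_()

print(loss.item())  # loss under the (approximate) worst-case variation
```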