122 research outputs found
Towards Efficient In-memory Computing Hardware for Quantized Neural Networks: State-of-the-art, Open Challenges and Perspectives
The amount of data processed in the cloud, the development of
Internet-of-Things (IoT) applications, and growing data privacy concerns force
the transition from cloud-based to edge-based processing. Limited energy and
computational resources at the edge push the transition from traditional von
Neumann architectures to In-memory Computing (IMC), especially for machine
learning and neural network applications. Network compression techniques are
applied to implement a neural network on limited hardware resources.
Quantization is one of the most efficient network compression techniques,
reducing the memory footprint, latency, and energy consumption. This
paper provides a comprehensive review of IMC-based Quantized Neural Networks
(QNN) and links software-based quantization approaches to IMC hardware
implementation. Moreover, open challenges, QNN design requirements,
recommendations, and perspectives along with an IMC-based QNN hardware roadmap
are provided.
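To make the quantization idea in the abstract above concrete, here is a minimal sketch of symmetric uniform quantization in plain Python. It is not taken from the reviewed paper; the function names `quantize` and `dequantize` and the example weights are assumptions chosen for illustration.

```python
def quantize(weights, num_bits=8):
    """Symmetric uniform quantization: map floats to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


# Hypothetical example weights (not from any of the listed papers).
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize(weights, num_bits=8)
approx = dequantize(q, scale)
```

Each weight is stored as a small integer plus one shared `scale`, which is where the memory saving comes from; the reconstruction error per weight is at most about half of `scale`.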
Asynchronous spiking neurons, the natural key to exploit temporal sparsity
Inference of Deep Neural Networks for stream signal (video/audio) processing in edge devices is still challenging. Unlike most state-of-the-art inference engines, which are efficient for static signals, our brain is optimized for real-time dynamic signal processing. We believe one important feature of the brain (asynchronous stateful processing) is the key to its excellence in this domain. In this work, we show how asynchronous processing with stateful neurons allows exploitation of the existing sparsity in natural signals. This paper explains three different types of sparsity and proposes an inference algorithm which exploits all types of sparsity in the execution of already trained networks. Our experiments in three different applications (handwritten digit recognition, autonomous steering, and hand-gesture recognition) show that this model of inference reduces the number of required operations for sparse input data by one to two orders of magnitude. Additionally, due to fully asynchronous processing, this type of inference can be run on fully distributed and scalable neuromorphic hardware platforms.
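The following is a rough sketch of how a stateful layer can exploit temporal sparsity by processing only the inputs that changed between frames. It is a generic change-driven (delta) update, not the inference algorithm proposed in the paper; the class name `DeltaLinear`, the `threshold` parameter, and the example values are assumptions made for illustration.

```python
class DeltaLinear:
    """Linear layer that keeps its pre-activation state between frames
    and only propagates inputs whose value changed."""

    def __init__(self, weights):
        self.weights = weights                      # weights[j][i]: input i -> output j
        self.prev_input = [0.0] * len(weights[0])   # last propagated input values
        self.state = [0.0] * len(weights)           # accumulated pre-activations
        self.ops = 0                                # multiply-accumulate counter

    def step(self, x, threshold=0.0):
        for i, xi in enumerate(x):
            delta = xi - self.prev_input[i]
            if abs(delta) <= threshold:
                continue                            # unchanged input: no work done
            self.prev_input[i] = xi
            for j, row in enumerate(self.weights):
                self.state[j] += row[i] * delta
                self.ops += 1
        return list(self.state)


# Two consecutive "frames" that differ in a single input value.
layer = DeltaLinear([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
out1 = layer.step([1.0, 0.0, 2.0])
out2 = layer.step([1.0, 0.0, 3.0])
```

In this toy run the second frame costs two multiply-accumulates instead of the six a dense layer would need, because only one input changed.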
Doctor of Philosophy
Deep Neural Networks (DNNs) are the state-of-the-art solution in a growing number of tasks including computer vision, speech recognition, and genomics. However, DNNs are computationally expensive as they are carefully trained to extract and abstract features from raw data using multiple layers of neurons with millions of parameters. In this dissertation, we primarily focus on inference, e.g., using a DNN to classify an input image. This is an operation that will be repeatedly performed on billions of devices in the datacenter, in self-driving cars, in drones, etc. We observe that DNNs spend a vast majority of their runtime performing matrix-by-vector multiplications (MVM). MVMs have two major bottlenecks: fetching the matrix and performing sum-of-product operations. To address these bottlenecks, we use in-situ computing, where the matrix is stored in programmable resistor arrays, called crossbars, and sum-of-product operations are performed using analog computing. In this dissertation, we propose two hardware units, ISAAC and Newton. In ISAAC, we show that in-situ computing designs can outperform DNN digital accelerators if they leverage pipelining and smart encodings, and can distribute a computation in time and space, within crossbars, and across crossbars. In the ISAAC design, roughly half the chip area/power can be attributed to the analog-to-digital conversion (ADC), i.e., it remains the key design challenge in mixed-signal accelerators for deep networks. In spite of the ADC bottleneck, ISAAC is able to outperform the computational efficiency of the state-of-the-art design (DaDianNao) by 8x. In Newton, we take advantage of a number of techniques to address ADC inefficiency. These techniques exploit matrix transformations, heterogeneity, and smart mapping of computation to the analog substrate. We show that Newton can increase the efficiency of in-situ computing by an additional 2x.
Finally, we show that in-situ computing, unfortunately, cannot be easily adapted to handle training of deep networks, i.e., it is only suitable for inference of already-trained networks. By improving the efficiency of DNN inference with ISAAC and Newton, we move closer to low-cost deep learning that, in turn, will have societal impact through self-driving cars, assistive systems for the disabled, and precision medicine.
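As an illustration of the in-situ matrix-by-vector multiplication described in the dissertation abstract above, here is an idealized sketch: the matrix is held as crossbar conductances, input voltages drive the rows, and each column current is the sum of products, which an ADC then digitizes. It does not model ISAAC or Newton; the function names `crossbar_mvm` and `adc`, the full-scale value, and the example numbers are assumptions, and device non-idealities are ignored.

```python
def crossbar_mvm(conductances, voltages):
    """Ideal crossbar: column current j is sum_i voltages[i] * conductances[i][j]."""
    rows, cols = len(conductances), len(conductances[0])
    return [
        sum(voltages[i] * conductances[i][j] for i in range(rows))
        for j in range(cols)
    ]


def adc(current, full_scale, bits):
    """Quantize an analog column current to an unsigned integer code."""
    levels = 2 ** bits - 1
    clipped = max(0.0, min(full_scale, current))    # clip to the ADC input range
    return round(clipped / full_scale * levels)


# Hypothetical 2x2 crossbar, both rows driven with unit voltage.
conductances = [[1.0, 2.0], [3.0, 4.0]]
voltages = [1.0, 1.0]
currents = crossbar_mvm(conductances, voltages)
codes = [adc(c, full_scale=6.0, bits=4) for c in currents]
```

The analog summation itself is cheap; the `adc` step is the part the abstract identifies as the dominant cost, which is why its resolution (`bits`) is a central design parameter.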
Bio-inspired Neuromorphic Computing Using Memristor Crossbar Networks
Bio-inspired neuromorphic computing systems built with emerging devices such as memristors have become an active research field. Experimental demonstrations at the network-level have suggested memristor-based neuromorphic systems as a promising candidate to overcome the von-Neumann bottleneck in future computing applications. As a hardware system that offers co-location of memory and data processing, memristor-based networks represent an efficient computing platform with minimal data transfer and high parallelism. Furthermore, active utilization of the dynamic processes during resistive switching in memristors can help realize more faithful emulation of biological device and network behaviors, with the potential to process dynamic temporal inputs efficiently.
In this thesis, I present experimental demonstrations of neuromorphic systems using fabricated memristor arrays as well as network-level simulation results. Models of resistive switching behavior in two types of memristor devices, conventional first-order and recently proposed second-order memristor devices, will be first introduced. Secondly, experimental demonstration of K-means clustering through unsupervised learning in a memristor network will be presented. The memristor-based hardware systems achieved high classification accuracy (93.3%) on the standard IRIS data set, suggesting practical networks can be built with optimized memristor devices. Thirdly, implementation of a partial differential equation (PDE) solver in memristor arrays will be discussed. This work expands the capability of memristor-based computing hardware from ‘soft’ to ‘hard’ computing tasks, which require very high precision and accurate solutions. In general, first-order memristors are suitable to perform tasks that are based on vector-matrix multiplications, ranging from K-means clustering to PDE solvers. On the other hand, utilizing internal device dynamics in second-order memristors can allow natural emulation of biological behaviors and enable network functions such as temporal data processing. An effort to explore second-order memristor devices and their network behaviors will be discussed. Finally, we propose ideas to build large-size passive memristor crossbar arrays, including fabrication approaches, guidelines of device structure, and analysis of the parasitic effects in larger arrays.
PhD, Electrical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/147610/1/yjjeong_1.pd
An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System
Training machine learning (ML) algorithms is a computationally intensive
process, which is frequently memory-bound due to repeatedly accessing large
training datasets. As a result, processor-centric systems (e.g., CPU, GPU)
suffer from costly data movement between memory units and processing units,
which consumes large amounts of energy and execution cycles. Memory-centric
computing systems, i.e., with processing-in-memory (PIM) capabilities, can
alleviate this data movement bottleneck.
Our goal is to understand the potential of modern general-purpose PIM
architectures to accelerate ML training. To do so, we (1) implement several
representative classic ML algorithms (namely, linear regression, logistic
regression, decision tree, and K-Means clustering) on a real-world general-purpose
PIM architecture, (2) rigorously evaluate and characterize them in terms of
accuracy, performance and scaling, and (3) compare to their counterpart
implementations on CPU and GPU. Our evaluation on a real memory-centric
computing system with more than 2500 PIM cores shows that general-purpose PIM
architectures can greatly accelerate memory-bound ML workloads, when the
necessary operations and datatypes are natively supported by PIM hardware. For
example, our PIM implementation of decision tree is faster than a
state-of-the-art CPU version on an 8-core Intel Xeon, and faster
than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering
on PIM is likewise faster than state-of-the-art CPU and GPU
versions.
To our knowledge, our work is the first one to evaluate ML training on a
real-world PIM architecture. We conclude with key observations, takeaways, and
recommendations that can inspire users of ML workloads, programmers of PIM
architectures, and hardware designers & architects of future memory-centric
computing systems.
FAT: An In-Memory Accelerator with Fast Addition for Ternary Weight Neural Networks
Convolutional Neural Networks (CNNs) demonstrate excellent performance in
various applications but have high computational complexity. Quantization is
applied to reduce the latency and storage cost of CNNs. Among the quantization
methods, Binary and Ternary Weight Networks (BWNs and TWNs) have a unique
advantage over 8-bit and 4-bit quantization. They replace the multiplication
operations in CNNs with additions, which are favoured on In-Memory-Computing
(IMC) devices. IMC acceleration for BWNs has been widely studied. However,
though TWNs have higher accuracy and better sparsity than BWNs, IMC
acceleration for TWNs has received little research attention. TWNs on existing IMC devices are
inefficient because the sparsity is not well utilized, and the addition
operation is not efficient.
In this paper, we propose FAT as a novel IMC accelerator for TWNs. First, we
propose a Sparse Addition Control Unit, which utilizes the sparsity of TWNs to
skip the null operations on zero weights. Second, we propose a fast addition
scheme based on the memory Sense Amplifier to avoid the time overhead of both
carry propagation and writing back the carry to memory cells. Third, we further
propose a Combined-Stationary data mapping to reduce the data movement of
activations and weights and increase the parallelism across memory columns.
Simulation results show that for addition operations at the Sense Amplifier
level, FAT achieves 2.00X speedup, 1.22X power efficiency, and 1.22X area
efficiency compared with a state-of-the-art IMC accelerator, ParaPIM. FAT
achieves 10.02X speedup and 12.19X energy efficiency compared with ParaPIM on
networks with 80% average sparsity.
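The two properties of ternary weight networks that the abstract above relies on (multiplications become additions, and zero weights can be skipped) can be sketched in a few lines of Python. This is a software illustration only, not the FAT hardware design; the function name `ternary_dot` and the example values are assumptions.

```python
def ternary_dot(weights, activations):
    """Dot product with weights restricted to {-1, 0, +1}.

    Returns the accumulated sum and the number of additions performed.
    """
    acc = 0
    additions = 0
    for w, a in zip(weights, activations):
        if w not in (-1, 0, 1):
            raise ValueError("ternary weights must be -1, 0, or +1")
        if w == 0:
            continue                    # sparsity: zero weights need no operation
        acc += a if w == 1 else -a      # add or subtract instead of multiplying
        additions += 1
    return acc, additions


# Hypothetical example: 3 of 5 weights are zero (60% sparsity).
weights = [1, 0, -1, 0, 0]
activations = [3, 5, 2, 7, 1]
result, additions = ternary_dot(weights, activations)
```

With three of the five weights equal to zero, only two additions are executed, which is the kind of saving a sparsity-aware control unit aims for.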
Simulation and implementation of novel deep learning hardware architectures for resource constrained devices
Corey Lammie designed mixed-signal memristive-complementary metal–oxide–semiconductor (CMOS) and field-programmable gate array (FPGA) hardware architectures, which were used to reduce the power and resource requirements of Deep Learning (DL) systems, both during inference and training. Disruptive design methodologies, such as those explored in this thesis, can be used to facilitate the design of next-generation DL systems.