176 research outputs found
Memory and information processing in neuromorphic systems
A striking difference between brain-inspired neuromorphic processors and
current von Neumann processors architectures is the way in which memory and
processing is organized. As Information and Communication Technologies continue
to address the need for increased computational power through the increase of
cores within a digital processor, neuromorphic engineers and scientists can
complement this need by building processor architectures where memory is
distributed with the processing. In this paper we present a survey of
brain-inspired processor architectures that support models of cortical networks
and deep neural networks. These architectures range from serial clocked
implementations of multi-neuron systems to massively parallel asynchronous ones
and from purely digital systems to mixed analog/digital systems which implement
more biological-like models of neurons and synapses together with a suite of
adaptation and learning mechanisms analogous to the ones found in biological
nervous systems. We describe the advantages of the different approaches being
pursued and present the challenges that need to be addressed for building
artificial neural processing systems that can display the richness of behaviors
seen in biological systems.Comment: Submitted to Proceedings of IEEE, review of recently proposed
neuromorphic computing platforms and system
Simulation and implementation of novel deep learning hardware architectures for resource constrained devices
Corey Lammie designed mixed signal memristive-complementary metal–oxide–semiconductor (CMOS) and field programmable gate arrays (FPGA) hardware architectures, which were used to reduce the power and resource requirements of Deep Learning (DL) systems; both during inference and training. Disruptive design methodologies, such as those explored in this thesis, can be used to facilitate the design of next-generation DL systems
Embracing the Unreliability of Memory Devices for Neuromorphic Computing
The emergence of resistive non-volatile memories opens the way to highly
energy-efficient computation near- or in-memory. However, this type of
computation is not compatible with conventional ECC, and has to deal with
device unreliability. Inspired by the architecture of animal brains, we present
a manufactured differential hybrid CMOS/RRAM memory architecture suitable for
neural network implementation that functions without formal ECC. We also show
that using low-energy but error-prone programming conditions only slightly
reduces network accuracy
SPRING: A Sparsity-Aware Reduced-Precision Monolithic 3D CNN Accelerator Architecture for Training and Inference
CNNs outperform traditional machine learning algorithms across a wide range
of applications. However, their computational complexity makes it necessary to
design efficient hardware accelerators. Most CNN accelerators focus on
exploring dataflow styles that exploit computational parallelism. However,
potential performance speedup from sparsity has not been adequately addressed.
The computation and memory footprint of CNNs can be significantly reduced if
sparsity is exploited in network evaluations. To take advantage of sparsity,
some accelerator designs explore sparsity encoding and evaluation on CNN
accelerators. However, sparsity encoding is just performed on activation or
weight and only in inference. It has been shown that activation and weight also
have high sparsity levels during training. Hence, sparsity-aware computation
should also be considered in training. To further improve performance and
energy efficiency, some accelerators evaluate CNNs with limited precision.
However, this is limited to the inference since reduced precision sacrifices
network accuracy if used in training. In addition, CNN evaluation is usually
memory-intensive, especially in training. In this paper, we propose SPRING, a
SParsity-aware Reduced-precision Monolithic 3D CNN accelerator for trainING and
inference. SPRING supports both CNN training and inference. It uses a binary
mask scheme to encode sparsities in activation and weight. It uses the
stochastic rounding algorithm to train CNNs with reduced precision without
accuracy loss. To alleviate the memory bottleneck in CNN evaluation, especially
in training, SPRING uses an efficient monolithic 3D NVM interface to increase
memory bandwidth. Compared to GTX 1080 Ti, SPRING achieves 15.6X, 4.2X and
66.0X improvements in performance, power reduction, and energy efficiency,
respectively, for CNN training, and 15.5X, 4.5X and 69.1X improvements for
inference
A Construction Kit for Efficient Low Power Neural Network Accelerator Designs
Implementing embedded neural network processing at the edge requires
efficient hardware acceleration that couples high computational performance
with low power consumption. Driven by the rapid evolution of network
architectures and their algorithmic features, accelerator designs are
constantly updated and improved. To evaluate and compare hardware design
choices, designers can refer to a myriad of accelerator implementations in the
literature. Surveys provide an overview of these works but are often limited to
system-level and benchmark-specific performance metrics, making it difficult to
quantitatively compare the individual effect of each utilized optimization
technique. This complicates the evaluation of optimizations for new accelerator
designs, slowing-down the research progress. This work provides a survey of
neural network accelerator optimization approaches that have been used in
recent works and reports their individual effects on edge processing
performance. It presents the list of optimizations and their quantitative
effects as a construction kit, allowing to assess the design choices for each
building block separately. Reported optimizations range from up to 10'000x
memory savings to 33x energy reductions, providing chip designers an overview
of design choices for implementing efficient low power neural network
accelerators
Algorithm and Hardware Co-design for Learning On-a-chip
abstract: Machine learning technology has made a lot of incredible achievements in recent years. It has rivalled or exceeded human performance in many intellectual tasks including image recognition, face detection and the Go game. Many machine learning algorithms require huge amount of computation such as in multiplication of large matrices. As silicon technology has scaled to sub-14nm regime, simply scaling down the device cannot provide enough speed-up any more. New device technologies and system architectures are needed to improve the computing capacity. Designing specific hardware for machine learning is highly in demand. Efforts need to be made on a joint design and optimization of both hardware and algorithm.
For machine learning acceleration, traditional SRAM and DRAM based system suffer from low capacity, high latency, and high standby power. Instead, emerging memories, such as Phase Change Random Access Memory (PRAM), Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM), and Resistive Random Access Memory (RRAM), are promising candidates providing low standby power, high data density, fast access and excellent scalability. This dissertation proposes a hierarchical memory modeling framework and models PRAM and STT-MRAM in four different levels of abstraction. With the proposed models, various simulations are conducted to investigate the performance, optimization, variability, reliability, and scalability.
Emerging memory devices such as RRAM can work as a 2-D crosspoint array to speed up the multiplication and accumulation in machine learning algorithms. This dissertation proposes a new parallel programming scheme to achieve in-memory learning with RRAM crosspoint array. The programming circuitry is designed and simulated in TSMC 65nm technology showing 900X speedup for the dictionary learning task compared to the CPU performance.
From the algorithm perspective, inspired by the high accuracy and low power of the brain, this dissertation proposes a bio-plausible feedforward inhibition spiking neural network with Spike-Rate-Dependent-Plasticity (SRDP) learning rule. It achieves more than 95% accuracy on the MNIST dataset, which is comparable to the sparse coding algorithm, but requires far fewer number of computations. The role of inhibition in this network is systematically studied and shown to improve the hardware efficiency in learning.Dissertation/ThesisDoctoral Dissertation Electrical Engineering 201
Bio-inspired learning and hardware acceleration with emerging memories
Machine Learning has permeated many aspects of engineering, ranging from the Internet of Things (IoT) applications to big data analytics. While computing resources available to implement these algorithms have become more powerful, both in terms of the complexity of problems that can be solved and the overall computing speed, the huge energy costs involved remains a significant challenge. The human brain, which has evolved over millions of years, is widely accepted as the most efficient control and cognitive processing platform. Neuro-biological studies have established that information processing in the human brain relies on impulse like signals emitted by neurons called action potentials. Motivated by these facts, the Spiking Neural Networks (SNNs), which are a bio-plausible version of neural networks have been proposed as an alternative computing paradigm where the timing of spikes generated by artificial neurons is central to its learning and inference capabilities. This dissertation demonstrates the computational power of the SNNs using conventional CMOS and emerging nanoscale hardware platforms.
The first half of this dissertation presents an SNN architecture which is trained using a supervised spike-based learning algorithm for the handwritten digit classification problem. This network achieves an accuracy of 98.17% on the MNIST test data-set, with about 4X fewer parameters compared to the state-of-the-art neural networks achieving over 99% accuracy. In addition, a scheme for parallelizing and speeding up the SNN simulation on a GPU platform is presented. The second half of this dissertation presents an optimal hardware design for accelerating SNN inference and training with SRAM (Static Random Access Memory) and nanoscale non-volatile memory (NVM) crossbar arrays. Three prominent NVM devices are studied for realizing hardware accelerators for SNNs: Phase Change Memory (PCM), Spin Transfer Torque RAM (STT-RAM) and Resistive RAM (RRAM). The analysis shows that a spike-based inference engine with crossbar arrays of STT-RAM bit-cells is 2X and 5X more efficient compared to PCM and RRAM memories, respectively. Furthermore, the STT-RAM design has nearly 6X higher throughput per unit Watt per unit area than that of an equivalent SRAM-based (Static Random Access Memory) design. A hardware accelerator with on-chip learning on an STT-RAM memory array is also designed, requiring bits of floating-point synaptic weight precision to reach the baseline SNN algorithmic performance on the MNIST dataset. The complete design with STT-RAM crossbar array achieves nearly 20X higher throughput per unit Watt per unit mm^2 than an equivalent design with SRAM memory.
In summary, this work demonstrates the potential of spike-based neuromorphic computing algorithms and its efficient realization in hardware based on conventional CMOS as well as emerging technologies. The schemes presented here can be further extended to design spike-based systems that can be ubiquitously deployed for energy and memory constrained edge computing applications
- …