65 research outputs found
Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks
Fully realizing the potential of acceleration for Deep Neural Networks (DNNs)
requires understanding and leveraging algorithmic properties. This paper builds
upon the algorithmic insight that bitwidth of operations in DNNs can be reduced
without compromising their classification accuracy. However, to prevent
accuracy loss, the bitwidth varies significantly across DNNs and it may even be
adjusted for each layer. Thus, a fixed-bitwidth accelerator would either offer
limited benefits to accommodate the worst-case bitwidth requirements, or lead
to a degradation in final accuracy. To alleviate these deficiencies, this work
introduces dynamic bit-level fusion/decomposition as a new dimension in the
design of DNN accelerators. We explore this dimension by designing Bit Fusion,
a bit-flexible accelerator that comprises an array of bit-level processing
elements that dynamically fuse to match the bitwidth of individual DNN layers.
This flexibility in the architecture enables minimizing the computation and the
communication at the finest granularity possible with no loss in accuracy. We
evaluate the benefits of Bit Fusion using eight real-world feed-forward and
recurrent DNNs. The proposed microarchitecture is implemented in Verilog and
synthesized in 45 nm technology. Using the synthesis results and cycle accurate
simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN
accelerators, Eyeriss and Stripes. In the same area, frequency, and process
technology, Bit Fusion offers 3.9x speedup and 5.1x energy savings over Eyeriss.
Compared to Stripes, Bit Fusion provides 2.6x speedup and 3.9x energy reduction
at the 45 nm node when Bit Fusion's area and frequency are set to those of Stripes.
Scaled to the GPU technology node of 16 nm, Bit Fusion almost matches the
performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while
consuming only 895 milliwatts of power.
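
The bit-level fusion idea summarized above can be illustrated with a small sketch. It is not the paper's microarchitecture: it assumes unsigned operands and 2-bit building blocks, and the names `split_into_bricks`, `fused_multiply`, and `bricks_needed` are hypothetical helpers introduced only for this illustration. It shows how a wider multiplication can be composed from narrow partial products that are shifted and added, and why lower bitwidths need fewer narrow multipliers.

```python
def split_into_bricks(value, bitwidth, brick_bits=2):
    """Split an unsigned integer into little-endian chunks of brick_bits bits."""
    if value < 0 or value >= (1 << bitwidth):
        raise ValueError("value does not fit in the given bitwidth")
    mask = (1 << brick_bits) - 1
    return [(value >> shift) & mask for shift in range(0, bitwidth, brick_bits)]


def fused_multiply(a, b, a_bits, b_bits, brick_bits=2):
    """Multiply two unsigned integers using only narrow partial products.

    Each partial product multiplies one chunk of `a` by one chunk of `b`;
    the results are shifted to their bit position and accumulated.
    """
    a_parts = split_into_bricks(a, a_bits, brick_bits)
    b_parts = split_into_bricks(b, b_bits, brick_bits)
    total = 0
    for i, a_part in enumerate(a_parts):
        for j, b_part in enumerate(b_parts):
            total += (a_part * b_part) << (brick_bits * (i + j))
    return total


def bricks_needed(a_bits, b_bits, brick_bits=2):
    """Count the narrow multipliers one multiplication occupies (assumed model)."""
    ceil_div = lambda n, d: -(-n // d)
    return ceil_div(a_bits, brick_bits) * ceil_div(b_bits, brick_bits)


if __name__ == "__main__":
    print(fused_multiply(13, 11, 4, 4))   # same value as 13 * 11
    print(bricks_needed(8, 8), bricks_needed(2, 2))
```

Under this simplified model, an 8-bit by 8-bit multiplication occupies 16 two-bit multipliers while a 2-bit by 2-bit multiplication occupies one, which is the intuition behind matching the hardware to each layer's bitwidth.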
Simulation and implementation of novel deep learning hardware architectures for resource constrained devices
Corey Lammie designed mixed-signal memristive-complementary metal–oxide–semiconductor (CMOS) and field-programmable gate array (FPGA) hardware architectures, which were used to reduce the power and resource requirements of Deep Learning (DL) systems during both inference and training. Disruptive design methodologies, such as those explored in this thesis, can be used to facilitate the design of next-generation DL systems.
Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud
Neural networks (NNs) are growing in importance and complexity. A neural
network's performance (and energy efficiency) can be bound either by
computation or memory resources. The processing-in-memory (PIM) paradigm, where
computation is placed near or within memory arrays, is a viable solution to
accelerate memory-bound NNs. However, PIM architectures vary in form, where
different PIM approaches lead to different trade-offs. Our goal is to analyze,
discuss, and contrast DRAM-based PIM architectures for NN performance and
energy efficiency. To do so, we analyze three state-of-the-art PIM
architectures: (1) UPMEM, which integrates processors and DRAM arrays into a
single 2D chip; (2) Mensa, a 3D-stack-based PIM architecture tailored for edge
devices; and (3) SIMDRAM, which uses the analog principles of DRAM to execute
bit-serial operations. Our analysis reveals that PIM greatly benefits
memory-bound NNs: (1) UPMEM provides 23x the performance of a high-end GPU when
the GPU requires memory oversubscription for a general matrix-vector
multiplication kernel; (2) Mensa improves energy efficiency and throughput by
3.0x and 3.1x over the Google Edge TPU for 24 Google edge NN models; and (3)
SIMDRAM outperforms a CPU/GPU by 16.7x/1.4x for three binary NNs. We conclude
that the ideal PIM architecture for NN models depends on a model's distinct
attributes, due to the inherent architectural design choices.
Comment: This is an extended and updated version of a paper published in IEEE
Micro, pp. 1-14, 29 Aug. 2022. arXiv admin note: text overlap with
arXiv:2109.1432
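
The distinction between compute-bound and memory-bound workloads drawn in the abstract above can be made concrete with a roofline-style estimate. The sketch below is not taken from the paper: the function names are hypothetical, the traffic model assumes the matrix and both vectors are each moved exactly once, and the peak throughput and bandwidth figures in the usage lines are placeholders rather than measurements of any real device.

```python
def gemv_arithmetic_intensity(rows, cols, bytes_per_element):
    """FLOPs per byte for y = A @ x, assuming A, x, and y are each moved once."""
    flops = 2 * rows * cols  # one multiply and one add per matrix element
    bytes_moved = bytes_per_element * (rows * cols + cols + rows)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, intensity * peak_bandwidth)


def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """A kernel is memory-bound when its intensity is below the ridge point."""
    return intensity < peak_flops / peak_bandwidth


if __name__ == "__main__":
    # Placeholder machine parameters (assumptions, not real hardware figures).
    peak_flops = 10e12       # FLOP/s
    peak_bandwidth = 500e9   # bytes/s
    ai = gemv_arithmetic_intensity(1024, 1024, bytes_per_element=4)
    print(f"arithmetic intensity: {ai:.3f} FLOP/byte")
    print("memory-bound:", is_memory_bound(ai, peak_flops, peak_bandwidth))
```

With 32-bit elements, matrix-vector multiplication performs roughly half a floating-point operation per byte moved, so under these assumptions it sits far below the ridge point and is limited by memory bandwidth rather than compute. That is the class of kernel for which placing computation near memory is expected to help.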
Neural Network Methods for Radiation Detectors and Imaging
Recent advances in image data processing through machine learning and
especially deep neural networks (DNNs) allow for new optimization and
performance-enhancement schemes for radiation detectors and imaging hardware
through data-endowed artificial intelligence. We give an overview of data
generation at photon sources, deep learning-based methods for image processing
tasks, and hardware solutions for deep learning acceleration. Most existing
deep learning approaches are trained offline, typically using large amounts of
computational resources. However, once trained, DNNs can achieve fast inference
speeds and can be deployed to edge devices. A new trend is edge computing with
less energy consumption (hundreds of watts or less) and real-time analysis
potential. While popularly used for edge computing, electronic-based hardware
accelerators ranging from general purpose processors such as central processing
units (CPUs) to application-specific integrated circuits (ASICs) are constantly
reaching performance limits in latency, energy consumption, and other physical
constraints. These limits give rise to next-generation analog neuromorphic
hardware platforms, such as optical neural networks (ONNs), for highly parallel,
low-latency, and low-energy computing to boost deep learning acceleration.