Energy Efficient Hardware Design of Neural Networks
abstract: Hardware implementation of deep neural networks is gaining significant importance. Deep neural networks are mathematical models that use learning algorithms inspired by the brain. Numerous deep learning algorithms such as multi-layer perceptrons (MLP) have demonstrated human-level recognition accuracy in image and speech classification tasks. These networks are built from multiple layers of processing elements called neurons, with several connections between them called synapses. Hence, they involve operations that exhibit a high level of parallelism, making them computationally and memory intensive. Because computing resources and memory are constrained, most applications require a neural network that uses less energy. Energy-efficient implementation of these computationally intense algorithms on neuromorphic hardware demands many architectural optimizations. One of these optimizations is reducing the network size through compression, and several studies have investigated compression by introducing element-wise or row-/column-/block-wise sparsity via pruning and regularization. Additionally, numerous recent works have concentrated on reducing the precision of activations and weights, with some reducing them to a single bit. However, combining various sparsity structures with binarized or very-low-precision (2-3 bit) neural networks has not been comprehensively explored. Output activations in these deep neural network algorithms are typically non-binary, making it difficult to exploit sparsity. On the other hand, biologically realistic models like spiking neural networks (SNN) closely mimic the operations in biological nervous systems and explore new avenues for brain-like cognitive computing. These networks deal with binary spikes, and they can exploit input-dependent sparsity or redundancy to dynamically scale the amount of computation, in turn leading to energy-efficient hardware implementation.
This work discusses a configurable spiking neuromorphic architecture that supports multiple hidden layers by exploiting hardware reuse. It also presents design techniques for minimum-area/-energy DNN hardware with minimal degradation in accuracy. Area, performance and energy results of the DNN and SNN hardware are reported for the MNIST dataset. The neuromorphic hardware designed for the SNN algorithm in 28nm CMOS demonstrates high classification accuracy (>98% on MNIST) and low energy (51.4 - 773 nJ per classification). The optimized DNN hardware designed in 40nm CMOS, which combines 8X structured compression and 3-bit weight precision, showed 98.4% accuracy at 33 nJ per classification.
Dissertation/Thesis, Masters Thesis, Electrical Engineering, 201
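The combination of structured sparsity and low weight precision mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not describe the thesis's actual pruning criterion or quantizer, so magnitude-based block pruning and uniform symmetric quantization are assumed here as common stand-ins. The function names `prune_blocks` and `quantize_symmetric` and all parameters are hypothetical.

```python
def prune_blocks(matrix, block_rows, block_cols, keep_fraction):
    """Zero out whole blocks of a weight matrix, keeping the largest ones.

    Assumption: blocks are ranked by the sum of absolute weights (L1 norm).
    The thesis may use a different criterion.
    """
    rows, cols = len(matrix), len(matrix[0])
    scored = []
    for r in range(0, rows, block_rows):
        for c in range(0, cols, block_cols):
            norm = sum(
                abs(matrix[i][j])
                for i in range(r, min(r + block_rows, rows))
                for j in range(c, min(c + block_cols, cols))
            )
            scored.append((norm, r, c))
    # Keep at least one block so the layer is never emptied completely.
    n_keep = max(1, round(len(scored) * keep_fraction))
    kept = sorted(scored, reverse=True)[:n_keep]
    pruned = [[0.0] * cols for _ in range(rows)]
    for _, r, c in kept:
        for i in range(r, min(r + block_rows, rows)):
            for j in range(c, min(c + block_cols, cols)):
                pruned[i][j] = matrix[i][j]
    return pruned


def quantize_symmetric(matrix, bits=3):
    """Round weights to a signed uniform grid with 2**bits - 1 levels.

    Assumption: one scale per matrix, derived from the largest magnitude.
    """
    qmax = 2 ** (bits - 1) - 1  # 3 for 3-bit weights: levels -3..3
    peak = max(abs(w) for row in matrix for w in row)
    if peak == 0:
        return [[0.0 for _ in row] for row in matrix]
    scale = peak / qmax
    return [
        [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]
        for row in matrix
    ]
```

In this sketch, a `keep_fraction` of 1/8 would correspond to 8X compression only if the zeroed blocks are not stored at all, which is an assumption about the storage format and not something the abstract states.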
Dynamic resource allocation in a hierarchical multiprocessor system: A preliminary study
An integrated system approach to dynamic resource allocation is proposed. Some of the problems in dynamic resource allocation and the relationship of these problems to system structures are examined. A general dynamic resource allocation scheme is presented. A hierarchical system architecture which dynamically maps between processor structure and programs at multiple levels of instantiation is described. Simulation experiments were conducted to study dynamic resource allocation on the proposed system. Preliminary evaluation based on simple dynamic resource allocation algorithms indicates that with the proposed system approach, the complexity of dynamic resource management could be significantly reduced while achieving reasonably effective dynamic resource allocation.
ROOT - A C++ Framework for Petabyte Data Storage, Statistical Analysis and Visualization
ROOT is an object-oriented C++ framework conceived in the high-energy physics
(HEP) community, designed for storing and analyzing petabytes of data in an
efficient way. Any instance of a C++ class can be stored into a ROOT file in a
machine-independent compressed binary format. In ROOT the TTree object
container is optimized for statistical data analysis over very large data sets
by using vertical data storage techniques. These containers can span a large
number of files on local disks, the web, or a number of different shared file
systems. In order to analyze this data, the user can choose from a wide set of
mathematical and statistical functions, including linear algebra classes,
numerical algorithms such as integration and minimization, and various methods
for performing regression analysis (fitting). In particular, ROOT offers
packages for complex data modeling and fitting, as well as multivariate
classification based on machine learning techniques. Central to these
analysis tools are the histogram classes, which provide binning of one- and
multi-dimensional data. Results can be saved in high-quality graphical formats
like PostScript and PDF or in bitmap formats like JPG or GIF. The result can
also be stored into ROOT macros that allow a full recreation and rework of the
graphics. Users typically create their analysis macros step by step, making use
of the interactive C++ interpreter CINT, while running over small data samples.
Once the development is finished, they can run these macros at full compiled
speed over large data sets, using on-the-fly compilation, or by creating a
stand-alone batch program. Finally, if processing farms are available, the user
can reduce the execution time of intrinsically parallel tasks - e.g. data
mining in HEP - by using PROOF, which will take care of optimally distributing
the work over the available resources in a transparent way.
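The two ideas the abstract highlights, vertical (column-wise) storage and histogram binning, can be sketched in plain Python. This is not ROOT's API and does not use TTree or the ROOT histogram classes. The `Histogram1D` class, the `columns` layout, and the column names are hypothetical illustrations only.

```python
class Histogram1D:
    """Fixed-width 1D histogram with separate underflow/overflow counters."""

    def __init__(self, nbins, lo, hi):
        if nbins <= 0 or hi <= lo:
            raise ValueError("need nbins > 0 and hi > lo")
        self.nbins, self.lo, self.hi = nbins, lo, hi
        self.counts = [0] * nbins
        self.underflow = 0
        self.overflow = 0

    def fill(self, x):
        """Add one value to the bin covering the half-open range [edge, next edge)."""
        if x < self.lo:
            self.underflow += 1
        elif x >= self.hi:
            self.overflow += 1
        else:
            idx = int((x - self.lo) / (self.hi - self.lo) * self.nbins)
            # Guard against floating-point rounding at the upper edge.
            self.counts[min(idx, self.nbins - 1)] += 1


# Column-wise layout: each variable is stored contiguously, so an analysis
# that needs only "energy" never touches the "charge" values.
columns = {
    "energy": [0.5, 1.0, 3.9, 4.0, 7.99, 8.0, -1.0],
    "charge": [1, -1, 1, 1, -1, 1, -1],
}

hist = Histogram1D(nbins=4, lo=0.0, hi=8.0)
for value in columns["energy"]:
    hist.fill(value)
```

The point of the column-wise layout in this sketch is that filling the histogram reads a single list; in a row-wise layout every event record would have to be visited even though only one field is used.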
Analysis and Modeling of Advanced PIM Architecture Design Tradeoffs
A major trend in high performance computer architecture over the last two decades is the migration of memory, in the form of high speed caches, onto the microprocessor semiconductor die. Where temporal locality in the computation is high, caches prove very effective at hiding memory access latency and contention for communication resources. However, where temporal locality is absent, caches may exhibit low hit rates, resulting in poor operational efficiency. Vector computing, which exploits pipelined arithmetic units and memory access, addresses this challenge for certain forms of data access patterns, for example those involving long contiguous data sets exhibiting high spatial locality. But for many advanced applications in science, technology, and national security, at least some data access patterns do not conform to the restricted forms handled well by either caches or vector processing. An important alternative is the reverse strategy: migrating logic into the main memory (DRAM) and performing those operations directly on the data stored there. Processor in Memory (PIM) architecture has advanced to the point where it may fill this role and provide an important new mechanism for improving the performance and efficiency of future supercomputers for a broad range of applications. One important project considering both the role of PIM in supercomputer architecture and the design of such PIM components is the Cray Cascade Project sponsored by the DARPA High Productivity Computing Program. Cascade is a Petaflops scale computer targeted for deployment at the end of the decade that merges the raw speed of an advanced custom vector architecture with the high memory bandwidth processing delivered by an innovative class of PIM architecture. The work represented here was performed under the Cascade project to explore critical design space issues that will determine the value of PIM in supercomputers and contribute to the optimization of its design.
This work also has strong relevance to hybrid systems comprising a combination of conventional microprocessors and advanced PIM based intelligent main memory.
Containing Analog Data Deluge at Edge through Frequency-Domain Compression in Collaborative Compute-in-Memory Networks
Edge computing is a promising solution for handling high-dimensional,
multispectral analog data from sensors and IoT devices for applications such as
autonomous drones. However, edge devices' limited storage and computing
resources make it challenging to perform complex predictive modeling at the
edge. Compute-in-memory (CiM) has emerged as a principal paradigm to minimize
energy for deep learning-based inference at the edge. Nevertheless, integrating
storage and processing complicates memory cells and/or memory peripherals,
essentially trading off area efficiency for energy efficiency. This paper
proposes a novel solution to improve area efficiency in deep learning inference
tasks. The proposed method employs two key strategies. Firstly, a
frequency-domain learning approach uses binarized Walsh-Hadamard Transforms, reducing the
number of parameters needed for the DNN (by 87% in MobileNetV2) and enabling
compute-in-SRAM, which better utilizes parallelism during inference. Secondly,
a memory-immersed collaborative digitization method is described among CiM
arrays to reduce the area overheads of conventional ADCs. This facilitates more
CiM arrays in limited footprint designs, leading to better parallelism and
reduced external memory accesses. Different networking configurations are
explored, where Flash, SA, and their hybrid digitization steps can be
implemented using the memory-immersed scheme. The results are demonstrated
using a 65 nm CMOS test chip, exhibiting significant area and energy savings
compared to a 40 nm-node 5-bit SAR ADC and 5-bit Flash ADC. By processing
analog data more efficiently, it is possible to selectively retain valuable
data from sensors and alleviate the challenges posed by the analog data deluge.
Comment: arXiv admin note: text overlap with arXiv:2307.03863,
arXiv:2309.0177
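For readers unfamiliar with the Walsh-Hadamard Transform mentioned in the abstract, a minimal Python sketch of the standard fast transform follows. The abstract does not specify how the paper binarizes or learns in this domain, so this shows only the textbook butterfly algorithm in natural (Hadamard) ordering. The function name `fwht` is hypothetical.

```python
def fwht(values):
    """Fast Walsh-Hadamard transform, unnormalized, natural (Hadamard) order.

    The input length must be a power of two. Only additions and
    subtractions are used, because every Hadamard matrix entry is +1 or -1.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine pairs of elements that are h apart (one butterfly stage).
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


signal = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(signal)          # [4, 2, 0, -2, 0, 2, 0, 2]
# Applying the transform twice returns the input scaled by its length.
recovered = [v // len(signal) for v in fwht(spectrum)]
```

Because the transform needs no multiplications and has no trainable coefficients of its own, it is plausible that it suits in-memory computation and reduces parameter count, but the specific mechanism behind the 87% reduction is described in the paper and not in this sketch.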