MIMONets: Multiple-Input-Multiple-Output Neural Networks Exploiting Computation in Superposition
With the advent of deep learning, progressively larger neural networks have
been designed to solve complex tasks. We take advantage of these capacity-rich
models to lower the cost of inference by exploiting computation in
superposition. To reduce the computational burden per input, we propose
Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling
many inputs at once. MIMONets augment various deep neural network architectures
with variable binding mechanisms to represent an arbitrary number of inputs in
a compositional data structure via fixed-width distributed representations.
Accordingly, MIMONets adapt nonlinear neural transformations to process the
data structure holistically, leading to a speedup nearly proportional to the
number of superposed input items in the data structure. After processing in
superposition, an unbinding mechanism recovers each transformed input of
interest. MIMONets also provide a dynamic trade-off between accuracy and
throughput via instantaneous on-demand switching between a set of
accuracy-throughput operating points, all within a single set of fixed
parameters. We apply the concept of MIMONets to both CNN and Transformer
architectures, resulting in MIMOConv and MIMOFormer, respectively. Empirical
evaluations show that MIMOConv achieves about a 2-4× speedup at an accuracy
delta within [+0.68, -3.18]% compared to WideResNet CNNs on CIFAR10 and
CIFAR100. Similarly, MIMOFormer can handle 2-4 inputs at once while maintaining
a high average accuracy within a [-1.07, -3.43]% delta on the Long Range Arena
benchmark. Finally, we provide mathematical bounds on the interference between
superposition channels in MIMOFormer. Our code is available at
https://github.com/IBM/multiple-input-multiple-output-nets.
Comment: Accepted at NeurIPS 2023.
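The core binding-superposition-unbinding idea is easy to demonstrate. Below is
a minimal NumPy sketch, assuming elementwise binding with random bipolar keys
(one common variable-binding scheme in vector-symbolic architectures; the
paper's actual mechanisms and trained models are in the linked repository):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_inputs = 2048, 4

# One random bipolar key per superposition channel (elementwise binding is
# just one possible scheme; the paper's mechanisms are in the linked repo).
keys = rng.choice([-1.0, 1.0], size=(n_inputs, dim))
inputs = rng.standard_normal((n_inputs, dim))

# Bind each input to its key and superpose into one fixed-width vector.
superposed = (keys * inputs).sum(axis=0)

# Unbind channel 0: inputs[0] reappears plus zero-mean crosstalk from the
# other channels, which is uncorrelated with every stored item.
recovered = keys[0] * superposed

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(recovered, inputs[0]))  # clearly positive (~0.5 for 4 channels)
print(cos(recovered, inputs[1]))  # ~0: other channels are suppressed
```

The residual crosstalk in the recovered vector is exactly the interference
between superposition channels that the paper bounds for MIMOFormer.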
XNORBIN: A 95 TOp/s/W Hardware Accelerator for Binary Convolutional Neural Networks
Deploying state-of-the-art CNNs requires power-hungry processors and off-chip
memory. This precludes the implementation of CNNs in low-power embedded
systems. Recent research shows that CNNs can sustain extreme quantization,
binarizing their weights and intermediate feature maps, thereby saving 8-32×
memory and
collapsing energy-intensive sum-of-products into XNOR-and-popcount operations.
We present XNORBIN, an accelerator for binary CNNs with computation tightly
coupled to memory for aggressive data reuse. Implemented in UMC 65 nm
technology, XNORBIN achieves an energy efficiency of 95 TOp/s/W and an area
efficiency of 2.0 TOp/s/MGE at 0.8 V.
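The arithmetic that makes this possible is compact: for weights and
activations constrained to {-1, +1} and packed into machine words, a dot
product collapses to an XNOR followed by a popcount. A minimal Python sketch
illustrating the identity only (not XNORBIN's datapath):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64

a = rng.choice([-1, 1], size=n)
b = rng.choice([-1, 1], size=n)

# Pack {-1, +1} vectors into machine words: +1 -> bit 1, -1 -> bit 0.
pack = lambda v: int("".join("1" if s == 1 else "0" for s in v), 2)
wa, wb = pack(a), pack(b)

# XNOR marks the positions where the two signs agree ...
mask = (1 << n) - 1
agree = ~(wa ^ wb) & mask

# ... and popcount counts them: dot(a, b) = agreements - disagreements.
popcount = bin(agree).count("1")
assert int(a @ b) == 2 * popcount - n
```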
WHYPE: A Scale-Out Architecture with Wireless Over-the-Air Majority for Scalable In-memory Hyperdimensional Computing
Hyperdimensional computing (HDC) is an emerging computing paradigm that
represents, manipulates, and communicates data using long random vectors known
as hypervectors. Among different hardware platforms capable of executing HDC
algorithms, in-memory computing (IMC) has shown promise as it is very efficient
in performing matrix-vector multiplications, which are common in the HDC
algebra. Although HDC architectures based on IMC already exist, how to scale
them remains a key challenge due to the collective communication patterns that
these architectures require and that traditional chip-scale networks were not
designed for. To cope with this difficulty, we propose a scale-out HDC
architecture called WHYPE, which uses wireless in-package communication
technology to interconnect a large number of physically distributed IMC cores
that either encode hypervectors or perform multiple similarity searches in
parallel. In this context, the key enabler of WHYPE is the opportunistic use of
the wireless network as a medium for over-the-air computation. WHYPE implements
an optimized source coding that allows receivers to calculate the bit-wise
majority of multiple hypervectors (a useful operation in HDC) being transmitted
concurrently over the wireless channel. By doing so, we achieve a joint
broadcast distribution and computation with a performance and efficiency
unattainable with wired interconnects, which in turn enables massive
parallelization of the architecture. Through evaluations at the on-chip network
and complete architecture levels, we demonstrate that WHYPE can bundle and
distribute hypervectors faster and more efficiently than a hypothetical wired
implementation, and that it scales well to tens of receivers. We show that the
average error rate of the majority computation is low, such that it has
negligible impact on the accuracy of HDC classification tasks.
Comment: Accepted at the IEEE Journal on Emerging and Selected Topics in
Circuits and Systems (JETCAS).
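The operation computed over the air is a per-position majority vote across
concurrently transmitted binary hypervectors. A minimal digital sketch of that
bundling step (the paper's contribution is realizing it analogically in the
wireless channel, together with the source coding that makes it robust):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_tx = 10_000, 5      # long hypervectors, odd number of transmitters

# One binary hypervector per concurrently transmitting IMC core.
hvs = rng.integers(0, 2, size=(n_tx, dim), dtype=np.uint8)

# Bit-wise majority: a position is 1 iff more than half the senders drive
# a 1 there; over the air, the superposed signal amplitude plays the role
# of this column sum.
bundled = (hvs.sum(axis=0) > n_tx // 2).astype(np.uint8)

# The bundle stays similar to each constituent, the property HDC relies on.
print([float((bundled == hv).mean()) for hv in hvs])  # all well above 0.5
```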
ESSOP: Efficient and Scalable Stochastic Outer Product Architecture for Deep Learning
Deep neural networks (DNNs) have surpassed human-level accuracy in a variety
of cognitive tasks but at the cost of significant memory/time requirements in
DNN training. This limits their deployment in energy- and memory-limited
applications that require real-time learning. Matrix-vector multiplications
(MVM) and vector-vector outer product (VVOP) are the two most expensive
operations associated with the training of DNNs. Strategies to improve the
efficiency of MVM computation in hardware have been demonstrated with minimal
impact on training accuracy. However, the VVOP computation remains a relatively
less explored bottleneck even with the aforementioned strategies. Stochastic
computing (SC) has been proposed to improve the efficiency of VVOP computation
but on relatively shallow networks with bounded activation functions and
floating-point (FP) scaling of activation gradients. In this paper, we propose
ESSOP, an efficient and scalable stochastic outer product architecture based on
the SC paradigm. We introduce efficient techniques to generalize SC for weight
update computation in DNNs with unbounded activation functions (e.g., ReLU),
as required by many state-of-the-art networks. Our architecture reduces the
computational cost by reusing random numbers and replacing certain FP
multiplication operations with bit-shift scaling. We show that the ResNet-32
network with 33 convolution layers and a fully-connected layer can be trained
with ESSOP on the CIFAR-10 dataset to achieve accuracy comparable to the
baseline. A hardware design of ESSOP at the 14 nm technology node shows that,
compared to a highly pipelined FP16 multiplier design, ESSOP is 82.2% and
93.7% better in energy and area efficiency, respectively, for outer product
computation.
Comment: 5 pages, 5 figures. Accepted at ISCAS 2020 for publication.
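The underlying stochastic-computing idea can be sketched briefly: values in
[0, 1] become Bernoulli bit streams, and an outer-product entry becomes the
AND of two streams. The sketch below is a simplified illustration under that
unipolar encoding, not ESSOP's actual circuit; variable names and magnitudes
are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 4096                          # bit-stream length (error ~ 1/sqrt(L))

x = np.array([0.8, 0.3, 0.6])     # activations, encoded in [0, 1]
d = np.array([0.5, 0.9])          # backpropagated errors, encoded in [0, 1]

# Unipolar SC encoding: value p becomes a Bernoulli(p) bit stream. Sharing
# one random sequence across a whole vector (rather than drawing fresh
# numbers per element) is the flavor of random-number reuse ESSOP exploits;
# the exact scheme and the bit-shift scaling are detailed in the paper.
rx, rd = rng.random(L), rng.random(L)
X = (x[:, None] > rx).astype(np.uint8)        # shape (3, L)
D = (d[:, None] > rd).astype(np.uint8)        # shape (2, L)

# Each outer-product entry is the AND of two independent bit streams,
# averaged over the stream: E[D_i & X_j] = d_i * x_j.
outer_sc = (D[:, None, :] & X[None, :, :]).mean(axis=2)

print(np.round(outer_sc, 2))      # stochastic estimate
print(np.outer(d, x))             # floating-point reference
```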
Scale up your In-Memory Accelerator: leveraging wireless-on-chip communication for AIMC-based CNN inference
Analog In-Memory Computing (AIMC) is emerging as a disruptive paradigm for heterogeneous computing, potentially delivering orders-of-magnitude better peak performance and efficiency than traditional digital signal processing architectures on matrix-vector multiplication. However, to sustain this throughput in real-world applications, AIMC tiles must be supplied with data at very high bandwidth and low latency; this puts unprecedented pressure on the on-chip communication infrastructure, which becomes the system's performance and efficiency bottleneck. In this context, the performance and plasticity of emerging on-chip wireless communication paradigms provide the breakthrough required to scale up on-chip communication in large AIMC devices. This work presents a many-tile AIMC architecture with inter-tile wireless communication that integrates multiple heterogeneous computing clusters, embedding a mix of parallel RISC-V cores and AIMC tiles. We perform an extensive design space exploration of the proposed architecture and discuss the benefits of exploiting emerging on-chip communication technologies such as wireless transceivers in the millimeter-wave and terahertz bands. This work was supported by the WiPLASH project (g.a. 863337), funded by the European Union's Horizon 2020 research and innovation program.
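To make the bandwidth pressure concrete, here is a toy behavioral model of a
single AIMC tile (noise magnitudes are illustrative assumptions of ours, not
the paper's device data): because one analog step computes an entire
matrix-vector product, the interconnect must deliver a fresh input vector
every step to keep the tile busy.

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols = 256, 256

# Conductance-encoded weight matrix with static programming noise, plus
# read noise on every analog evaluation (illustrative magnitudes only).
W = rng.standard_normal((rows, cols)) * 0.1
W_prog = W + rng.normal(scale=0.01, size=W.shape)

def aimc_mvm(x):
    # One "analog" step: roughly O(1) latency regardless of matrix size,
    # which is exactly why delivering inputs fast enough becomes the
    # system bottleneck the paper addresses with wireless links.
    return W_prog @ x + rng.normal(scale=0.02, size=rows)

x = rng.standard_normal(cols)
err = np.linalg.norm(aimc_mvm(x) - W @ x) / np.linalg.norm(W @ x)
print(f"relative error from analog non-idealities: {err:.3f}")
```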
Mixed-precision deep learning based on computational memory
Deep neural networks (DNNs) have revolutionized the field of artificial
intelligence and have achieved unprecedented success in cognitive tasks such as
image and speech recognition. Training of large DNNs, however, is
computationally intensive and this has motivated the search for novel computing
architectures targeting this application. A computational memory unit with
nanoscale resistive memory devices organized in crossbar arrays could store the
synaptic weights in their conductance states and perform the expensive weighted
summations in place in a non-von Neumann manner. However, updating the
conductance states in a reliable manner during the weight update process is a
fundamental challenge that limits the training accuracy of such an
implementation. Here, we propose a mixed-precision architecture that combines a
computational memory unit performing the weighted summations and imprecise
conductance updates with a digital processing unit that accumulates the weight
updates in high precision. A combined hardware/software training experiment of
a multilayer perceptron based on the proposed architecture using a phase-change
memory (PCM) array achieves 97.73% test accuracy on the task of classifying
handwritten digits (based on the MNIST dataset), within 0.6% of the software
baseline. The architecture is further evaluated using accurate behavioral
models of PCM on a wide class of networks, namely convolutional neural
networks, long short-term memory networks, and generative adversarial networks.
Accuracies comparable to those of floating-point implementations are achieved
without being constrained by the non-idealities associated with the PCM
devices. A system-level study demonstrates a 173× improvement in the energy
efficiency of the architecture when used for training a multilayer perceptron,
compared with a dedicated fully digital 32-bit implementation.
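The mixed-precision update rule can be sketched in a few lines: weight updates
accumulate in a high-precision digital variable, and only when the
accumulation crosses the device's programming granularity are coarse
conductance pulses issued. A minimal single-weight sketch (variable names and
magnitudes are ours; real PCM pulses are imprecise and asymmetric):

```python
import numpy as np

rng = np.random.default_rng(0)

eps = 0.01        # smallest conductance change one programming pulse gives
chi = 0.0         # high-precision digital accumulator (one per weight)
g = 0.0           # conductance-encoded weight, idealized here
lr = 0.1

for _ in range(1000):
    grad = rng.normal(loc=0.002, scale=0.01)   # stand-in for a real gradient
    chi -= lr * grad                           # accumulate the FP update

    # Program the device only when the accumulated update is representable;
    # the remainder stays in the accumulator instead of being lost.
    n_pulses = int(chi / eps)                  # truncates toward zero
    if n_pulses:
        g += n_pulses * eps                    # imprecise on real PCM
        chi -= n_pulses * eps

print(f"device weight: {g:.3f}, residual in accumulator: {chi:+.4f}")
```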
In-memory Realization of In-situ Few-shot Continual Learning with a Dynamically Evolving Explicit Memory
Continually learning new classes from a few training examples without
forgetting previous old classes demands a flexible architecture with an
inevitably growing portion of storage, in which new examples and classes can be
incrementally stored and efficiently retrieved. One viable architectural
solution is to tightly couple a stationary deep neural network to a dynamically
evolving explicit memory (EM). As the centerpiece of this architecture, we
propose an EM unit that leverages energy-efficient in-memory compute (IMC)
cores during the course of continual learning operations. We demonstrate for
the first time how the EM unit can physically superpose multiple training
examples, expand to accommodate unseen classes, and perform similarity search
during inference, using operations on an IMC core based on phase-change memory
(PCM). Specifically, the physical superposition of a few encoded training
examples is realized via in-situ progressive crystallization of PCM devices.
The classification accuracy achieved on the IMC core remains within
1.28%-2.5% of that of the state-of-the-art full-precision baseline
software model on both the CIFAR-100 and miniImageNet datasets when continually
learning 40 novel classes (from only five examples per class) on top of 60 old
classes.
Comment: Accepted at the European Solid-State Devices and Circuits Conference
(ESSDERC), September 2022.
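Functionally, the EM unit behaves like class prototypes built by superposing
encoded examples and queried by similarity search; in the paper, both steps
run physically on the PCM core. A minimal software analogue with random
stand-in embeddings (illustration of the mechanics only):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_classes, shots = 512, 10, 5

def encode(x):
    # Stand-in for the frozen deep network's embedding of an example.
    return x / np.linalg.norm(x)

# Superpose the few encoded examples of each class into one EM row. On the
# PCM core this accumulation happens physically, via in-situ progressive
# crystallization; here it is just a running sum.
em = np.zeros((n_classes, dim))
for c in range(n_classes):
    for _ in range(shots):
        em[c] += encode(rng.standard_normal(dim))

# Unseen classes simply append rows: the memory grows, the network does not.
em = np.vstack([em, encode(rng.standard_normal(dim))])

# Inference is a similarity search, i.e. one matrix-vector product that the
# IMC core performs in place.
query = encode(rng.standard_normal(dim))
print(int(np.argmax(em @ query)))
```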
Graphene-based Wireless Agile Interconnects for Massive Heterogeneous Multi-chip Processors
The main design principles in computer architecture have recently shifted
from a monolithic scaling-driven approach to the development of heterogeneous
architectures that tightly co-integrate multiple specialized processor and
memory chiplets. In such data-hungry multi-chip architectures, current
Networks-in-Package (NiPs) may not be enough to cater to their heterogeneous
and fast-changing communication demands. This position paper makes the case for
wireless in-package nanonetworking as the enabler of efficient and versatile
wired-wireless interconnect fabrics for massive heterogeneous processors. To
that end, the use of graphene-based antennas and transceivers with unique
frequency-beam reconfigurability in the terahertz band is proposed. The
feasibility of such a nanonetworking vision and the main research challenges
towards its realization are analyzed from the technological, communications,
and computer architecture perspectives.
Comment: 8 pages, 4 figures, 1 table. Accepted at IEEE Wireless
Communications Magazine.