21 research outputs found

    XNORBIN: A 95 TOp/s/W Hardware Accelerator for Binary Convolutional Neural Networks

    Deploying state-of-the-art CNNs requires power-hungry processors and off-chip memory. This precludes the implementation of CNNs in low-power embedded systems. Recent research shows that CNNs sustain extreme quantization, binarizing their weights and intermediate feature maps, thereby saving 8-32x in memory and collapsing energy-intensive sum-of-products into XNOR-and-popcount operations. We present XNORBIN, an accelerator for binary CNNs with computation tightly coupled to memory for aggressive data reuse. Implemented in UMC 65 nm technology, XNORBIN achieves an energy efficiency of 95 TOp/s/W and an area efficiency of 2.0 TOp/s/MGE at 0.8 V.
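
    As a concrete illustration of the XNOR-and-popcount trick mentioned above, the short Python sketch below (illustrative only, not XNORBIN's datapath) shows how a dot product over {-1, +1} weights and activations collapses into an XNOR followed by a popcount.

    ```python
    import numpy as np

    def binarize(x):
        """Map real values to signs in {-1, +1} and to their bit encoding {0, 1}."""
        signs = np.where(x >= 0, 1, -1)
        bits = (signs > 0).astype(np.uint8)
        return signs, bits

    def xnor_popcount_dot(bits_a, bits_b):
        """Dot product of two {-1, +1} vectors computed from their bit encodings:
        dot = 2 * popcount(XNOR(a, b)) - n."""
        n = bits_a.size
        matches = np.logical_not(np.logical_xor(bits_a, bits_b))  # XNOR
        return 2 * int(np.count_nonzero(matches)) - n             # popcount, rescaled

    rng = np.random.default_rng(0)
    w, x = rng.standard_normal(256), rng.standard_normal(256)
    (sw, bw), (sx, bx) = binarize(w), binarize(x)
    assert xnor_popcount_dot(bw, bx) == int(np.dot(sw, sx))  # both paths agree exactly
    ```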

    WHYPE: A Scale-Out Architecture with Wireless Over-the-Air Majority for Scalable In-memory Hyperdimensional Computing

    Hyperdimensional computing (HDC) is an emerging computing paradigm that represents, manipulates, and communicates data using long random vectors known as hypervectors. Among the hardware platforms capable of executing HDC algorithms, in-memory computing (IMC) has shown promise, as it is very efficient at performing matrix-vector multiplications, which are common in the HDC algebra. Although HDC architectures based on IMC already exist, scaling them remains a key challenge due to the collective communication patterns that these architectures require and that traditional chip-scale networks were not designed for. To cope with this difficulty, we propose a scale-out HDC architecture called WHYPE, which uses wireless in-package communication technology to interconnect a large number of physically distributed IMC cores that either encode hypervectors or perform multiple similarity searches in parallel. In this context, the key enabler of WHYPE is the opportunistic use of the wireless network as a medium for over-the-air computation. WHYPE implements an optimized source coding that allows receivers to calculate the bit-wise majority of multiple hypervectors (a useful operation in HDC) transmitted concurrently over the wireless channel. By doing so, we achieve joint broadcast distribution and computation with a performance and efficiency unattainable with wired interconnects, which in turn enables massive parallelization of the architecture. Through evaluations at the on-chip network and complete architecture levels, we demonstrate that WHYPE can bundle and distribute hypervectors faster and more efficiently than a hypothetical wired implementation, and that it scales well to tens of receivers. We show that the average error rate of the majority computation is low, such that it has negligible impact on the accuracy of HDC classification tasks.
    Comment: Accepted at IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS). arXiv admin note: text overlap with arXiv:2205.1088
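
    The bit-wise majority ("bundling") that the receivers compute over the air can be rendered in plain NumPy; the sketch below only illustrates the operation itself (the tie-breaking rule, vector count, and dimensionality are my assumptions, not details taken from the paper).

    ```python
    import numpy as np

    def majority_bundle(hypervectors, rng=None):
        """Bit-wise majority of a stack of {0, 1} hypervectors; ties (possible for
        an even number of vectors) are broken at random."""
        hv = np.asarray(hypervectors, dtype=np.uint8)
        k = hv.shape[0]
        counts = hv.sum(axis=0)
        out = (2 * counts > k).astype(np.uint8)
        ties = 2 * counts == k
        if ties.any():
            rng = rng or np.random.default_rng()
            out[ties] = rng.integers(0, 2, size=int(ties.sum()), dtype=np.uint8)
        return out

    rng = np.random.default_rng(1)
    hvs = rng.integers(0, 2, size=(5, 10_000), dtype=np.uint8)  # 5 hypervectors, D = 10,000
    bundled = majority_bundle(hvs, rng)
    # The bundle stays similar to each component: normalized Hamming distance < 0.5.
    print([round(np.count_nonzero(bundled ^ h) / bundled.size, 2) for h in hvs])
    ```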

    ESSOP: Efficient and Scalable Stochastic Outer Product Architecture for Deep Learning

    Deep neural networks (DNNs) have surpassed human-level accuracy in a variety of cognitive tasks, but at the cost of significant memory/time requirements for DNN training. This limits their deployment in energy- and memory-limited applications that require real-time learning. Matrix-vector multiplications (MVM) and vector-vector outer products (VVOP) are the two most expensive operations associated with the training of DNNs. Strategies to improve the efficiency of MVM computation in hardware have been demonstrated with minimal impact on training accuracy. However, the VVOP computation remains a relatively less explored bottleneck even with the aforementioned strategies. Stochastic computing (SC) has been proposed to improve the efficiency of VVOP computation, but only on relatively shallow networks with bounded activation functions and floating-point (FP) scaling of activation gradients. In this paper, we propose ESSOP, an efficient and scalable stochastic outer product architecture based on the SC paradigm. We introduce efficient techniques to generalize SC for weight update computation in DNNs with unbounded activation functions (e.g., ReLU), as required by many state-of-the-art networks. Our architecture reduces the computational cost by reusing random numbers and replacing certain FP multiplication operations with bit-shift scaling. We show that the ResNet-32 network with 33 convolution layers and a fully-connected layer can be trained with ESSOP on the CIFAR-10 dataset to achieve accuracy comparable to the baseline. A hardware design of ESSOP at the 14 nm technology node shows that, compared to a highly pipelined FP16 multiplier design, ESSOP is 82.2% and 93.7% better in energy and area efficiency, respectively, for outer product computation.
    Comment: 5 pages. 5 figures. Accepted at ISCAS 2020 for publication.
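
    To make the stochastic-computing idea concrete, the sketch below estimates a vector-vector outer product from coincidences of Bernoulli bit streams, reusing one random draw per row and one per column each cycle. It is a simplified illustration of the general SC approach, not ESSOP's actual datapath (which, per the abstract, additionally replaces certain FP scalings with bit shifts).

    ```python
    import numpy as np

    def stochastic_outer_product(x, delta, n_cycles=256, rng=None):
        """Unbiased estimate of np.outer(delta, x) for values scaled into [-1, 1]."""
        rng = rng or np.random.default_rng()
        sign = np.sign(np.outer(delta, x))
        acc = np.zeros((delta.size, x.size))
        for _ in range(n_cycles):
            r_rows = rng.random(delta.size)  # one random draw per row ...
            r_cols = rng.random(x.size)      # ... and one per column, shared across the matrix
            row_fire = (r_rows < np.abs(delta))[:, None]
            col_fire = (r_cols < np.abs(x))[None, :]
            acc += np.logical_and(row_fire, col_fire)  # coincidence probability = |delta_i| * |x_j|
        return sign * acc / n_cycles

    rng = np.random.default_rng(2)
    x, d = rng.uniform(-1, 1, 8), rng.uniform(-1, 1, 4)
    est, exact = stochastic_outer_product(x, d, n_cycles=4096, rng=rng), np.outer(d, x)
    print(np.max(np.abs(est - exact)))  # small estimation error, shrinking with more cycles
    ```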

    Scale up your In-Memory Accelerator: leveraging wireless-on-chip communication for AIMC-based CNN inference

    Analog In-Memory Computing (AIMC) is emerging as a disruptive paradigm for heterogeneous computing, potentially delivering orders of magnitude better peak performance and efficiency than traditional digital signal processing architectures on matrix-vector multiplication. However, to sustain this throughput in real-world applications, AIMC tiles must be supplied with data at very high bandwidth and low latency; this puts unprecedented pressure on the on-chip communication infrastructure, which becomes the system's performance and efficiency bottleneck. In this context, the performance and plasticity of emerging on-chip wireless communication paradigms provide the required breakthrough to scale up on-chip communication in large AIMC devices. This work presents a many-tile AIMC architecture with inter-tile wireless communication that integrates multiple heterogeneous computing clusters, embedding a mix of parallel RISC-V cores and AIMC tiles. We perform an extensive design space exploration of the proposed architecture and discuss the benefits of exploiting emerging on-chip communication technologies such as wireless transceivers in the millimeter-wave and terahertz bands.
    This work was supported by the WiPLASH project (g.a. 863337), funded by the European Union’s Horizon 2020 research and innovation programme. Peer Reviewed. Postprint (author's final draft).
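
    A back-of-the-envelope calculation illustrates why feeding AIMC tiles stresses the interconnect; the crossbar size, MVM rate, and input precision below are illustrative assumptions, not figures from the paper.

    ```python
    # Rough per-tile throughput and input-bandwidth estimate (assumed parameters).
    rows, cols = 256, 256   # crossbar dimensions (assumed)
    mvm_rate_hz = 10e6      # one full MVM every 100 ns (assumed integration time)
    in_bits = 8             # input precision per vector element (assumed)

    ops_per_mvm = 2 * rows * cols                        # each MAC counted as 2 ops
    throughput_tops = ops_per_mvm * mvm_rate_hz / 1e12
    input_bw_gbps = cols * in_bits * mvm_rate_hz / 1e9   # bandwidth just to stream inputs

    print(f"{throughput_tops:.2f} TOPS per tile needs ~{input_bw_gbps:.0f} Gb/s of input data")
    # A few dozen such tiles already demand hundreds of Gb/s of low-latency, often
    # broadcast-style traffic, which is what motivates the wireless fabric studied here.
    ```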

    Mixed-precision deep learning based on computational memory

    Deep neural networks (DNNs) have revolutionized the field of artificial intelligence and have achieved unprecedented success in cognitive tasks such as image and speech recognition. Training of large DNNs, however, is computationally intensive, and this has motivated the search for novel computing architectures targeting this application. A computational memory unit with nanoscale resistive memory devices organized in crossbar arrays could store the synaptic weights in their conductance states and perform the expensive weighted summations in place in a non-von Neumann manner. However, updating the conductance states in a reliable manner during the weight-update process is a fundamental challenge that limits the training accuracy of such an implementation. Here, we propose a mixed-precision architecture that combines a computational memory unit performing the weighted summations and imprecise conductance updates with a digital processing unit that accumulates the weight updates in high precision. A combined hardware/software training experiment of a multilayer perceptron based on the proposed architecture using a phase-change memory (PCM) array achieves 97.73% test accuracy on the task of classifying handwritten digits (based on the MNIST dataset), within 0.6% of the software baseline. The architecture is further evaluated using accurate behavioral models of PCM on a wide class of networks, namely convolutional neural networks, long short-term memory networks, and generative adversarial networks. Accuracies comparable to those of floating-point implementations are achieved without being constrained by the non-idealities associated with the PCM devices. A system-level study demonstrates a 173x improvement in the energy efficiency of the architecture when used for training a multilayer perceptron, compared with a dedicated fully digital 32-bit implementation.
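
    The mixed-precision idea described above can be sketched in a few lines: the digital unit accumulates weight updates in high precision and only transfers them to the imprecise analog weights in coarse, noisy increments once they exceed the device update granularity. The transfer rule, device model, and parameters below are illustrative assumptions, not the experimental PCM setup.

    ```python
    import numpy as np

    rng = np.random.default_rng(4)

    def mixed_precision_step(w_analog, chi, grad, lr=0.1, eps=0.05, write_noise=0.3):
        """Accumulate lr*grad in the high-precision variable chi; when |chi| exceeds
        the granularity eps, apply that many coarse, noisy updates to the analog weight."""
        chi = chi - lr * grad
        n_pulses = np.trunc(chi / eps)             # integer number of device updates
        noisy = eps * n_pulses * (1 + write_noise * rng.standard_normal(w_analog.shape))
        w_analog = w_analog + noisy                # imprecise in-memory conductance update
        chi = chi - eps * n_pulses                 # keep only the residue in high precision
        return w_analog, chi

    # Toy usage: drive a single weight toward a target despite coarse, noisy writes.
    w, chi, target = np.zeros(1), np.zeros(1), 0.8
    for _ in range(200):
        grad = w - target                          # gradient of 0.5 * (w - target)**2
        w, chi = mixed_precision_step(w, chi, grad)
    print(round(w.item(), 2))                      # hovers near the target 0.8
    ```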

    In-memory Realization of In-situ Few-shot Continual Learning with a Dynamically Evolving Explicit Memory

    Continually learning new classes from a few training examples without forgetting previous old classes demands a flexible architecture with an inevitably growing portion of storage, in which new examples and classes can be incrementally stored and efficiently retrieved. One viable architectural solution is to tightly couple a stationary deep neural network to a dynamically evolving explicit memory (EM). As the centerpiece of this architecture, we propose an EM unit that leverages energy-efficient in-memory computing (IMC) cores during the course of continual learning operations. We demonstrate for the first time how the EM unit can physically superpose multiple training examples, expand to accommodate unseen classes, and perform similarity search during inference, using operations on an IMC core based on phase-change memory (PCM). Specifically, the physical superposition of a few encoded training examples is realized via in-situ progressive crystallization of PCM devices. The classification accuracy achieved on the IMC core remains within 1.28%-2.5% of that of the state-of-the-art full-precision baseline software model on both the CIFAR-100 and miniImageNet datasets when continually learning 40 novel classes (from only five examples per class) on top of 60 old classes.
    Comment: Accepted at the European Solid-state Devices and Circuits Conference (ESSDERC), September 202
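
    At the software level, the explicit-memory mechanism amounts to superposing encoded support examples into per-class prototypes and classifying by similarity search, as in the sketch below; the encoder, dimensionality, and labels are placeholders of mine, and in the paper the superposition is realized in situ in PCM devices rather than in floating point.

    ```python
    import numpy as np

    D = 10_000
    proto = {}                                   # class label -> accumulated prototype

    def encode(x):
        """Placeholder encoder mapping an input to a bipolar hypervector (deterministic per input)."""
        g = np.random.default_rng(abs(hash(x)) % 2**32)
        return g.choice([-1.0, 1.0], size=D)

    def learn(label, examples):
        """Superpose (accumulate) the encoded examples onto the class prototype;
        an unseen label simply adds a new entry, so the memory grows with new classes."""
        vec = sum(encode(e) for e in examples)
        proto[label] = proto.get(label, np.zeros(D)) + vec

    def classify(x):
        """Similarity search: return the label whose prototype is most aligned with the query."""
        q = encode(x)
        return max(proto, key=lambda c: np.dot(proto[c], q) / np.linalg.norm(proto[c]))

    learn("cat", ["cat_1", "cat_2", "cat_3"])
    learn("dog", ["dog_1", "dog_2"])
    print(classify("cat_2"), classify("dog_1"))  # -> cat dog
    ```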

    Graphene-based Wireless Agile Interconnects for Massive Heterogeneous Multi-chip Processors

    The main design principles in computer architecture have recently shifted from a monolithic scaling-driven approach to the development of heterogeneous architectures that tightly co-integrate multiple specialized processor and memory chiplets. In such data-hungry multi-chip architectures, current Networks-in-Package (NiPs) may not be enough to cater to their heterogeneous and fast-changing communication demands. This position paper makes the case for wireless in-package nanonetworking as the enabler of efficient and versatile wired-wireless interconnect fabrics for massive heterogeneous processors. To that end, the use of graphene-based antennas and transceivers with unique frequency-beam reconfigurability in the terahertz band is proposed. The feasibility of such a nanonetworking vision and the main research challenges towards its realization are analyzed from the technological, communications, and computer architecture perspectives.
    Comment: 8 pages, 4 figures, 1 table. Accepted at IEEE Wireless Communications Magazine.

    In-memory Vector Symbolic Architectures

    The field of Artificial Intelligence (AI) has achieved enormous progress in the past decade, thanks primarily to deep neural network architectures and to specialized hardware that supports training the models within a reasonable time. Since then, however, a trend has emerged in which, to solve increasingly difficult cognitive tasks, model complexity in terms of the number of parameters, the energy spent on training, and the size of the datasets used for benchmarking have grown steadily every year. Yet the capabilities of each model are limited to a narrow task such as classification or translation. The validity of this approach of building ever bigger and more power-hungry models needs to be critically questioned. For example, the human brain, originally conceived from the tip of a 3-millimeter-long neural tube, slowly grows over a period of approximately 20 years into a device that can perform a wide variety of complex cognitive tasks. It operates on a frugal power budget as low as 20 W and requires fewer examples than an AI model to learn new concepts.

    The emerging brain-inspired computing paradigm known as vector symbolic architecture (VSA) offers interesting avenues to advance the field of AI along the path of how the human brain works. For starters, it requires only a few examples for training and does not entail computationally expensive iterative gradient updates. It does, however, require data to be represented as extremely high-dimensional vectors, with dimensions typically in the order of thousands, certainly larger than the data path of any classical computer. Because these high-dimensional (HD) vectors must be constantly manipulated, the intensity of memory accesses dominates the computational intensity, creating a bottleneck in conventional von Neumann computing architectures. In-memory computing (IMC), on the other hand, is a non-von Neumann computing architecture that can bring down the cost of data movement between the processing unit and the memory unit by keeping the majority of data stationary in memory and executing parallel computations using enhanced peripheral circuits. At the core of IMC, computations are performed on a noisy analog fabric by exploiting the laws of physics; results are therefore not guaranteed to be deterministic and are reproducible only in a probabilistic sense, leading to an interesting set of opportunities but also challenges that need to be carefully considered.

    This doctoral thesis investigates the usability of vector symbolic architectures on IMC hardware. It entails the following key contributions: 1) design of VSA and neuro-VSA architectures for several applications, namely categorical and numerical sequence encoding, few-shot learning, and few-shot continual learning, where each model is engineered by taking the strengths and limitations of the in-memory computing architecture into consideration; 2) simulation of the architectures using IMC models and experiments with IMC prototype chips to characterize model performance in terms of accuracy and robustness; 3) design of full systems with CMOS peripherals complementing the IMC tiles, and estimation of the power, performance, and area of the proposed systems.

    The following is a brief overview of what is presented in each chapter. Chapter 1 provides the background and motivation for implementing VSA models on IMC hardware. Chapter 2 introduces the concept of in-memory hyperdimensional computing, showing how VSA modules can be mapped to IMC hardware, taking categorical symbol sequence classification as an example. Chapter 3 generalizes this idea to include encoding for spatio-temporal signals, which encompass not only categorical but also numerical symbols. To work directly with raw data, Chapter 4 proposes a novel neuro-VSA architecture and training methodology with an application to the few-shot learning problem, backed up with simulation and experimental results from IMC models and real IMC hardware, respectively. Chapter 5 and Chapter 6 present an extended version of the neuro-VSA architecture for the few-shot continual learning problem, with an emphasis on software and hardware innovations, respectively. Finally, Chapter 7 provides conclusions to the thesis and an outlook on future research directions.
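
    For readers unfamiliar with VSA, the sketch below shows the three primitives (binding, bundling, and similarity) that the thesis maps onto IMC hardware, using dense bipolar hypervectors as one common choice; the dimensionality and the toy record are arbitrary illustrative values.

    ```python
    import numpy as np

    D = 10_000
    rng = np.random.default_rng(6)

    def hv():
        """Draw a random bipolar hypervector; such vectors are quasi-orthogonal."""
        return rng.choice([-1, 1], size=D)

    def bind(a, b):
        """Binding: element-wise multiplication; invertible, since bind(bind(a, b), b) == a."""
        return a * b

    def bundle(*vs):
        """Bundling: element-wise majority (sign of the sum) of several hypervectors."""
        return np.sign(np.sum(vs, axis=0))

    def sim(a, b):
        """Normalized dot-product similarity in [-1, 1]."""
        return float(a @ b) / D

    # Encode a toy record {color: red, shape: circle} and query it back.
    color, shape, red, circle = hv(), hv(), hv(), hv()
    record = bundle(bind(color, red), bind(shape, circle))
    print(round(sim(bind(record, color), red), 2))     # ~0.5: the 'color' slot holds 'red'
    print(round(sim(bind(record, color), circle), 2))  # ~0.0: unrelated to 'circle'
    ```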