208 research outputs found

    Axp: A hw-sw co-design pipeline for energy-efficient approximated convnets via associative matching

    Get PDF
    The reduction in energy consumption is key for deep neural networks (DNNs) to ensure usability and reliability, whether they are deployed on low-power end-nodes with limited resources or high-performance platforms that serve large pools of users. Leveraging the over-parametrization shown by many DNN models, convolutional neural networks (ConvNets) in particular, energy efficiency can be improved substantially preserving the model accuracy. The solution proposed in this work exploits the intrinsic redundancy of ConvNets to maximize the reuse of partial arithmetic results during the inference stages. Specifically, the weight-set of a given ConvNet is discretized through a clustering procedure such that the largest possible number of inner multiplications fall into predefined bins; this allows an off-line computation of the most frequent results, which in turn can be stored locally and retrieved when needed during the forward pass. Such a reuse mechanism leads to remarkable energy savings with the aid of a custom processing element (PE) that integrates an associative memory with a standard floating-point unit (FPU). Moreover, the adoption of an approximate associative rule based on a partial bit-match increases the hit rate over the pre-computed results, maximizing the energy reduction even further. Results collected on a set of ConvNets trained for computer vision and speech processing tasks reveal that the proposed associative-based hw-sw co-design achieves up to 77% in energy savings with less than 1% in accuracy loss

    FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture

    Full text link
    Neural Network (NN) accelerators with emerging ReRAM (resistive random access memory) technologies have been investigated as one of the promising solutions to address the \textit{memory wall} challenge, due to the unique capability of \textit{processing-in-memory} within ReRAM-crossbar-based processing elements (PEs). However, the high efficiency and high density advantages of ReRAM have not been fully utilized due to the huge communication demands among PEs and the overhead of peripheral circuits. In this paper, we propose a full system stack solution, composed of a reconfigurable architecture design, Field Programmable Synapse Array (FPSA) and its software system including neural synthesizer, temporal-to-spatial mapper, and placement & routing. We highly leverage the software system to make the hardware design compact and efficient. To satisfy the high-performance communication demand, we optimize it with a reconfigurable routing architecture and the placement & routing tool. To improve the computational density, we greatly simplify the PE circuit with the spiking schema and then adopt neural synthesizer to enable the high density computation-resources to support different kinds of NN operations. In addition, we provide spiking memory blocks (SMBs) and configurable logic blocks (CLBs) in hardware and leverage the temporal-to-spatial mapper to utilize them to balance the storage and computation requirements of NN. Owing to the end-to-end software system, we can efficiently deploy existing deep neural networks to FPSA. Evaluations show that, compared to one of state-of-the-art ReRAM-based NN accelerators, PRIME, the computational density of FPSA improves by 31x; for representative NNs, its inference performance can achieve up to 1000x speedup.Comment: Accepted by ASPLOS 201

    Automated Synthesis of Memristor Crossbar Networks

    Get PDF
    The advancement of semiconductor device technology over the past decades has enabled the design of increasingly complex electrical and computational machines. Electronic design automation (EDA) has played a significant role in the design and implementation of transistor-based machines. However, as transistors move closer toward their physical limits, the speed-up provided by Moore\u27s law will grind to a halt. Once again, we find ourselves on the verge of a paradigm shift in the computational sciences as newer devices pave the way for novel approaches to computing. One of such devices is the memristor -- a resistor with non-volatile memory. Memristors can be used as junctional switches in crossbar circuits, which comprise of intersecting sets of vertical and horizontal nanowires. The major contribution of this dissertation lies in automating the design of such crossbar circuits -- doing a new kind of EDA for a new kind of computational machinery. In general, this dissertation attempts to answer the following questions: a. How can we synthesize crossbars for computing large Boolean formulas, up to 128-bit? b. How can we synthesize more compact crossbars for small Boolean formulas, up to 8-bit? c. For a given loop-free C program doing integer arithmetic, is it possible to synthesize an equivalent crossbar circuit? We have presented novel solutions to each of the above problems. Our new, proposed solutions resolve a number of significant bottlenecks in existing research, via the usage of innovative logic representation and artificial intelligence techniques. For large Boolean formulas (up to 128-bit), we have utilized Reduced Ordered Binary Decision Diagrams (ROBDDs) to automatically synthesize linearly growing crossbar circuits that compute them. This cutting edge approach towards flow-based computing has yielded state-of-the-art results. It is worth noting that this approach is scalable to n-bit Boolean formulas. We have made significant original contributions by leveraging artificial intelligence for automatic synthesis of compact crossbar circuits. This inventive method has been expanded to encompass crossbar networks with 1D1M (1-diode-1-memristor) switches, as well. The resultant circuits satisfy the tight constraints of the Feynman Grand Prize challenge and are able to perform 8-bit binary addition. A leading edge development for end-to-end computation with flow-based crossbars has been implemented, which involves methodical translation of loop-free C programs into crossbar circuits via automated synthesis. The original contributions described in this dissertation reflect the substantial progress we have made in the area of electronic design automation for synthesis of memristor crossbar networks

    Analog Spiking Neuromorphic Circuits and Systems for Brain- and Nanotechnology-Inspired Cognitive Computing

    Get PDF
    Human society is now facing grand challenges to satisfy the growing demand for computing power, at the same time, sustain energy consumption. By the end of CMOS technology scaling, innovations are required to tackle the challenges in a radically different way. Inspired by the emerging understanding of the computing occurring in a brain and nanotechnology-enabled biological plausible synaptic plasticity, neuromorphic computing architectures are being investigated. Such a neuromorphic chip that combines CMOS analog spiking neurons and nanoscale resistive random-access memory (RRAM) using as electronics synapses can provide massive neural network parallelism, high density and online learning capability, and hence, paves the path towards a promising solution to future energy-efficient real-time computing systems. However, existing silicon neuron approaches are designed to faithfully reproduce biological neuron dynamics, and hence they are incompatible with the RRAM synapses, or require extensive peripheral circuitry to modulate a synapse, and are thus deficient in learning capability. As a result, they eliminate most of the density advantages gained by the adoption of nanoscale devices, and fail to realize a functional computing system. This dissertation describes novel hardware architectures and neuron circuit designs that synergistically assemble the fundamental and significant elements for brain-inspired computing. Versatile CMOS spiking neurons that combine integrate-and-fire, passive dense RRAM synapses drive capability, dynamic biasing for adaptive power consumption, in situ spike-timing dependent plasticity (STDP) and competitive learning in compact integrated circuit modules are presented. Real-world pattern learning and recognition tasks using the proposed architecture were demonstrated with circuit-level simulations. A test chip was implemented and fabricated to verify the proposed CMOS neuron and hardware architecture, and the subsequent chip measurement results successfully proved the idea. The work described in this dissertation realizes a key building block for large-scale integration of spiking neural network hardware, and then, serves as a step-stone for the building of next-generation energy-efficient brain-inspired cognitive computing systems
    • …
    corecore