168 research outputs found

    Optimization Opportunities in RRAM-based FPGA Architectures

    Get PDF
    Static Random Access Memory (SRAM)-based routing multiplexers, whatever structure is employed, share a common limitation: their area, delay and power increase linearly with the input size. This property results in most SRAM-based FPGA architectures typically avoiding the use of large multiplexers. Resistive Random Access Memory (RRAM)- based multiplexers, built with one-level structure, have a unique advantage over SRAM-based multiplexers: their ideal delay is independent from the input size. This property allows RRAM-based FPGA architectures to use larger multiplexers than their SRAM-based counterparts, without generating any delay overhead. In this paper, by carefully considering the properties of RRAM multiplexers, we assess that current state-of-art architectural parameters for SRAM-based FPGAs cannot preserve optimality in the context of RRAM-based FPGAs. As a result, we propose that in RRAM-based FPGAs, (a) the routing tracks should be interconnected to Look-Up Table (LUT) inputs via a one-level crossbar, instead of through Connection Blocks and local routing; (b) the Switch Blocks should employ larger multiplexers; (c) length-2 wires should be used instead of length-4 wires. When operated in nominal voltage, the proposed RRAM-based FPGA architecture reduces area by 26%, delay by 39% and channel width by 13%, as compared to a SRAM-based FPGA with a classical architecture. When operated in the near-Vt regime, the proposed RRAM-based FPGA architecture improves Area-Delay Product by 42% and Power-Delay Product by 5x as compared to a classical SRAM-based FPGA at nominal voltage

    Simulation and implementation of novel deep learning hardware architectures for resource constrained devices

    Get PDF
    Corey Lammie designed mixed signal memristive-complementary metal–oxide–semiconductor (CMOS) and field programmable gate arrays (FPGA) hardware architectures, which were used to reduce the power and resource requirements of Deep Learning (DL) systems; both during inference and training. Disruptive design methodologies, such as those explored in this thesis, can be used to facilitate the design of next-generation DL systems

    A Ultra-Low-Power FPGA Based on Monolithically Integrated RRAMs (invited)

    Get PDF
    Field Programmable Gate Arrays (FPGAs) rely heavily on complex routing architectures. The routing structures use programmable switches and account for a significant share in the total area, delay and power consumption numbers. With the ability of being monolithically integrated with CMOS chips, Resistive Random Access Memories (RRAMs) enable high-performance routing architectures through the replacement of Static Random Access Memory (SRAM)-based programming switches. Exploiting the very low on-resistance state achievable by RRAMs as well as the improved tolerance to power supply reduction, RRAM-based routing multiplexers can be used to significantly reduce the power consumption of FPGA systems with no performance compromises. By evaluating the opportunities of ultra-low-power RRAM-based FPGAs at the system level, we see an improvement of 12%, 26% and 81% in area, delay and power consumption at a mature technology node

    Hardware Implementation of Deep Network Accelerators Towards Healthcare and Biomedical Applications

    Get PDF
    With the advent of dedicated Deep Learning (DL) accelerators and neuromorphic processors, new opportunities are emerging for applying deep and Spiking Neural Network (SNN) algorithms to healthcare and biomedical applications at the edge. This can facilitate the advancement of the medical Internet of Things (IoT) systems and Point of Care (PoC) devices. In this paper, we provide a tutorial describing how various technologies ranging from emerging memristive devices, to established Field Programmable Gate Arrays (FPGAs), and mature Complementary Metal Oxide Semiconductor (CMOS) technology can be used to develop efficient DL accelerators to solve a wide variety of diagnostic, pattern recognition, and signal processing problems in healthcare. Furthermore, we explore how spiking neuromorphic processors can complement their DL counterparts for processing biomedical signals. After providing the required background, we unify the sparsely distributed research on neural network and neuromorphic hardware implementations as applied to the healthcare domain. In addition, we benchmark various hardware platforms by performing a biomedical electromyography (EMG) signal processing task and drawing comparisons among them in terms of inference delay and energy. Finally, we provide our analysis of the field and share a perspective on the advantages, disadvantages, challenges, and opportunities that different accelerators and neuromorphic processors introduce to healthcare and biomedical domains. This paper can serve a large audience, ranging from nanoelectronics researchers, to biomedical and healthcare practitioners in grasping the fundamental interplay between hardware, algorithms, and clinical adoption of these tools, as we shed light on the future of deep networks and spiking neuromorphic processing systems as proponents for driving biomedical circuits and systems forward.Comment: Submitted to IEEE Transactions on Biomedical Circuits and Systems (21 pages, 10 figures, 5 tables

    Circuit Design, Architecture and CAD for RRAM-based FPGAs

    Get PDF
    Field Programmable Gate Arrays (FPGAs) have been indispensable components of embedded systems and datacenter infrastructures. However, energy efficiency of FPGAs has become a hard barrier preventing their expansion to more application contexts, due to two physical limitations: (1) The massive usage of routing multiplexers causes delay and power overheads as compared to ASICs. To reduce their power consumption, FPGAs have to operate at low supply voltage but sacrifice performance because the transistors drive degrade when working voltage decreases. (2) Using volatile memory technology forces FPGAs to lose configurations when powered off and to be reconfigured at each power on. Resistive Random Access Memories (RRAMs) have strong potentials in overcoming the physical limitations of conventional FPGAs. First of all, RRAMs grant FPGAs non-volatility, enabling FPGAs to be "Normally powered off, Instantly powered on". Second, by combining functionality of memory and pass-gate logic in one unique device, RRAMs can greatly reduce area and delay of routing elements. Third, when RRAMs are embedded into datpaths, the performance of circuits can be independent from their working voltage, beyond the limitations of CMOS circuits. However, researches and development of RRAM-based FPGAs are in their infancy. Most of area and performance predictions were achieved without solid circuit-level simulations and sophisticated Computer Aided Design (CAD) tools, causing the predicted improvements to be less convincing. In this thesis,we present high-performance and low-power RRAM-based FPGAs fromtransistorlevel circuit designs to architecture-level optimizations and CAD tools, using theoretical analysis, industrial electrical simulators and novel CAD tools. We believe that this is the first systematic study in the field, covering: From a circuit design perspective, we propose efficient RRAM-based programming circuits and routing multiplexers through both theoretical analysis and electrical simulations. The proposed 4T(ransitor)1R(RAM) programming structure demonstrates significant improvements in programming current, when compared to most popular 2T1R programming structure. 4T1R-based routingmultiplexer designs are proposed by considering various physical design parasitics, such as intrinsic capacitance of RRAMs and wells doping organization. The proposed 4T1R-based multiplexers outperformbest CMOS implementations significantly in area, delay and power at both nominal and near-Vt regime. From a CAD perspective, we develop a generic FPGA architecture exploration tool, FPGASPICE, modeling a full FPGA fabric with SPICE and Verilog netlists. FPGA-SPICE provides different levels of testbenches and techniques to split large SPICE netlists, in order to obtain better trade-off between simulation time and accuracy. FPGA-SPICE can capture area and power characteristics of SRAM-based and RRAM-based FPGAs more accurately than the currently best analyticalmodels. From an architecture perspective, we propose architecture-level optimizations for RRAMbased FPGAs and quantify their minimumrequirements for RRAM devices. Compared to the best SRAM-based FPGAs, an optimized RRAM-based FPGA architecture brings significant reduction in area, delay and power respectively. In particular, RRAM-based FPGAs operating in the near-Vt regime demonstrate a 5x power improvement without delay overhead as compared to optimized SRAM-based FPGA operating at nominal working voltage

    SPRING: A Sparsity-Aware Reduced-Precision Monolithic 3D CNN Accelerator Architecture for Training and Inference

    Full text link
    CNNs outperform traditional machine learning algorithms across a wide range of applications. However, their computational complexity makes it necessary to design efficient hardware accelerators. Most CNN accelerators focus on exploring dataflow styles that exploit computational parallelism. However, potential performance speedup from sparsity has not been adequately addressed. The computation and memory footprint of CNNs can be significantly reduced if sparsity is exploited in network evaluations. To take advantage of sparsity, some accelerator designs explore sparsity encoding and evaluation on CNN accelerators. However, sparsity encoding is just performed on activation or weight and only in inference. It has been shown that activation and weight also have high sparsity levels during training. Hence, sparsity-aware computation should also be considered in training. To further improve performance and energy efficiency, some accelerators evaluate CNNs with limited precision. However, this is limited to the inference since reduced precision sacrifices network accuracy if used in training. In addition, CNN evaluation is usually memory-intensive, especially in training. In this paper, we propose SPRING, a SParsity-aware Reduced-precision Monolithic 3D CNN accelerator for trainING and inference. SPRING supports both CNN training and inference. It uses a binary mask scheme to encode sparsities in activation and weight. It uses the stochastic rounding algorithm to train CNNs with reduced precision without accuracy loss. To alleviate the memory bottleneck in CNN evaluation, especially in training, SPRING uses an efficient monolithic 3D NVM interface to increase memory bandwidth. Compared to GTX 1080 Ti, SPRING achieves 15.6X, 4.2X and 66.0X improvements in performance, power reduction, and energy efficiency, respectively, for CNN training, and 15.5X, 4.5X and 69.1X improvements for inference
    • …
    corecore