
    All-rounder: A flexible DNN accelerator with diverse data format support

    Recognizing the explosive growth of DNN-based applications, several companies have developed custom ASICs (e.g., Google TPU, IBM RaPiD, Intel NNP-I/NNP-T) and built hyperscale cloud infrastructures around them. These ASICs execute the inference or training operations of the DNN models requested by users. Because DNN models use different data formats and types of operations, the ASIC must support diverse data formats and remain general across operations, requirements that conventional ASICs do not fulfill. To overcome these limitations, we propose a flexible DNN accelerator called All-rounder. The accelerator is built around an area-efficient multiplier that supports multiple integer and floating-point precisions, and it incorporates a flexibly fusible and fissionable MAC array to support various types of DNN operations efficiently. We implemented the register transfer level (RTL) design in Verilog and synthesized it in 28nm CMOS technology. To examine the practical effectiveness of our proposed designs, we designed two multiply units and three state-of-the-art DNN accelerators as baselines. We compare our multiplier with these multiply units and perform an architectural evaluation of performance and energy efficiency on eight real-world DNN models. Furthermore, we compare the benefits of the All-rounder accelerator to a high-end GPU card, i.e., an NVIDIA GeForce RTX 3090. The proposed All-rounder accelerator consistently achieves higher speedup and energy efficiency than the baselines across the DNN benchmarks.
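    As a rough illustration of the precision-flexible multiplication the abstract refers to, the Python sketch below composes an 8-bit multiply from 4-bit sub-products, the basic decomposition that lets a single multiplier array be fused for wide operands or split for narrow ones. It is a conceptual sketch only, not the All-rounder datapath, and the bit widths are chosen purely for illustration.

        # Conceptual sketch (not the All-rounder RTL): build a wider unsigned
        # multiply out of narrow sub-products, the decomposition that lets a
        # multiplier array be fused for high precision or split for low precision.
        def mul8_from_4bit_tiles(a, b):
            a_hi, a_lo = a >> 4, a & 0xF          # split operands into 4-bit nibbles
            b_hi, b_lo = b >> 4, b & 0xF
            # Four 4x4 partial products, shifted to their binary weight and summed.
            return ((a_hi * b_hi) << 8) + ((a_hi * b_lo) << 4) \
                 + ((a_lo * b_hi) << 4) + (a_lo * b_lo)

        # Sanity check against the exact product for all 8-bit unsigned operands.
        assert all(mul8_from_4bit_tiles(a, b) == a * b
                   for a in range(256) for b in range(256))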

    LightNorm: Area and Energy-Efficient Batch Normalization Hardware for On-Device DNN Training

    When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupies most of the execution time. Accordingly, extensive research has been done to reduce the computational burden of these layers. In recent mobile-friendly DNNs, however, the relative number of operations involved in processing these layers has decreased significantly, so the share of execution time spent in other layers, such as batch normalization layers, has grown. In this work, we therefore conduct a detailed analysis of the batch normalization layer to efficiently reduce its runtime overhead. Backed by this analysis, we present an extremely efficient batch normalization, named LightNorm, and its associated hardware module. In more detail, we fuse three approximation techniques: i) low bit-precision, ii) range batch normalization, and iii) block floating point. These approximation techniques are carefully applied not only to maintain the statistics of intermediate feature maps, but also to minimize off-chip memory accesses. Using the proposed LightNorm hardware, we achieve significant area and energy savings during DNN training without hurting training accuracy, which makes the proposed hardware a strong candidate for on-device training. (To appear in the 40th IEEE International Conference on Computer Design, ICCD.)
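    To make the second of the three approximations concrete, the sketch below implements range batch normalization in NumPy, where the per-feature standard deviation is approximated from the min-max range of the centered batch; the scaling constant follows the commonly used Gaussian range estimate. This is an illustrative software model under those assumptions, not LightNorm's hardware datapath or its block-floating-point arithmetic.

        # Illustrative software model of range batch normalization (not the
        # LightNorm hardware): approximate the standard deviation from the
        # min-max range of the centered batch instead of computing a variance.
        import numpy as np

        def range_batchnorm(x, gamma, beta, eps=1e-5):
            # x: (batch, features); gamma, beta: per-feature scale and shift.
            mu = x.mean(axis=0)
            centered = x - mu
            n = x.shape[0]
            # For Gaussian data, E[max - min] is roughly sqrt(2 ln n) * sigma,
            # so the scaled range approximates the standard deviation.
            scale = 1.0 / np.sqrt(2.0 * np.log(n))
            approx_std = scale * (centered.max(axis=0) - centered.min(axis=0))
            return gamma * centered / (approx_std + eps) + beta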

    Defending Against Flush+Reload Attack With DRAM Cache by Bypassing Shared SRAM Cache

    Cache side-channel attacks are among the critical security threats to modern computing systems. As a representative cache side-channel attack, the Flush+Reload attack allows an attacker to steal confidential information (e.g., a private encryption key) by monitoring a victim's cache access patterns while the victim generates the confidential values. Meanwhile, to provide high performance for memory-intensive applications that do not fit in the on-chip SRAM-based last-level cache (e.g., the L3 cache), modern computing systems have started to deploy a DRAM cache between the SRAM-based last-level cache and the main memory DRAM, which can provide low latency and/or high bandwidth. In this work, however, we propose an approach called ByCA that exploits the DRAM cache for security rather than performance. ByCA bypasses the shared L3 cache when accessing cache blocks suspected to be an attacker's target blocks. Consequently, ByCA eliminates the timing difference the attacker observes when reloading the target cache blocks, nullifying Flush+Reload attacks. To this end, ByCA keeps cache blocks suspected to be the attacker's targets in the L4 DRAM cache, along with their states (i.e., flushed by clflush or not), even when clflush is executed; ByCA re-defines and re-implements the clflush instruction so that it does not flush cache blocks from the L4 DRAM cache while still flushing them from the other cache levels (i.e., the L1, L2, and L3 caches). In addition, ByCA bypasses the L3 cache when the attacker or the victim accesses target blocks flushed by clflush, so the attacker always obtains the blocks from the L4 DRAM cache regardless of the victim's access patterns. The timing difference therefore disappears, and the attacker cannot monitor the victim's cache access patterns. For the L4 DRAM cache, we implement the Alloy Cache design and use an unused bit in each block's tag entry to store its state. ByCA only requires a single-bit extension to cache blocks in the L1 and L2 private caches and a tag entry for each block in the L4 DRAM cache. Our experimental results show that ByCA completely eliminates the timing differences when the attacker reloads the target blocks. Furthermore, ByCA shows no performance degradation for the victim while co-running with an attacker that flushes and reloads target blocks repeatedly.
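    The toy model below illustrates, with made-up cycle counts, why the bypass described above closes the Flush+Reload channel: in the baseline, the attacker's reload latency depends on whether the victim touched the block, whereas with the suspected block kept in the L4 DRAM cache and the L3 bypassed, the latency is constant. It is a behavioral sketch only, not the proposed microarchitecture.

        # Behavioral sketch (not RTL) of the reload-latency argument above.
        # Cycle counts are arbitrary placeholders.
        L3_HIT, L4_HIT, DRAM_MISS = 40, 90, 300

        def reload_latency(victim_accessed, byca_enabled):
            if byca_enabled:
                # The suspected block stayed in the L4 DRAM cache despite clflush,
                # and accesses to it bypass L3, so the latency is constant.
                return L4_HIT
            # Baseline: clflush removed the block everywhere, so the reload time
            # leaks whether the victim brought it back into the cache hierarchy.
            return L3_HIT if victim_accessed else DRAM_MISS

        # With the bypass, the attacker sees the same latency in both cases.
        assert reload_latency(True, True) == reload_latency(False, True)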

    ReplaceNet: real-time replacement of a biological neural circuit with a hardware-assisted spiking neural network

    Recent developments in artificial neural networks and their learning algorithms have enabled new research directions in computer vision, language modeling, and neuroscience. Among various neural network algorithms, spiking neural networks (SNNs) are well-suited for understanding the behavior of biological neural circuits. In this work, we propose to guide the training of a sparse SNN in order to replace a sub-region of a cultured hippocampal network with limited hardware resources. To verify our approach with a realistic experimental setup, we record spikes of cultured hippocampal neurons with a microelectrode array (in vitro). The main focus of this work is to cut unimportant synapses on the fly during SNN training so that the model can be realized on resource-constrained hardware, e.g., implantable devices. To do so, we adopt a simple STDP learning rule to easily select the important synapses that determine the quality of spike-timing learning. By combining the STDP rule with online supervised learning, we can precisely predict the spike pattern of the cultured network in real time. The reduction in model complexity, i.e., the reduced number of connections, significantly reduces the required hardware resources, which is crucial for developing an implantable chip for the treatment of neurological disorders. In addition to the new learning algorithm, we prototype sparse SNN hardware on a small FPGA with pipelined execution and parallel computing to verify the possibility of real-time replacement. As a result, we can replace a sub-region of the biological neural circuit within 22 μs using 2.5× fewer hardware resources, i.e., by allowing 80% sparsity in the SNN model, compared to the fully-connected SNN model. With energy-efficient algorithms and hardware, this work presents an essential step toward real-time neuroprosthetic computation.
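    A minimal sketch of the two ingredients named above, a pair-based STDP update for scoring synapses and magnitude-based pruning to a target sparsity, is given below. The learning rates, time constant, and the 80% sparsity figure are illustrative assumptions; this is not the authors' FPGA implementation or their exact learning rule.

        # Minimal sketch: pair-based STDP weight update plus pruning to a target
        # sparsity. All constants are illustrative assumptions.
        import numpy as np

        def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
            # w: synaptic weight; t_pre/t_post: last spike times (ms) of the
            # pre- and post-synaptic neurons.
            dt = t_post - t_pre
            if dt > 0:                      # pre fired before post: potentiate
                w += a_plus * np.exp(-dt / tau)
            else:                           # post fired before (or with) pre: depress
                w -= a_minus * np.exp(dt / tau)
            return w

        def prune_to_sparsity(weights, sparsity=0.8):
            # Keep only the strongest synapses; zeroed connections need no
            # hardware resources when the model is mapped to an FPGA.
            threshold = np.quantile(np.abs(weights), sparsity)
            return np.where(np.abs(weights) >= threshold, weights, 0.0)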

    Simplified Compressor and Encoder Designs for Low-Cost Approximate Radix-4 Booth Multiplier

    In this brief, we present a novel design methodology for cost-effective approximate radix-4 Booth multipliers, which can significantly reduce the power consumption of error-resilient signal processing tasks. Whereas prior studies focus only on approximating either the partial product generation in the encoders or the partial product reduction in the compressors, the proposed method considers the two processing steps jointly by forcing the generated errors to point in opposite directions. As the internal errors are naturally balanced to have zero mean, the proposed approximate Booth multiplier minimizes the required processing energy for the same number of approximate bits compared to previous designs. Simulation results on FIR filtering and image classification applications reveal that the proposed approximate Booth multiplier offers the most attractive energy-performance trade-offs, achieving 28% and 34% energy reductions, respectively, compared to the exact Booth multiplier with negligible accuracy loss.
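    For readers unfamiliar with the two steps being approximated, the sketch below shows exact radix-4 Booth partial-product generation: the encoder maps overlapping 3-bit groups of the multiplier to digits in {-2, -1, 0, +1, +2}, and the resulting partial products are then summed (in hardware, reduced by compressor trees). The proposed approximate encoders and compressors are not reproduced here.

        # Exact radix-4 Booth multiplication for reference (the paper approximates
        # both the encoding and the compression steps; this sketch does neither).
        BOOTH_DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                       0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

        def booth_radix4_multiply(a, b, bits=8):
            # a, b: signed integers representable in 'bits' bits.
            b_u = b & ((1 << bits) - 1)       # two's-complement view of the multiplier
            product = 0
            prev = 0                          # implicit bit to the right of b[0]
            for i in range(0, bits, 2):
                # Group bits (b[i+1], b[i], b[i-1]) and look up the Booth digit.
                group = ((b_u >> i) & 0b11) << 1 | prev
                prev = (b_u >> (i + 1)) & 1
                product += (BOOTH_DIGIT[group] * a) << i
            return product

        assert booth_radix4_multiply(93, -57) == 93 * -57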

    Towards Scalable Analytics with Inference-Enabled Solid-State Drives

    In this paper, we propose a novel storage architecture, called an Inference-Enabled SSD (IESSD), which employs FPGA-based DNN inference accelerators inside an SSD. IESSD is capable of performing DNN operations inside the SSD, avoiding frequent data movements between application servers and data storage, which boosts the analytics performance of DNN applications. Moreover, by placing accelerators near the data within an SSD, IESSD delivers scalable analytics performance that improves with the amount of data to analyze. To evaluate its effectiveness, we implement an FPGA-based proof-of-concept prototype of IESSD and carry out a case study with an image tagging (classification) application. Our preliminary results show that IESSD exhibits 1.81x better performance and 5.31x lower power consumption than a conventional system with GPU accelerators.

    Design and Analysis of Approximate Compressors for Balanced Error Accumulation in MAC Operator

    In this paper, we present a novel approximate computing scheme suitable for realizing energy-efficient multiply-accumulate (MAC) processing. In contrast to prior works that suffer from error accumulation, which limits the approximate range, we utilize different approximate multipliers in an interleaved way so that errors in opposite directions compensate each other during the accumulate operations. For balanced error accumulation, we first design approximate 4-2 compressors that generate errors in opposite directions while minimizing the computational costs. Based on a probabilistic analysis, positive and negative multipliers are then carefully developed to provide similar error distances. Simulation results on various practical applications reveal that the proposed MAC processing enables energy-efficient computing by extending the range of the approximate parts. Compared to state-of-the-art solutions, for example, the proposed interleaving scheme reduces the core-level energy consumption of a recent CNN accelerator by more than 35% without degrading recognition accuracy.
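    The sketch below illustrates the interleaving idea with two toy truncation-based multipliers, one biased to overestimate and one to underestimate, used alternately so that their errors tend to cancel over an accumulation instead of growing in one direction. The truncation "multipliers" are placeholders for the paper's positive and negative approximate designs, not the proposed 4-2 compressors.

        # Toy sketch of interleaved error compensation in a MAC loop.
        def approx_mul_pos(a, b, cut=4):
            # Drop the low 'cut' bits and round up: error is non-negative.
            return ((a * b + ((1 << cut) - 1)) >> cut) << cut

        def approx_mul_neg(a, b, cut=4):
            # Drop the low 'cut' bits toward zero: error is non-positive
            # for non-negative inputs.
            return ((a * b) >> cut) << cut

        def interleaved_mac(pairs):
            # Alternate the positively and negatively biased multipliers so
            # their errors tend to cancel across the accumulation.
            acc = 0
            for i, (a, b) in enumerate(pairs):
                acc += approx_mul_pos(a, b) if i % 2 == 0 else approx_mul_neg(a, b)
            return acc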

    Adaptive input-to-neuron interlink development in training of spike-based liquid state machines

    In this paper, we present a novel approach to developing input-to-neuron interlinks that achieves better accuracy in spike-based liquid state machines (LSMs). Energy-efficient spiking neural networks suffer from lower accuracy in image classification than deep learning models. Previous LSM models randomly connect input neurons to the excitatory neurons in the liquid, which limits the expressive power of the liquid because a large portion of the excitatory neurons remain inactive and never fire. To overcome this limitation, we propose an adaptive interlink development method that achieves 3.2% higher classification accuracy than a static LSM model with 3,200 neurons. Our hardware implementation on an FPGA also improves performance by 3.16∼4.99× or 1.47∼3.95× over CPU and GPU implementations, respectively.
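    A minimal sketch of the adaptive idea described above is shown below: after an observation window, excitatory neurons that never fired receive a few new input links so they start contributing to the liquid state, instead of keeping a fixed random input projection. The silence criterion (zero spikes) and the number of links grown are illustrative assumptions rather than the paper's settings.

        # Illustrative sketch: grow input-to-neuron interlinks toward silent
        # excitatory neurons in the liquid.
        import numpy as np

        def adapt_interlinks(in_to_liquid, spike_counts, rng, new_links=2):
            # in_to_liquid: (num_inputs, num_excitatory) binary connection matrix.
            # spike_counts: spikes emitted by each excitatory neuron in the window.
            silent = np.where(spike_counts == 0)[0]
            for neuron in silent:
                # Attach a few randomly chosen input channels to each silent
                # neuron so it receives drive and can start firing.
                sources = rng.choice(in_to_liquid.shape[0], size=new_links,
                                     replace=False)
                in_to_liquid[sources, neuron] = 1
            return in_to_liquid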

    Adaptive Precision Cellular Nonlinear Network


    AutoRelax: HW-SW Co-Optimization for Efficient SpGEMM Operations with Automated Relaxation in Deep Learning

    We propose an HW-SW co-optimization technique to perform energy-efficient SpGEMM operations for deep learning. First, we present an automated pruning algorithm, named AutoRelax, that allows some level of relaxation to achieve a higher compression ratio. Since the benefit of the proposed pruning algorithm may be limited by the sparsity level of a given weight matrix, we present additional steps to further improve its efficiency. Along with the software approach, we also present a hardware architecture for processing sparse GEMM operations that maximizes the benefit of the proposed pruning algorithm and sparse matrix format. To validate the efficiency of our co-optimization methodology, we evaluated the proposed method on three benchmarks: language modeling, speech recognition, and image classification. As a result, our approach improved the on-chip performance of SpGEMM operations by 9.50∼27.57% and achieved energy reductions of 15.35∼33.28% relative to other sparse accelerators, considering DRAM accesses.
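    In the spirit of the relaxation described above, the sketch below keeps raising a magnitude-pruning threshold while the validation loss stays within an allowed slack, trading a bounded accuracy relaxation for a higher compression ratio. The evaluate callback, slack value, and threshold schedule are hypothetical stand-ins, not the AutoRelax algorithm itself.

        # Hedged sketch of relaxed magnitude pruning (not AutoRelax itself).
        import numpy as np

        def relaxed_prune(weights, evaluate, slack=0.01, steps=20):
            # weights: dense weight matrix; evaluate(w) -> validation loss.
            baseline = evaluate(weights)
            best = weights.copy()
            for q in np.linspace(0.05, 0.95, steps):
                threshold = np.quantile(np.abs(weights), q)
                candidate = np.where(np.abs(weights) >= threshold, weights, 0.0)
                if evaluate(candidate) - baseline <= slack:
                    best = candidate          # accept the sparser model
                else:
                    break                     # relaxation budget exhausted
            return best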