
    A Scalable Pipelined Dataflow Accelerator for Object Region Proposals on FPGA Platform

    Region proposal is critical for object detection, yet it usually becomes a computational bottleneck on traditional control-flow architectures. We have observed that region proposal tasks are well suited to pipelined parallelism through dataflow-driven acceleration. In this paper, a scalable pipelined dataflow accelerator is proposed for efficient region proposals on the FPGA platform. The accelerator processes image data in a streaming manner through three sequential stages: resizing, kernel computing, and sorting. First, a ping-pong cache strategy is adopted for rotational loading in the resize module to guarantee continuous output streaming. Then, a multi-pipeline architecture with tiered memory is utilized in the kernel computing module to complete the main computation tasks. Finally, a bubble-pushing heap sort method is exploited in the sorting module to find the top-k largest candidates efficiently. Our design is implemented with high-level synthesis on FPGA platforms, and experimental results on the VOC2007 dataset show that it achieves about a 3.67X speedup over a traditional desktop CPU platform and a >250X energy efficiency improvement over an embedded ARM platform. Comment: accepted by FPT 2018 Conference
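
The excerpt does not spell out the bubble-pushing heap sort, but the top-k selection it performs can be sketched in software with a bounded min-heap (a minimal analogue of the sorting stage, not the hardware design; names are hypothetical):

```python
import heapq

def top_k_candidates(scores, k):
    """Keep the k largest scores seen so far in a size-k min-heap.

    Each streaming score is compared against the heap root (the
    smallest retained candidate); larger scores displace it, which
    mimics how a streaming hardware sorter bubbles new values in.
    """
    heap = []
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True)

print(top_k_candidates([0.3, 0.9, 0.1, 0.7, 0.5], k=3))  # [0.9, 0.7, 0.5]
```

Because each incoming score needs at most one compare-and-replace against the root, the stage is naturally pipeline-friendly.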

    Spintronics based Stochastic Computing for Efficient Bayesian Inference System

    Bayesian inference is an effective approach for solving statistical learning problems, especially those involving uncertainty and incompleteness. However, inference efficiency is physically limited by the bottlenecks of conventional computing platforms. In this paper, a Bayesian inference system is proposed that exploits spintronics-based stochastic computing. A stochastic bitstream generator is realized as the kernel component by leveraging the inherent randomness of spintronic devices. The proposed system is evaluated on typical applications of data fusion and Bayesian belief networks. Simulation results indicate that the proposed approach achieves significant improvements in inference efficiency, in terms of both power consumption and inference speed. Comment: accepted by ASPDAC 2018 conference
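
As a rough software analogue of the stochastic computing these devices enable (not the CMOS/MTJ circuit itself): a probability is encoded as the density of 1s in a bitstream, so a single AND gate multiplies two probabilities:

```python
import random

def bitstream(p, n=10000):
    """Unipolar stochastic encoding: each bit is 1 with probability p."""
    return [1 if random.random() < p else 0 for _ in range(n)]

def sc_multiply(p_a, p_b, n=10000):
    """Bitwise AND of two independent streams estimates p_a * p_b."""
    a, b = bitstream(p_a, n), bitstream(p_b, n)
    return sum(x & y for x, y in zip(a, b)) / n

print(sc_multiply(0.8, 0.5))  # close to 0.4
```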

    Hardware Security in Spin-Based Computing-In-Memory: Analysis, Exploits, and Mitigation Techniques

    Computing-in-memory (CIM) has been proposed to alleviate the processor-memory data transfer bottleneck of traditional von Neumann architectures, and spintronics-based magnetic memory has demonstrated many advantages for implementing the CIM paradigm. Since hardware security has become one of the major concerns in circuit design, this paper investigates, for the first time, spin-based computing-in-memory (SpinCIM) from a security perspective. We focus on two fundamental questions: 1) how can the new SpinCIM computing paradigm be exploited to enhance hardware security? 2) what security concerns does this new SpinCIM computing paradigm incur? Comment: accepted by ACM Journal on Emerging Technologies in Computing Systems (JETC)

    ELFISH: Resource-Aware Federated Learning on Heterogeneous Edge Devices

    In this work, we propose ELFISH, a resource-aware federated learning framework that tackles computation stragglers in federated learning. In ELFISH, the training consumption of neural network models is first profiled in terms of different computation resources. Guided by this profiling, a "soft-training" method is proposed for straggler acceleration, which partially trains the model by masking a particular number of resource-intensive neurons. Rather than producing a deterministically optimized model with a diverged structure, different sets of neurons are dynamically masked in every training cycle and are recovered and updated during parameter aggregation, ensuring comprehensive model updates over time. A corresponding parameter aggregation scheme is also proposed to balance the contributions of soft-trained models and guarantee collaborative convergence. Eventually, ELFISH overcomes the computational heterogeneity of edge devices and achieves synchronized collaboration without computational stragglers. Experiments show that ELFISH provides up to 2x training acceleration with soft-training in various straggler settings. Furthermore, benefiting from the proposed parameter aggregation scheme, ELFISH improves model accuracy by 4% with even better collaborative convergence robustness. Comment: 6 pages, 5 figures
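
A minimal sketch of the "soft-training" idea, assuming a single weight matrix and a caller-supplied update rule (all names hypothetical; ELFISH's actual profiling and aggregation are more involved):

```python
import numpy as np

def soft_train_round(weights, train_fn, mask_ratio, rng):
    """One soft-training cycle: mask a random subset of neurons (rows)
    so a straggler trains only the cheaper sub-model; masked rows keep
    their previous values and are reconciled during aggregation."""
    n = weights.shape[0]
    masked = rng.choice(n, int(mask_ratio * n), replace=False)
    active = np.setdiff1d(np.arange(n), masked)
    updated = weights.copy()
    updated[active] = train_fn(weights[active])  # train active neurons only
    return updated, masked

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4))
# Stand-in update rule; a real client would run local SGD here.
w_next, masked_rows = soft_train_round(w, lambda wa: 0.9 * wa,
                                       mask_ratio=0.5, rng=rng)
print(sorted(masked_rows))
```

Masking different rows each cycle is what lets every neuron eventually receive updates despite each round training only a sub-model.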

    SPINBIS: Spintronics based Bayesian Inference System with Stochastic Computing

    Bayesian inference is an effective approach for solving statistical learning problems, especially those involving uncertainty and incompleteness. However, Bayesian inference is a computing-intensive task whose efficiency is physically limited by the bottlenecks of conventional computing platforms. In this work, a spintronics-based stochastic computing approach is proposed for efficient Bayesian inference. The inherent stochastic switching behavior of spintronic devices is exploited to build a stochastic bitstream generator (SBG) for stochastic computing with a hybrid CMOS/MTJ circuit design. To improve inference efficiency, an SBG sharing strategy is leveraged to reduce the required SBG array scale by integrating a switch network between the SBG array and the stochastic computing logic. A device-to-architecture evaluation framework is proposed to assess the performance of the spintronics-based Bayesian inference system (SPINBIS). Experimental results on data fusion applications show that SPINBIS improves energy efficiency by about 12X over an MTJ-based approach, with 45% design area overhead, and by about 26X over an FPGA-based approach. Comment: 14 pages, 26 figures, accepted by IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
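
The SBG sharing strategy can be caricatured in software as one bitstream generator time-multiplexed across several consumers through a switch; a sketch under that assumption (illustrative only, not the paper's CMOS/MTJ design):

```python
import random

class SharedSBG:
    """A single stochastic bitstream generator routed to many
    consumers through a switch network (here, plain function calls)."""
    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def bit(self, p):
        # Emit one bit with P(1) = p for the currently routed consumer.
        return 1 if self.rng.random() < p else 0

sbg = SharedSBG(seed=42)
probs = [0.2, 0.5, 0.9]      # target probabilities of three consumers
n = 2000
estimates = {p: sum(sbg.bit(p) for _ in range(n)) / n for p in probs}
print(estimates)  # each estimate lands close to its target probability
```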

    S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks

    Convolutional neural networks (CNNs) have achieved great success in performing cognitive tasks. However, executing CNNs requires a large amount of computing resources and generates heavy memory traffic, which imposes a severe challenge on computing system design. By optimizing parallel execution and data reuse in convolution, systolic architectures demonstrate great advantages in accelerating CNN computation. However, the regular internal data transmission path of the traditional systolic architecture prevents it from fully leveraging the benefits of neural network sparsity, and deploying fine-grained sparsity on existing systolic architectures is greatly hindered by the computational overhead incurred. In this work, we propose S2Engine, a novel systolic architecture that can fully exploit the sparsity in CNNs with maximized data reuse. S2Engine transmits compressed data internally and allows each processing element to dynamically select aligned data from the compressed dataflow during convolution. Compared with a naive systolic array, S2Engine achieves about 3.2x and 3.0x improvements in speed and energy efficiency, respectively. Comment: 13 pages, 17 figures
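
The index-matching a processing element performs on compressed streams resembles a two-pointer sparse dot product; a minimal software sketch (illustrative only, not the S2Engine microarchitecture):

```python
def sparse_dot(acts, wts):
    """Multiply-accumulate only where both operands are non-zero.

    acts, wts: lists of (index, value) pairs sorted by index, i.e. a
    compressed representation like the streams S2Engine transmits; a
    PE in effect advances two pointers to find aligned indices.
    """
    i = j = 0
    acc = 0.0
    while i < len(acts) and j < len(wts):
        ai, av = acts[i]
        wi, wv = wts[j]
        if ai == wi:
            acc += av * wv
            i += 1
            j += 1
        elif ai < wi:
            i += 1
        else:
            j += 1
    return acc

print(sparse_dot([(0, 2.0), (3, 1.5)], [(3, 4.0), (7, 1.0)]))  # 6.0
```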

    Efficient Computation Reduction in Bayesian Neural Networks Through Feature Decomposition and Memorization

    Bayesian methods are capable of capturing real-world uncertainty and incompleteness and of properly addressing the over-fitting issue faced by deep neural networks. In recent years, Bayesian Neural Networks (BNNs) have drawn tremendous attention from AI researchers and proved successful in many applications. However, their high computational complexity makes BNNs difficult to deploy in computing systems with a limited power budget. In this paper, an efficient BNN inference flow is proposed to reduce the computation cost and is evaluated by means of both software and hardware implementations. A feature decomposition and memorization (DM) strategy is utilized to reform the BNN inference flow in a reduced manner: about half of the computations can be eliminated compared with the traditional approach, as proved by theoretical analysis and software validation. Subsequently, to resolve hardware resource limitations, a memory-friendly computing framework is deployed to reduce the memory overhead introduced by the DM strategy. Finally, we implement our approach in Verilog and synthesize it with the 45 nm FreePDK technology. Hardware simulation results on multi-layer BNNs demonstrate that, compared with the traditional BNN inference method, it provides a 73% energy consumption reduction and a 4x speedup at the expense of 14% area overhead. Comment: accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
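
One plausible reading of the DM strategy (an assumption, not the paper's exact flow): for a Gaussian-weight layer with w = mu + sigma * eps, the deterministic term x @ mu is identical across Monte Carlo samples, so it can be computed once and memorized, leaving only the stochastic term per sample:

```python
import numpy as np

def bnn_layer_mc(x, mu, sigma, n_samples, rng):
    """Monte Carlo forward pass of a Gaussian-weight BNN layer.

    The deterministic term x @ mu is computed once and reused for all
    samples; only x @ (sigma * eps) is recomputed, which removes
    roughly half of the multiplications across samples.
    """
    det = x @ mu                      # computed once, memorized
    outs = []
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        outs.append(det + x @ (sigma * eps))
    return np.stack(outs)

rng = np.random.default_rng(0)
y = bnn_layer_mc(rng.normal(size=(2, 4)), rng.normal(size=(4, 3)),
                 0.1 * np.ones((4, 3)), n_samples=5, rng=rng)
print(y.shape)  # (5, 2, 3)
```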

    Exploiting Spin-Orbit Torque Devices as Reconfigurable Logic for Circuit Obfuscation

    Circuit obfuscation is a frequently used approach for concealing logic functionality in order to prevent reverse engineering attacks on fabricated chips. Efficient obfuscation implementations are expected to have lower design complexity and overhead but higher attack difficulty. In this paper, an obfuscation approach is proposed that leverages spin-orbit torque (SOT) device-based look-up tables (LUTs) as reconfigurable logic to replace carefully selected gates. It is essentially impossible to identify an obfuscated gate built from SOT devices by its physical geometry, because the configured functionality is represented by magnetization states. Such an obfuscation approach further improves circuit security, yielding exponentially high attack complexity. Experiments on the MCNC and ISCAS 85/89 benchmark suites show that the proposed approach reduces the area overhead due to obfuscation by 10% on average. Comment: 14 pages, 21 figures
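
Functionally, each obfuscated gate behaves like a small LUT whose truth table is hidden in device state; a software stand-in (illustrative, not the SOT circuit):

```python
class ReconfigurableLUT:
    """A 2-input LUT standing in for an SOT-based obfuscated gate.

    The 4-bit truth table lives in magnetization state, so the chip
    layout looks identical whether the cell realizes AND, XOR, etc.;
    here the configuration is just a hidden integer.
    """
    def __init__(self, truth_table):
        self._tt = truth_table & 0xF  # configuration, invisible in layout

    def __call__(self, a, b):
        return (self._tt >> ((a << 1) | b)) & 1

xor_gate = ReconfigurableLUT(0b0110)  # bit i is the output for ab = i
print([xor_gate(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```

An attacker reading the netlist sees only a generic LUT cell; without the stored truth table, each replaced gate multiplies the number of candidate functions, which is the source of the exponential attack complexity.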

    NAND-SPIN-Based Processing-in-MRAM Architecture for Convolutional Neural Network Acceleration

    The performance and efficiency of running large-scale datasets on traditional computing systems exhibit critical bottlenecks due to the existing "power wall" and "memory wall" problems. To resolve these problems, processing-in-memory (PIM) architectures have been developed to bring computation logic in or near memory and alleviate bandwidth limitations during data transmission. NAND-like spintronics memory (NAND-SPIN) is a promising kind of magnetoresistive random-access memory (MRAM) with low write energy and high integration density, and it can be employed to perform efficient in-memory computation. In this work, we propose a NAND-SPIN-based PIM architecture for efficient convolutional neural network (CNN) acceleration. A straightforward data mapping scheme is exploited to improve parallelism while reducing data movement. Benefiting from the excellent characteristics of NAND-SPIN and the in-memory processing architecture, experimental results show that the proposed approach achieves about 2.6x speedup and about 1.4x improvement in energy efficiency over state-of-the-art PIM solutions. Comment: 15 pages, accepted by SCIENCE CHINA Information Sciences (SCIS) 202
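
A common way to map CNN layers onto such an architecture, and possibly what the "straightforward data mapping scheme" refers to (an assumption on our part), is to flatten each kernel into a column of the memory array and stream im2col patches as rows, so every kernel's dot product happens in place, in parallel:

```python
import numpy as np

def im2col_conv(x, kernels):
    """Convolution lowered to one matrix product, mirroring a simple
    PIM mapping: kernels become columns of a memory array, patches
    stream in as rows, and accumulation happens "in memory"."""
    kh, kw = kernels.shape[1:3]
    h, w = x.shape
    patches = np.stack([
        x[i:i + kh, j:j + kw].ravel()
        for i in range(h - kh + 1) for j in range(w - kw + 1)
    ])                                              # (n_patches, kh*kw)
    cols = kernels.reshape(kernels.shape[0], -1).T  # (kh*kw, n_kernels)
    return patches @ cols

x = np.arange(16.0).reshape(4, 4)
k = np.ones((2, 3, 3))  # two 3x3 kernels
print(im2col_conv(x, k).shape)  # (4, 2): 4 output positions, 2 kernels
```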

    SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training

    Training Convolutional Neural Networks (CNNs) usually requires a large amount of computational resources. In this paper, SparseTrain is proposed to accelerate CNN training by fully exploiting sparsity. It involves three levels of innovation: an activation gradient pruning algorithm, a sparse training dataflow, and an accelerator architecture. By applying a stochastic pruning algorithm to each layer, the sparsity of back-propagation gradients can be increased dramatically without degrading the training accuracy or convergence rate. Moreover, to utilize both natural sparsity (resulting from ReLU or pooling layers) and artificial sparsity (brought by the pruning algorithm), a sparsity-aware architecture is proposed for training acceleration. This architecture supports the forward and backward propagation of CNNs by adopting a one-dimensional convolution dataflow. We have built a simple compiler to map CNN topologies onto SparseTrain, and a cycle-accurate architecture simulator to evaluate performance and efficiency, based on a design synthesized with 14 nm FinFET technology. Evaluation results on AlexNet/ResNet show that SparseTrain achieves about 2.7x speedup and 2.2x energy efficiency improvement on average compared with the original training process. Comment: published at DAC 202
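
Stochastic pruning of small gradients is often done in an unbiased way: values below a threshold survive with probability proportional to their magnitude and are rescaled on survival. A sketch assuming that variant (the paper's exact algorithm may differ):

```python
import numpy as np

def stochastic_prune(grads, tau, rng):
    """Prune small activation gradients without biasing their mean.

    Values with |g| >= tau pass through unchanged; smaller values
    survive with probability |g| / tau and are rescaled to
    sign(g) * tau, so E[output] == input and convergence is preserved
    while most small gradients become exact zeros.
    """
    keep = np.abs(grads) >= tau
    p = np.abs(grads) / tau
    survive = rng.random(grads.shape) < p
    pruned = np.where(survive, np.sign(grads) * tau, 0.0)
    return np.where(keep, grads, pruned)

rng = np.random.default_rng(0)
g = rng.normal(scale=0.01, size=100_000)
g_sparse = stochastic_prune(g, tau=0.02, rng=rng)
print((g_sparse == 0).mean())                    # high sparsity
print(abs(g.mean() - g_sparse.mean()) < 1e-3)    # mean is preserved
```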