102 research outputs found

A Scalable Pipelined Dataflow Accelerator for Object Region Proposals on FPGA Platform

Author: Chen Yiran
Dai Pengcheng
Fu Wenzhi
Yang Jianlei
Zhao Weisheng
Publication date: 26/10/2018

Region proposal is critical for object detection, but it usually poses a bottleneck in improving computation efficiency on traditional control-flow architectures. We have observed that region proposal tasks are potentially suitable for pipelined parallelism through dataflow-driven acceleration. In this paper, a scalable pipelined dataflow accelerator is proposed for efficient region proposals on an FPGA platform. The accelerator processes image data in a streaming manner with three sequential stages: resizing, kernel computing, and sorting. First, a ping-pong cache strategy is adopted for rotation loading in the resize module to guarantee continuous output streaming. Then, a multiple-pipeline architecture with tiered memory is utilized in the kernel computing module to complete the main computation tasks. Finally, a bubble-pushing heap sort method is exploited in the sorting module to find the top-k largest candidates efficiently. Our design is implemented with high-level synthesis on FPGA platforms, and experimental results on the VOC2007 dataset show that it achieves about a 3.67X speedup over a traditional desktop CPU platform and a >250X energy efficiency improvement over an embedded ARM platform. Comment: accepted by FPT 2018 Conference

arXiv.org e-Print Archive
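
The sorting stage described in the abstract above selects the top-k largest candidates from a stream of scores. As a rough software analogue only, and not the paper's bubble-pushing heap sort hardware, the sketch below keeps a bounded min-heap of size k while scanning the stream once. The function name `top_k_largest` and the sample scores are illustrative assumptions.

```python
import heapq


def top_k_largest(scores, k):
    """Return the k largest scores in descending order.

    A min-heap of at most k items is kept while streaming through the
    scores, so memory stays O(k) regardless of the stream length.
    """
    if k <= 0:
        return []
    heap = []
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            # The new score beats the smallest kept score: swap it in.
            heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True)


# Hypothetical candidate scores, for illustration only.
candidates = [0.12, 0.87, 0.45, 0.91, 0.33, 0.78]
print(top_k_largest(candidates, 3))
```

Keeping only k items at a time is what makes this shape attractive for streaming designs: each incoming score costs at most one comparison against the current minimum plus an O(log k) heap update.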

Spintronics based Stochastic Computing for Efficient Bayesian Inference System

Author: Chen Yiran
Hai
Jia Xiaotao
Li
Wang Zhaohao
Yang Jianlei
Zhao Weisheng
Publication date: 03/11/2017

Bayesian inference is an effective approach for solving statistical learning problems, especially with uncertainty and incompleteness. However, inference efficiency is physically limited by the bottlenecks of conventional computing platforms. In this paper, an emerging Bayesian inference system is proposed by exploiting spintronics-based stochastic computing. A stochastic bitstream generator is realized as the kernel component by leveraging the inherent randomness of spintronic devices. The proposed system is evaluated on typical applications of data fusion and Bayesian belief networks. Simulation results indicate that the proposed approach could achieve significant improvement in inference efficiency in terms of power consumption and inference speed. Comment: accepted by ASPDAC 2018 conference

arXiv.org e-Print Archive
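
Stochastic computing, as referenced in the abstract above, encodes a probability as the fraction of ones in a random bitstream, so that a bitwise AND of two independent streams estimates the product of their probabilities. The sketch below illustrates that general principle in software only. A seeded pseudo-random generator stands in for the spintronic bitstream generator, and the names `bitstream` and `sc_multiply` are assumptions, not taken from the paper.

```python
import random


def bitstream(p, length, rng):
    """Generate a list of bits where each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def sc_multiply(p_a, p_b, length=20000, seed=0):
    """Estimate p_a * p_b by AND-ing two independent stochastic bitstreams."""
    rng = random.Random(seed)
    a = bitstream(p_a, length, rng)
    b = bitstream(p_b, length, rng)
    ones = sum(x & y for x, y in zip(a, b))
    return ones / length


# The estimate approaches 0.5 * 0.4 = 0.2 as the stream length grows.
estimate = sc_multiply(0.5, 0.4)
print(abs(estimate - 0.2) < 0.02)
```

The accuracy of the estimate depends on stream length and on the two streams being uncorrelated, which is why the quality of the bitstream generator matters in such systems.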

Hardware Security in Spin-Based Computing-In-Memory: Analysis, Exploits, and Mitigation Techniques

Author: Jia Xiaotao
Qu Gang
Wang Xueyan
Yang Jianlei
Zhao Weisheng
Zhao Yinglin
Publication date: 02/06/2020

Computing-in-memory (CIM) is proposed to alleviate the processor-memory data transfer bottleneck in traditional Von-Neumann architectures, and spintronics-based magnetic memory has demonstrated many advantages in implementing the CIM paradigm. Since hardware security has become one of the major concerns in circuit design, this paper, for the first time, investigates spin-based computing-in-memory (SpinCIM) from a security perspective. We focus on two fundamental questions: 1) how can the new SpinCIM computing paradigm be exploited to enhance hardware security? 2) what security concerns has this new SpinCIM computing paradigm incurred? Comment: accepted by ACM Journal on Emerging Technologies in Computing Systems (JETC)

arXiv.org e-Print Archive

ELFISH: Resource-Aware Federated Learning on Heterogeneous Edge Devices

Author: Chen Xiang
Xiong Jinjun
Xu Zirui
Yang Jianlei
Yang Zhao
Publication date: 03/12/2019

In this work, we propose ELFISH - a resource-aware federated learning framework to tackle computation stragglers in federated learning. In ELFISH, the training consumption of neural network models is first profiled in terms of different computation resources. Guided by profiling, a "soft-training" method is proposed for straggler acceleration, which partially trains the model by masking a particular number of resource-intensive neurons. Rather than generating a deterministically optimized model with diverged structure, different sets of neurons are dynamically masked in every training cycle and are recovered and updated during parameter aggregation, ensuring comprehensive model updates over time. A corresponding parameter aggregation scheme is also proposed to balance the contribution from soft-trained models and guarantee collaborative convergence. Eventually, ELFISH overcomes the computational heterogeneity of edge devices and achieves synchronized collaboration without computational stragglers. Experiments show that ELFISH can provide up to 2x training acceleration with soft-training in various straggler settings. Furthermore, benefiting from the proposed parameter aggregation scheme, ELFISH improves the model accuracy by 4% with even better collaborative convergence robustness. Comment: 6 pages, 5 figures

arXiv.org e-Print Archive

SPINBIS: Spintronics based Bayesian Inference System with Stochastic Computing

Author: Chen Yiran
Dai Pengcheng
Jia Xiaotao
Liu Runze
Yang Jianlei
Zhao Weisheng
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 18/02/2019

Bayesian inference is an effective approach for solving statistical learning problems, especially with uncertainty and incompleteness. However, Bayesian inference is a computing-intensive task whose efficiency is physically limited by the bottlenecks of conventional computing platforms. In this work, a spintronics-based stochastic computing approach is proposed for efficient Bayesian inference. The inherent stochastic switching behaviors of spintronic devices are exploited to build a stochastic bitstream generator (SBG) for stochastic computing with a hybrid CMOS/MTJ circuit design. Aiming to improve the inference efficiency, an SBG sharing strategy is leveraged to reduce the required SBG array scale by integrating a switch network between the SBG array and the stochastic computing logic. A device-to-architecture level framework is proposed to evaluate the performance of the spintronics-based Bayesian inference system (SPINBIS). Experimental results on data fusion applications have shown that SPINBIS could improve energy efficiency by about 12X over the MTJ-based approach, with 45% design area overhead, and by about 26X over the FPGA-based approach. Comment: 14 pages, 26 figures, accepted by IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

arXiv.org e-Print Archive

S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks

Author: Cheng Xingzhou
Dai Pengcheng
Fu Wenzhi
Yang Jianlei
Ye Xucheng
Zhao Weisheng
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 15/06/2021

Convolutional neural networks (CNNs) have achieved great success in performing cognitive tasks. However, execution of CNNs requires a large amount of computing resources and generates heavy memory traffic, which imposes a severe challenge on computing system design. Through optimizing parallel executions and data reuse in convolution, systolic architecture demonstrates great advantages in accelerating CNN computations. However, the regular internal data transmission path in traditional systolic architecture prevents it from completely leveraging the benefits introduced by neural network sparsity. Deployment of fine-grained sparsity on the existing systolic architectures is greatly hindered by the incurred computational overheads. In this work, we propose S2Engine, a novel systolic architecture that can fully exploit the sparsity in CNNs with maximized data reuse. S2Engine transmits compressed data internally and allows each processing element to dynamically select aligned data from the compressed dataflow in convolution. Compared to the naive systolic array, S2Engine achieves about $3.2\times$ and about $3.0\times$ improvements on speed and energy efficiency, respectively. Comment: 13 pages, 17 figures

arXiv.org e-Print Archive
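
The S2Engine abstract above describes processing elements that pick aligned operands out of compressed data streams. As a loose software illustration of that general idea, and not the S2Engine microarchitecture, the sketch below stores each vector as sorted (index, value) pairs with zeros dropped and multiplies only where the indices match. The helper names `compress` and `sparse_dot` and the sample vectors are assumptions.

```python
def compress(dense):
    """Drop zeros and keep (index, value) pairs in index order."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]


def sparse_dot(a, b):
    """Dot product of two compressed vectors.

    Two cursors walk the index-sorted streams and a multiply-accumulate
    is issued only when both streams hold a value at the same index.
    Returns (result, number_of_multiplies).
    """
    i = j = 0
    total = 0
    macs = 0
    while i < len(a) and j < len(b):
        idx_a, val_a = a[i]
        idx_b, val_b = b[j]
        if idx_a == idx_b:
            total += val_a * val_b
            macs += 1
            i += 1
            j += 1
        elif idx_a < idx_b:
            i += 1  # no matching weight for this feature, skip it
        else:
            j += 1  # no matching feature for this weight, skip it
    return total, macs


# Hypothetical feature and weight vectors, for illustration only.
features = [0, 3, 0, 0, 2, 0, 1, 0]
weights = [1, 0, 0, 4, 5, 0, 2, 0]
print(sparse_dot(compress(features), compress(weights)))  # (12, 2)
```

Here only two multiplications are issued instead of eight, which is the kind of saving that fine-grained sparsity offers when operands can be aligned cheaply.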

Efficient Computation Reduction in Bayesian Neural Networks Through Feature Decomposition and Memorization

Author: Cotofana Sorin Dan
Jia Xiaotao
Liu Runze
Wang Xueyan
Yang Jianlei
Zhao Weisheng
Publication date: 08/05/2020

The Bayesian method is capable of capturing real-world uncertainties and incompleteness and of properly addressing the over-fitting issue faced by deep neural networks. In recent years, Bayesian Neural Networks (BNNs) have drawn tremendous attention from AI researchers and have proved successful in many applications. However, their high computation complexity makes BNNs difficult to deploy in computing systems with a limited power budget. In this paper, an efficient BNN inference flow is proposed to reduce the computation cost and is then evaluated by means of both software and hardware implementations. A feature decomposition and memorization (\texttt{DM}) strategy is utilized to reform the BNN inference flow in a reduced manner. About half of the computations can be eliminated compared to the traditional approach, as shown by theoretical analysis and software validation. Subsequently, in order to resolve the hardware resource limitations, a memory-friendly computing framework is further deployed to reduce the memory overhead introduced by the \texttt{DM} strategy. Finally, we implement our approach in Verilog and synthesise it with 45 nm FreePDK technology. Hardware simulation results on multi-layer BNNs demonstrate that, when compared with the traditional BNN inference method, it provides an energy consumption reduction of 73\% and a $4\times$ speedup at the expense of 14\% area overhead. Comment: accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

arXiv.org e-Print Archive

Exploiting Spin-Orbit Torque Devices as Reconfigurable Logic for Circuit Obfuscation

Author: Chen Yiran
Hai
Li
Wang Xueyan
Wang Zhaohao
Yang Jianlei
Zhao Weisheng
Zhou Qiang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 08/02/2018

Circuit obfuscation is a frequently used approach to conceal logic functionalities in order to prevent reverse engineering attacks on fabricated chips. Efficient obfuscation implementations are expected to have low design complexity and overhead but high attack difficulty. In this paper, an emerging obfuscation approach is proposed that uses look-up tables (LUTs) based on spin-orbit torque (SOT) devices as reconfigurable logic to replace carefully selected gates. It is essentially impossible to identify an obfuscated gate with SOTs inside from its physical geometry characteristics, because the configured functionalities are represented by magnetization states. This obfuscation approach further improves circuit security by imposing exponential attack complexity. Experiments on the MCNC and ISCAS 85/89 benchmark suites show that the proposed approach could reduce the area overhead due to obfuscation by 10% on average. Comment: 14 pages, 21 figures

arXiv.org e-Print Archive
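
The obfuscation abstract above relies on look-up tables whose stored configuration, rather than their layout, decides which gate they implement. The sketch below models a generic 2-input LUT in software to show why the same structure can stand in for any of the 16 possible 2-input functions. It says nothing about SOT device behavior, and the names `make_lut`, `AND_CFG`, `XOR_CFG`, and `NAND_CFG` are assumptions.

```python
def make_lut(config_bits):
    """Build a 2-input LUT from 4 configuration bits.

    config_bits[i] is the output for inputs (a, b) with i = 2 * a + b.
    """
    if len(config_bits) != 4 or any(bit not in (0, 1) for bit in config_bits):
        raise ValueError("a 2-input LUT needs exactly four 0/1 bits")
    table = tuple(config_bits)

    def lut(a, b):
        return table[2 * a + b]

    return lut


# Identical structure, different stored bits, different logic function.
AND_CFG = (0, 0, 0, 1)
XOR_CFG = (0, 1, 1, 0)
NAND_CFG = (1, 1, 1, 0)

for name, cfg in (("AND", AND_CFG), ("XOR", XOR_CFG), ("NAND", NAND_CFG)):
    gate = make_lut(cfg)
    print(name, [gate(a, b) for a in (0, 1) for b in (0, 1)])
```

An attacker who sees only the LUT structure must consider all 2^4 = 16 candidate functions per obfuscated gate, so the search space grows exponentially with the number of replaced gates.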

NAND-SPIN-Based Processing-in-MRAM Architecture for Convolutional Neural Network Acceleration

Author: Cheng Xingzhou
Jia Xiaotao
Li Bing
Wang Xueyan
Wang Zhaohao
Yang Jianlei
Ye Xucheng
Zhang Youguang
Zhao Weisheng
Zhao Yinglin
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 21/04/2022

The performance and efficiency of running large-scale datasets on traditional computing systems exhibit critical bottlenecks due to the existing "power wall" and "memory wall" problems. To resolve those problems, processing-in-memory (PIM) architectures are developed to bring computation logic in or near memory to alleviate the bandwidth limitations during data transmission. NAND-like spintronics memory (NAND-SPIN) is one kind of promising magnetoresistive random-access memory (MRAM) with low write energy and high integration density, and it can be employed to perform efficient in-memory computation operations. In this work, we propose a NAND-SPIN-based PIM architecture for efficient convolutional neural network (CNN) acceleration. A straightforward data mapping scheme is exploited to improve the parallelism while reducing data movements. Benefiting from the excellent characteristics of NAND-SPIN and in-memory processing architecture, experimental results show that the proposed approach can achieve $\sim 2.6\times$ speedup and $\sim 1.4\times$ improvement in energy efficiency over state-of-the-art PIM solutions. Comment: 15 pages, accepted by SCIENCE CHINA Information Sciences (SCIS) 202

arXiv.org e-Print Archive

TCIM: Triangle Counting Acceleration With Processing-In-MRAM Architecture

Author: Chen Xiaoming
Cheng Xingzhou
Jia Xiaotao
Liu Meichen
Qi Yingjie
Qu Gang
Wang Xueyan
Yang Jianlei
Zhao Weisheng
Zhao Yinglin
Publication date: 21/07/2020

Triangle counting (TC) is a fundamental problem in graph analysis and has found numerous applications, which motivates many TC acceleration solutions in the traditional computing platforms like GPU and FPGA. However, these approaches suffer from the bandwidth bottleneck because TC calculation involves a large amount of data transfers. In this paper, we propose to overcome this challenge by designing a TC accelerator utilizing the emerging processing-in-MRAM (PIM) architecture. The true innovation behind our approach is a novel method to perform TC with bitwise logic operations (such as \texttt{AND}), instead of the traditional approaches such as matrix computations. This enables the efficient in-memory implementations of TC computation, which we demonstrate in this paper with computational Spin-Transfer Torque Magnetic RAM (STT-MRAM) arrays. Furthermore, we develop customized graph slicing and mapping techniques to speed up the computation and reduce the energy consumption. We use a device-to-architecture co-simulation framework to validate our proposed TC accelerator. The results show that our data mapping strategy could reduce $99.99\%$ of the computation and $72\%$ of the memory \texttt{WRITE} operations. Compared with the existing GPU or FPGA accelerators, our in-memory accelerator achieves speedups of $9\times$ and $23.4\times$, respectively, and a $20.6\times$ energy efficiency improvement over the FPGA accelerator. Comment: published on DAC 202

arXiv.org e-Print Archive
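
The TCIM abstract above performs triangle counting with bitwise logic such as AND instead of matrix computations. A minimal software sketch of that general idea follows: each adjacency row is stored as an integer bitmask, and for every edge the common neighbors of its endpoints are found with one AND followed by a bit count. This is a textbook formulation under assumed names (`count_triangles`), not the paper's graph slicing or in-memory mapping scheme.

```python
def count_triangles(n, edges):
    """Count triangles in an undirected simple graph with vertices 0..n-1.

    adj[u] is a bitmask whose bit v is set when u and v are adjacent.
    For each edge (u, v) with u < v, popcount(adj[u] & adj[v]) is the
    number of vertices adjacent to both, i.e. triangles on that edge.
    Every triangle is seen once per edge, so the sum is divided by 3.
    """
    adj = [0] * n
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj[u] |= 1 << v
        adj[v] |= 1 << u

    total = 0
    for u in range(n):
        higher = adj[u] >> (u + 1)  # neighbors of u with index > u
        v = u + 1
        while higher:
            if higher & 1:
                total += bin(adj[u] & adj[v]).count("1")
            higher >>= 1
            v += 1
    return total // 3


# Complete graph on 4 vertices: every 3-subset is a triangle, so 4 in total.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(4, k4))  # 4
```

Because the inner step is only an AND and a population count over bit rows, it maps naturally onto memory arrays that can evaluate bitwise logic in place.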