A Scalable Pipelined Dataflow Accelerator for Object Region Proposals on FPGA Platform
Region proposal is critical for object detection, but it is usually a
bottleneck for computational efficiency on traditional control-flow
architectures. We observe that region proposal tasks are potentially well
suited to pipelined parallelism through dataflow-driven acceleration. In this
paper, a scalable pipelined dataflow accelerator is proposed for efficient
region proposals on FPGA platforms. The accelerator processes image data in a
streaming manner with three sequential stages: resizing, kernel computing and
sorting. First, a Ping-Pong cache strategy is adopted for rotation loading in
the resize module to guarantee continuous output streaming. Then, a
multi-pipeline architecture with tiered memory is used in the kernel computing
module to complete the main computation tasks. Finally, a bubble-pushing heap
sort method is used in the sorting module to find the top-k largest candidates
efficiently. Our design is implemented with high-level synthesis on FPGA
platforms, and experimental results on the VOC2007 dataset show that it
achieves about 3.67X speedup over a traditional desktop CPU platform and >250X
energy efficiency improvement over an embedded ARM platform.
Comment: accepted by FPT 2018 Conference
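The goal of the sorting stage, finding the top-k largest candidates, can be illustrated in software with a bounded min-heap. This is a minimal Python sketch under stated assumptions, not the bubble-pushing hardware design from the paper; the function name `top_k_candidates` and the example scores are hypothetical.

```python
import heapq

def top_k_candidates(scores, k):
    """Return the k largest scores in descending order.

    Keeps a min-heap of at most k items, so each new score is compared
    only against the smallest score currently retained.
    """
    if k <= 0:
        return []
    heap = []
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            # Replace the current smallest retained score.
            heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True)

best = top_k_candidates([0.1, 0.9, 0.4, 0.7, 0.3], 3)
```

The heap never grows beyond k entries, which is the property that makes this kind of selection attractive for streaming input.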
Spintronics based Stochastic Computing for Efficient Bayesian Inference System
Bayesian inference is an effective approach for solving statistical learning
problems especially with uncertainty and incompleteness. However, inference
efficiencies are physically limited by the bottlenecks of conventional
computing platforms. In this paper, an emerging Bayesian inference system is
proposed by exploiting spintronics-based stochastic computing. A stochastic
bitstream generator is realized as the kernel component by leveraging the
inherent randomness of spintronic devices. The proposed system is evaluated on
typical applications of data fusion and Bayesian belief networks. Simulation
results indicate that the proposed approach could achieve significant
improvement in inference efficiency in terms of power consumption and
inference speed.
Comment: accepted by ASPDAC 2018 conference
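As background on stochastic computing in general: a probability is encoded as the fraction of 1s in a random bitstream, and multiplying two independent streams reduces to a bitwise AND. The sketch below is a software illustration only; it assumes a seeded pseudo-random generator as a stand-in for the spintronic bitstream generator, and the names `bitstream` and `sc_multiply` are hypothetical.

```python
import random

def bitstream(p, n, rng):
    """Generate n bits, each equal to 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(p_a, p_b, n=100_000, seed=0):
    """Estimate p_a * p_b by AND-ing two independent stochastic bitstreams."""
    rng = random.Random(seed)  # stand-in for a hardware randomness source
    a = bitstream(p_a, n, rng)
    b = bitstream(p_b, n, rng)
    ones = sum(x & y for x, y in zip(a, b))
    return ones / n

estimate = sc_multiply(0.6, 0.5)  # close to 0.3 for long streams
```

Longer streams reduce the estimation error at the cost of more bits per operation, which is the basic accuracy and latency trade-off of this style of computing.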
Hardware Security in Spin-Based Computing-In-Memory: Analysis, Exploits, and Mitigation Techniques
Computing-in-memory (CIM) has been proposed to alleviate the processor-memory
data transfer bottleneck in traditional Von Neumann architectures, and
spintronics-based magnetic memory has demonstrated many advantages in
implementing the CIM paradigm. Since hardware security has become one of the
major concerns in circuit design, this paper, for the first time, investigates
spin-based computing-in-memory (SpinCIM) from a security perspective. We focus
on two fundamental questions: 1) How can the new SpinCIM computing paradigm be
exploited to enhance hardware security? 2) What security concerns does this new
SpinCIM computing paradigm introduce?
Comment: accepted by ACM Journal on Emerging Technologies in Computing Systems
(JETC)
ELFISH: Resource-Aware Federated Learning on Heterogeneous Edge Devices
In this work, we propose ELFISH - a resource-aware federated learning
framework to tackle computation stragglers in federated learning. In ELFISH,
the training consumption of neural network models is first profiled in terms
of different computation resources. Guided by profiling, a "soft-training"
method is proposed for straggler acceleration, which partially trains the model
by masking a particular number of resource-intensive neurons. Rather than
generating a deterministically optimized model with a diverged structure,
different sets of neurons are dynamically masked in every training cycle and
are recovered and updated during parameter aggregation, ensuring
comprehensive model updates over time. A corresponding parameter aggregation
scheme is also proposed to balance the contribution from soft-trained models
and guarantee collaborative convergence. As a result, ELFISH overcomes the
computational heterogeneity of edge devices and achieves synchronized
collaboration without computational stragglers. Experiments show that ELFISH
can provide up to 2x training acceleration with soft-training in various
straggler settings. Furthermore, benefiting from the proposed parameter
aggregation scheme, ELFISH improves model accuracy by 4% with even better
collaborative convergence robustness.
Comment: 6 pages, 5 figures
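The masking idea behind soft-training can be sketched as follows. This is an illustrative assumption, not the ELFISH algorithm itself: the abstract does not specify the aggregation rule, so the sketch simply averages each neuron's parameter over the clients that actually trained it and keeps the previous global value otherwise. The names `random_mask` and `aggregate` are hypothetical, and each neuron is reduced to a single scalar weight for brevity.

```python
import random

def random_mask(num_neurons, num_masked, rng):
    """Return a 0/1 list where 0 marks a neuron skipped in this training cycle."""
    masked = set(rng.sample(range(num_neurons), num_masked))
    return [0 if i in masked else 1 for i in range(num_neurons)]

def aggregate(global_w, client_ws, client_masks):
    """Average each neuron over the clients that trained it (assumed rule).

    A neuron that no client trained keeps its previous global value.
    """
    new_w = []
    for i, g in enumerate(global_w):
        trained = [w[i] for w, m in zip(client_ws, client_masks) if m[i]]
        new_w.append(sum(trained) / len(trained) if trained else g)
    return new_w

rng = random.Random(0)
mask = random_mask(10, 3, rng)  # a straggler skips 3 of 10 neurons this cycle
updated = aggregate(
    [0.0, 0.0, 0.0],
    [[1.0, 2.0, 9.0], [3.0, 9.0, 9.0]],
    [[1, 1, 0], [1, 0, 0]],
)
```

Drawing a fresh mask every cycle means no neuron is permanently excluded, which matches the abstract's point about comprehensive updates over time.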
SPINBIS: Spintronics based Bayesian Inference System with Stochastic Computing
Bayesian inference is an effective approach for solving statistical learning
problems, especially with uncertainty and incompleteness. However, Bayesian
inference is a computing-intensive task whose efficiency is physically limited
by the bottlenecks of conventional computing platforms. In this work, a
spintronics based stochastic computing approach is proposed for efficient
Bayesian inference. The inherent stochastic switching behaviors of spintronic
devices are exploited to build a stochastic bitstream generator (SBG) for
stochastic computing with a hybrid CMOS/MTJ circuit design. To improve
inference efficiency, an SBG sharing strategy is used to reduce the
required SBG array scale by integrating a switch network between the SBG array
and the stochastic computing logic. A device-to-architecture level framework is
proposed to evaluate the performance of the spintronics-based Bayesian
inference system (SPINBIS). Experimental results on data fusion applications
show that SPINBIS could improve energy efficiency by about 12X over an
MTJ-based approach, with 45% design area overhead, and by about 26X over an
FPGA-based approach.
Comment: 14 pages, 26 figures, accepted by IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems
S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks
Convolutional neural networks (CNNs) have achieved great success in
performing cognitive tasks. However, execution of CNNs requires a large amount
of computing resources and generates heavy memory traffic, which imposes a
severe challenge on computing system design. Through optimizing parallel
executions and data reuse in convolution, systolic architecture demonstrates
great advantages in accelerating CNN computations. However, the regular
internal data transmission path in traditional systolic architectures prevents
them from fully leveraging the benefits introduced by
neural network sparsity. Deployment of fine-grained sparsity on existing
systolic architectures is greatly hindered by the incurred computational
overheads. In this work, we propose S2Engine, a novel systolic architecture
that can fully exploit the sparsity in CNNs with maximized data reuse. S2Engine
transmits compressed data internally and allows each processing element to
dynamically select aligned data from the compressed dataflow in convolution.
Compared to the naive systolic array, S2Engine achieves about and
about improvements on speed and energy efficiency, respectively.
Comment: 13 pages, 17 figures
Efficient Computation Reduction in Bayesian Neural Networks Through Feature Decomposition and Memorization
Bayesian methods are capable of capturing real-world
uncertainty and incompleteness and of properly addressing the over-fitting
issue faced by deep neural networks. In recent years, Bayesian Neural Networks
(BNNs) have drawn tremendous attention from AI researchers and proved
successful in many applications. However, their high computational complexity
makes BNNs difficult to deploy in computing systems with a limited power budget.
In this paper, an efficient BNN inference flow is proposed to reduce the
computation cost, and it is evaluated by means of both software and hardware
implementations. A feature decomposition and memorization (DM)
strategy is used to reform the BNN inference flow in a reduced manner.
About half of the computations could be eliminated compared to the traditional
approach, as shown by theoretical analysis and software validations.
Subsequently, in order to resolve the hardware resource limitations, a
memory-friendly computing framework is further deployed to reduce the memory
overhead introduced by the DM strategy. Finally, we implement our approach
in Verilog and synthesise it with 45 nm FreePDK technology. Hardware
simulation results on multi-layer BNNs demonstrate that, when compared with the
traditional BNN inference method, it provides an energy consumption reduction
of 73% and a 4× speedup at the expense of 14% area overhead.
Comment: accepted by IEEE Transactions on Neural Networks and Learning Systems
(TNNLS)
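One plausible reading of "feature decomposition and memorization" can be sketched for a single neuron. This is an assumption, not the authors' stated formulation: it supposes Gaussian weights written as w = mu + sigma * eps, so the output splits into a mean term and a noise term whose input-dependent factors can be computed once and reused across Monte Carlo samples. The function names and example values are hypothetical.

```python
def bnn_neuron_naive(x, mu, sigma, eps_samples):
    """Sample full weights for every draw, then take the dot product."""
    outputs = []
    for eps in eps_samples:
        w = [m + s * e for m, s, e in zip(mu, sigma, eps)]
        outputs.append(sum(xi * wi for xi, wi in zip(x, w)))
    return outputs

def bnn_neuron_dm(x, mu, sigma, eps_samples):
    """Decompose the output and memorize the parts that do not depend on eps."""
    mean_part = sum(xi * m for xi, m in zip(x, mu))  # computed once
    scaled = [xi * s for xi, s in zip(x, sigma)]     # computed once
    return [
        mean_part + sum(c * e for c, e in zip(scaled, eps))
        for eps in eps_samples
    ]

x = [1.0, 2.0]
mu = [0.5, -1.0]
sigma = [0.1, 0.2]
eps_samples = [[1.0, -1.0], [0.0, 2.0]]
naive = bnn_neuron_naive(x, mu, sigma, eps_samples)
reduced = bnn_neuron_dm(x, mu, sigma, eps_samples)
```

Both functions return the same values up to floating-point rounding; the second performs one multiply-accumulate per input per sample instead of building a full weight vector each time.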
Exploiting Spin-Orbit Torque Devices as Reconfigurable Logic for Circuit Obfuscation
Circuit obfuscation is a frequently used approach to conceal logic
functionalities in order to prevent reverse engineering attacks on fabricated
chips. Efficient obfuscation implementations are expected to have lower design
complexity and overhead but higher attack difficulty. In this paper, an
emerging obfuscation approach is proposed that uses look-up tables (LUTs) based
on spin-orbit torque (SOT) devices as reconfigurable logic to replace
carefully selected gates. It is essentially impossible to identify an
obfuscated gate with SOTs inside from its physical geometry
characteristics, because the configured functionalities are represented by
magnetization states. Such an obfuscation approach further improves circuit
security, with exponentially high attack complexity. Experiments on the MCNC
and ISCAS 85/89 benchmark suites show that the proposed approach could reduce
the area overhead due to obfuscation by 10% on average.
Comment: 14 pages, 21 figures
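The reason a LUT hides its function can be shown with a small software model: the same structure computes different gates depending only on its stored configuration bits. This is a generic 2-input LUT sketch, not the SOT circuit from the paper; the name `lut2` and the configuration constants are hypothetical.

```python
def lut2(config, a, b):
    """Evaluate a 2-input LUT.

    config is a 4-bit integer; the output is the bit at index (a << 1) | b.
    """
    return (config >> ((a << 1) | b)) & 1

AND_CFG = 0b1000  # only input (1, 1) maps to 1
XOR_CFG = 0b0110  # inputs (0, 1) and (1, 0) map to 1

inputs = [(a, b) for a in (0, 1) for b in (0, 1)]
and_table = [lut2(AND_CFG, a, b) for a, b in inputs]
xor_table = [lut2(XOR_CFG, a, b) for a, b in inputs]
```

An attacker who sees only the LUT structure must consider all 16 possible 2-input functions per replaced gate, so the search space grows exponentially with the number of obfuscated gates.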
NAND-SPIN-Based Processing-in-MRAM Architecture for Convolutional Neural Network Acceleration
The performance and efficiency of running large-scale datasets on traditional
computing systems exhibit critical bottlenecks due to the existing "power wall"
and "memory wall" problems. To resolve those problems, processing-in-memory
(PIM) architectures are developed to bring computation logic in or near memory
to alleviate the bandwidth limitations during data transmission. NAND-like
spintronics memory (NAND-SPIN) is one kind of promising magnetoresistive
random-access memory (MRAM) with low write energy and high integration density,
and it can be employed to perform efficient in-memory computation operations.
In this work, we propose a NAND-SPIN-based PIM architecture for efficient
convolutional neural network (CNN) acceleration. A straightforward data mapping
scheme is exploited to improve the parallelism while reducing data movements.
Benefiting from the excellent characteristics of NAND-SPIN and in-memory
processing architecture, experimental results show that the proposed approach
can achieve 2.6× speedup and 1.4× improvement in
energy efficiency over state-of-the-art PIM solutions.
Comment: 15 pages, accepted by SCIENCE CHINA Information Sciences (SCIS) 202
TCIM: Triangle Counting Acceleration With Processing-In-MRAM Architecture
Triangle counting (TC) is a fundamental problem in graph analysis and has
found numerous applications, which has motivated many TC acceleration solutions
on traditional computing platforms such as GPUs and FPGAs. However, these
approaches suffer from a bandwidth bottleneck because TC calculation involves
a large amount of data transfer. In this paper, we propose to overcome this
challenge by designing a TC accelerator utilizing the emerging
processing-in-MRAM (PIM) architecture. The true innovation behind our approach
is a novel method to perform TC with bitwise logic operations (such as
AND) instead of traditional approaches such as matrix
computations. This enables efficient in-memory implementations of TC
computation, which we demonstrate in this paper with computational
Spin-Transfer Torque Magnetic RAM (STT-MRAM) arrays. Furthermore, we develop
customized graph slicing and mapping techniques to speed up the computation and
reduce the energy consumption. We use a device-to-architecture co-simulation
framework to validate our proposed TC accelerator. The results show that our
data mapping strategy could reduce of the computation and of
the memory WRITE operations. Compared with the existing GPU or FPGA
accelerators, our in-memory accelerator achieves speedups of and
, respectively, and a energy efficiency improvement
over the FPGA accelerator.
Comment: published at DAC 202
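Triangle counting with bitwise AND can be illustrated in software by storing each adjacency row as an integer bitmask. This is a minimal sketch of the general technique, not the paper's slicing and mapping scheme; it assumes an undirected simple graph (no self-loops, no duplicate edges), and the name `count_triangles` is hypothetical.

```python
def count_triangles(num_vertices, edges):
    """Count triangles using bitwise AND on adjacency bitmasks."""
    adj = [0] * num_vertices
    for u, v in edges:
        adj[u] |= 1 << v
        adj[v] |= 1 << u
    total = 0
    for u, v in edges:
        # Set bits in the AND are the common neighbours of u and v,
        # each of which closes a triangle over the edge (u, v).
        total += bin(adj[u] & adj[v]).count("1")
    # Every triangle is found once per edge, so three times in total.
    return total // 3

k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
triangles = count_triangles(4, k4_edges)
```

The inner step is just an AND followed by a bit count, which is why the operation maps naturally onto memory arrays that support bitwise logic.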