174 research outputs found
VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing
The hardware implementation of deep neural networks (DNNs) has recently
received tremendous attention: many applications in fact require high-speed
operations that suit a hardware implementation. However, numerous elements and
complex interconnections are usually required, leading to a large area
occupation and copious power consumption. Stochastic computing has shown
promising results for low-power area-efficient hardware implementations, even
though existing stochastic algorithms require long streams that cause long
latencies. In this paper, we propose an integer form of stochastic computation
and introduce some elementary circuits. We then propose an efficient
implementation of a DNN based on integral stochastic computing. The proposed
architecture has been implemented on a Virtex7 FPGA, resulting in 45% and 62%
average reductions in area and latency compared to the best reported
architecture in literature. We also synthesize the circuits in a 65 nm CMOS
technology and we show that the proposed integral stochastic architecture
results in up to 21% reduction in energy consumption compared to the binary
radix implementation at the same misclassification rate. Due to fault-tolerant
nature of stochastic architectures, we also consider a quasi-synchronous
implementation which yields 33% reduction in energy consumption w.r.t. the
binary radix implementation without any compromise on performance.Comment: 11 pages, 12 figure
Customisable arithmetic hardware designs
Imperial Users onl
Training deep neural networks with low precision multiplications
Multipliers are the most space and power-hungry arithmetic operators of the
digital implementation of deep neural networks. We train a set of
state-of-the-art neural networks (Maxout networks) on three benchmark datasets:
MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats:
floating point, fixed point and dynamic fixed point. For each of those datasets
and for each of those formats, we assess the impact of the precision of the
multiplications on the final error after training. We find that very low
precision is sufficient not just for running trained networks but also for
training them. For example, it is possible to train Maxout networks with 10
bits multiplications.Comment: 10 pages, 5 figures, Accepted as a workshop contribution at ICLR 201
Using Statistical Assertions to Guide Self-Adaptive Systems
Self-adaptive systems need to monitor themselves, to check their internal behaviour and design assumptions about runtime inputs and conditions. This kind of monitoring for self-adaptive systems can include collecting statistics about such systems themselves which can be computationally intensive (for detailed statistics) and hence time consuming, with possible negative impact on self-adaptive response time. To mitigate this limitation, we extend the technique of in-circuit runtime assertions to cover statistical assertions in hardware. The presented designs implement several statistical operators that can be exploited by self-adaptive systems; a novel optimization is developed for reducing the number of pairwise operators from ON to OlogâĄN. To illustrate the practicability and industrial relevance of our proposed approach, we evaluate our designs, chosen from a class of possible application scenarios, for their resource usage and the tradeoffs between hardware and software implementations
Floating Point Calculation of the Cube Function on FPGAs
© 2023 IEEE. This version of the paper has been accepted for publication. Personal use of
this material is permitted. Permission from IEEE must be obtained for all other uses, in
any current or future media, including reprinting/republishing this material for
advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in
other works.
The final published paper is available online at:
https://doi.org/10.1109/TPDS.2022.3220039[Abstract]: Specialized arithmetic units allow fast and efficient computation of lesser used mathematical functions. The overall impact of those units would be negligible in a general purpose processor, as added circuitry makes chips more complex despite most software would seldom make use of it. On the opposite side, custom computing machines are built for a specific task, and they can always benefit from specialized units if they are available. In this work, floating point architectures are proposed for computing the cube on Intel and Xilinx FPGAs. Those implementations reduce the cost and latency compared to using simple floating point multiplications and squarers.This work was supported by the Ministry of Science
and Innovation of Spain (PID2019-104184RB-I00 / AEI
/ 10.13039/501100011033), and by Xunta de Galicia and
FEDER funds of the EU (Centro de InvestigaciÂŽon de Galicia
accreditation 2019â2022, ref. ED431G 2019/01; Consolidation
Program of Competitive Reference Groups, ref. ED431C
2021/30).Xunta de Galicia; ED431G 2019/01Xunta de Galicia; ED431C 2021/3
Implementing Homomorphic Encryption Based Secure Feedback Control for Physical Systems
This paper is about an encryption based approach to the secure implementation
of feedback controllers for physical systems. Specifically, Paillier's
homomorphic encryption is used to digitally implement a class of linear dynamic
controllers, which includes the commonplace static gain and PID type feedback
control laws as special cases. The developed implementation is amenable to
Field Programmable Gate Array (FPGA) realization. Experimental results,
including timing analysis and resource usage characteristics for different
encryption key lengths, are presented for the realization of an inverted
pendulum controller; as this is an unstable plant, the control is necessarily
fast
TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems
Processing-in-memory (PIM) promises to alleviate the data movement bottleneck
in modern computing systems. However, current real-world PIM systems have the
inherent disadvantage that their hardware is more constrained than in
conventional processors (CPU, GPU), due to the difficulty and cost of building
processing elements near or inside the memory. As a result, general-purpose PIM
architectures support fairly limited instruction sets and struggle to execute
complex operations such as transcendental functions and other hard-to-calculate
operations (e.g., square root). These operations are particularly important for
some modern workloads, e.g., activation functions in machine learning
applications.
In order to provide support for transcendental (and other hard-to-calculate)
functions in general-purpose PIM systems, we present \emph{TransPimLib}, a
library that provides CORDIC-based and LUT-based methods for trigonometric
functions, hyperbolic functions, exponentiation, logarithm, square root, etc.
We develop an implementation of TransPimLib for the UPMEM PIM architecture and
perform a thorough evaluation of TransPimLib's methods in terms of performance
and accuracy, using microbenchmarks and three full workloads (Blackscholes,
Sigmoid, Softmax). We open-source all our code and datasets
at~\url{https://github.com/CMU-SAFARI/transpimlib}.Comment: Our open-source software is available at
https://github.com/CMU-SAFARI/transpimli
Dedicated Hardware for Complex Mathematical Operations
New hardware FPGA implementations for the efficient computations of division, natural logarithm and exponential function are proposed. The proposed implementations use generic floating-point adder and multiplier with small additional resources that are shared to compute more frequently used multiply and accumulate operations. Hardware sharing improved the resource utilization. The time of the computation has been reduced to only 6 clock cycles when the natural logarithm and exponential function are calculated. The division is calculated in 5 clock cycles. They are designed as technology independent high throughput computing cores with minimized memory requirements which can be used in higher numbers to significantly increased calculation speed in spectral processing. A new universal arithmetic floating-point unit is also proposed
- âŠ