
    VLSI Implementation of Deep Neural Network Using Integral Stochastic Computing

    The hardware implementation of deep neural networks (DNNs) has recently received tremendous attention: many applications in fact require high-speed operations that suit a hardware implementation. However, numerous elements and complex interconnections are usually required, leading to a large area occupation and copious power consumption. Stochastic computing has shown promising results for low-power, area-efficient hardware implementations, even though existing stochastic algorithms require long streams that cause long latencies. In this paper, we propose an integer form of stochastic computation and introduce some elementary circuits. We then propose an efficient implementation of a DNN based on integral stochastic computing. The proposed architecture has been implemented on a Virtex7 FPGA, resulting in 45% and 62% average reductions in area and latency compared to the best reported architecture in the literature. We also synthesize the circuits in a 65 nm CMOS technology and show that the proposed integral stochastic architecture results in up to 21% reduction in energy consumption compared to the binary radix implementation at the same misclassification rate. Due to the fault-tolerant nature of stochastic architectures, we also consider a quasi-synchronous implementation, which yields a 33% reduction in energy consumption w.r.t. the binary radix implementation without any compromise on performance. Comment: 11 pages, 12 figures
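
    As a rough illustration of the idea, the sketch below models unipolar stochastic streams in Python: multiplying two streams is a bitwise AND, and the integral form sums streams element-wise into small integers, avoiding the scaling loss of a MUX-based stochastic adder. The stream length, values, and function names are illustrative only, not the paper's circuits.

        import random

        def sng(x, length, rng):
            """Stochastic number generator: unipolar bitstream with P(bit=1) = x."""
            return [1 if rng.random() < x else 0 for _ in range(length)]

        def to_value(stream):
            """Estimate the encoded value as the stream's mean."""
            return sum(stream) / len(stream)

        rng = random.Random(0)
        N = 10_000
        a, b = 0.8, 0.9
        sa, sb = sng(a, N, rng), sng(b, N, rng)

        # Conventional stochastic multiply: bitwise AND of independent streams.
        prod = [x & y for x, y in zip(sa, sb)]

        # Integral stochastic add: element-wise integer sum. The stream now carries
        # values in {0, 1, 2} and represents a+b in [0, 2] directly, without the
        # halving that a MUX-based stochastic adder would introduce.
        integral_sum = [x + y for x, y in zip(sa, sb)]

        print(to_value(prod), a * b)           # ~0.72
        print(to_value(integral_sum), a + b)   # ~1.7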

    Training deep neural networks with low precision multiplications

    Multipliers are the most space- and power-hungry arithmetic operators in the digital implementation of deep neural networks. We train a set of state-of-the-art neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those formats, we assess the impact of the precision of the multiplications on the final error after training. We find that very low precision is sufficient not just for running trained networks but also for training them. For example, it is possible to train Maxout networks with 10-bit multiplications. Comment: 10 pages, 5 figures, accepted as a workshop contribution at ICLR 2015
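
    A minimal sketch of the experiment's core question, in Python with NumPy: round the inputs of a matrix-vector product to a fixed-point grid and watch the result degrade as the fractional precision shrinks. This models plain fixed point only (dynamic fixed point additionally shares a scaling exponent per group of values); the shapes and bit widths are illustrative.

        import numpy as np

        def quantize_fixed(x, frac_bits):
            """Round x to a fixed-point grid with `frac_bits` fractional bits."""
            scale = 2.0 ** frac_bits
            return np.round(x * scale) / scale

        rng = np.random.default_rng(0)
        w = rng.standard_normal((64, 64)) * 0.1   # toy weight matrix
        a = rng.standard_normal(64)               # toy activation vector

        exact = w @ a
        for bits in (16, 10, 6):
            wq, aq = quantize_fixed(w, bits), quantize_fixed(a, bits)
            err = np.abs(wq @ aq - exact).max()
            print(f"{bits} fractional bits: max abs error {err:.2e}")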

    Using Statistical Assertions to Guide Self-Adaptive Systems

    Self-adaptive systems need to monitor themselves to check their internal behaviour and their design assumptions about runtime inputs and conditions. Such monitoring can include collecting statistics about the system itself, which can be computationally intensive (for detailed statistics) and hence time consuming, with a possible negative impact on self-adaptive response time. To mitigate this limitation, we extend the technique of in-circuit runtime assertions to cover statistical assertions in hardware. The presented designs implement several statistical operators that can be exploited by self-adaptive systems; a novel optimization is developed for reducing the number of pairwise operators from O(N) to O(log N). To illustrate the practicability and industrial relevance of our proposed approach, we evaluate our designs, chosen from a class of possible application scenarios, for their resource usage and the trade-offs between hardware and software implementations.
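
    The paper's operator-count optimization is hardware-specific, but the flavor of pairwise statistical reduction feeding a runtime assertion can be sketched in software: a balanced tree combines samples in log2(N) levels, and the resulting statistic is checked against a design bound. This is a loose analogue only; the function names, sample data, and asserted bound are all illustrative.

        def tree_reduce(values, op):
            """Balanced pairwise reduction: log2(N) levels instead of a linear chain."""
            level = list(values)
            while len(level) > 1:
                nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
                if len(level) % 2:          # odd element carries into the next level
                    nxt.append(level[-1])
                level = nxt
            return level[0]

        samples = [3.2, 1.5, 4.8, 2.1, 0.7, 5.3, 2.9, 1.1]   # monitored runtime inputs
        mean = tree_reduce(samples, lambda a, b: a + b) / len(samples)

        # Statistical assertion: flag a violated design assumption at runtime.
        assert mean < 4.0, "runtime mean exceeded the assumed design bound"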

    Floating Point Calculation of the Cube Function on FPGAs

    © 2023 IEEE. The final published paper is available online at: https://doi.org/10.1109/TPDS.2022.3220039. Abstract: Specialized arithmetic units allow fast and efficient computation of less commonly used mathematical functions. The overall impact of such units would be negligible in a general-purpose processor, as the added circuitry makes chips more complex even though most software would seldom use it. By contrast, custom computing machines are built for a specific task, and they can always benefit from specialized units when available. In this work, floating-point architectures are proposed for computing the cube on Intel and Xilinx FPGAs. These implementations reduce the cost and latency compared to using simple floating-point multipliers and squarers. This work was supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00 / AEI / 10.13039/501100011033), and by Xunta de Galicia and FEDER funds of the EU (Centro de Investigación de Galicia accreditation 2019-2022, ref. ED431G 2019/01; Consolidation Program of Competitive Reference Groups, ref. ED431C 2021/30).
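
    To see why a dedicated cube unit can beat two general multiplications, it helps to look at the IEEE-754 fields: the sign passes through unchanged, the exponent is simply tripled, and only the mantissa needs a (specializable) cubing datapath. The Python sketch below mimics that decomposition for normal doubles; it illustrates the principle, not the paper's architecture, and ignores zeros, subnormals, infinities and NaNs.

        import math
        import struct

        def cube_via_fields(x):
            """Cube a double by operating on its IEEE-754 fields separately."""
            bits = struct.unpack(">Q", struct.pack(">d", x))[0]
            sign = bits >> 63
            exp = (bits >> 52) & 0x7FF
            frac = bits & ((1 << 52) - 1)
            m = 1.0 + frac / 2**52      # mantissa in [1, 2)
            e = exp - 1023              # unbiased exponent
            m3 = m * m * m              # the only part needing a real datapath
            e3 = 3 * e                  # exponent handling is trivial
            return (-1.0) ** sign * m3 * 2.0 ** e3

        for x in (1.7, -3.25, 0.5):
            assert math.isclose(cube_via_fields(x), x ** 3)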

    Implementing Homomorphic Encryption Based Secure Feedback Control for Physical Systems

    This paper is about an encryption-based approach to the secure implementation of feedback controllers for physical systems. Specifically, Paillier's homomorphic encryption is used to digitally implement a class of linear dynamic controllers, which includes the commonplace static-gain and PID-type feedback control laws as special cases. The developed implementation is amenable to Field Programmable Gate Array (FPGA) realization. Experimental results, including timing analysis and resource usage characteristics for different encryption key lengths, are presented for the realization of an inverted pendulum controller; as this is an unstable plant, the control is necessarily fast.
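
    Paillier encryption is additively homomorphic, which is what makes linear control laws computable on ciphertexts: adding plaintexts corresponds to multiplying ciphertexts, and scaling a plaintext by an integer gain corresponds to raising the ciphertext to that power. A toy Python sketch follows, with deliberately tiny primes and an integer-scaled static gain; a real encrypted controller, such as the paper's FPGA realization, would use keys of 1024+ bits and a fixed-point encoding for signed values.

        import math
        import random

        # Toy Paillier keypair; the primes are tiny and for illustration only.
        p, q = 2003, 2011
        n, n2 = p * q, (p * q) ** 2
        g = n + 1                                  # standard simple generator choice
        lam = math.lcm(p - 1, q - 1)
        mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
        rng = random.Random(0)

        def encrypt(m):
            r = rng.randrange(1, n)
            while math.gcd(r, n) != 1:
                r = rng.randrange(1, n)
            return (pow(g, m, n2) * pow(r, n, n2)) % n2

        def decrypt(c):
            return ((pow(c, lam, n2) - 1) // n * mu) % n

        # Static-gain control law u = K * x, evaluated on the ciphertext:
        # multiplying the plaintext by K == raising Enc(x) to the power K.
        K, x = 7, 123                              # integer-scaled gain and measurement
        cu = pow(encrypt(x), K, n2)                # homomorphic scalar multiplication
        assert decrypt(cu) == K * x

        # Adding an encrypted bias: ciphertext multiplication adds plaintexts.
        assert decrypt((cu * encrypt(5)) % n2) == K * x + 5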

    TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems

    Processing-in-memory (PIM) promises to alleviate the data movement bottleneck in modern computing systems. However, current real-world PIM systems have the inherent disadvantage that their hardware is more constrained than in conventional processors (CPU, GPU), due to the difficulty and cost of building processing elements near or inside the memory. As a result, general-purpose PIM architectures support fairly limited instruction sets and struggle to execute complex operations such as transcendental functions and other hard-to-calculate operations (e.g., square root). These operations are particularly important for some modern workloads, e.g., activation functions in machine learning applications. In order to provide support for transcendental (and other hard-to-calculate) functions in general-purpose PIM systems, we present TransPimLib, a library that provides CORDIC-based and LUT-based methods for trigonometric functions, hyperbolic functions, exponentiation, logarithm, square root, etc. We develop an implementation of TransPimLib for the UPMEM PIM architecture and perform a thorough evaluation of TransPimLib's methods in terms of performance and accuracy, using microbenchmarks and three full workloads (Blackscholes, Sigmoid, Softmax). We open-source all our code and datasets at https://github.com/CMU-SAFARI/transpimlib. Comment: Our open-source software is available at https://github.com/CMU-SAFARI/transpimlib
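
    For a feel of the CORDIC-based approach, here is a rotation-mode CORDIC sine/cosine in Python: each iteration needs only shifts, adds, and a small angle table, which is what makes the method attractive on constrained PIM processing elements. This float version is for clarity only; a PIM or fixed-point implementation would use scaled integers, and the iteration count is illustrative.

        import math

        def cordic_sincos(theta, iters=32):
            """Rotation-mode CORDIC; theta must lie in roughly [-pi/2, pi/2]."""
            angles = [math.atan(2.0 ** -i) for i in range(iters)]
            gain = 1.0
            for i in range(iters):
                gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
            # Pre-scale by the CORDIC gain so the final vector has unit length.
            x, y, z = gain, 0.0, theta
            for i in range(iters):
                d = 1.0 if z >= 0 else -1.0        # rotate toward z = 0
                x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
                z -= d * angles[i]
            return x, y                            # (cos(theta), sin(theta))

        c, s = cordic_sincos(0.6)
        assert abs(c - math.cos(0.6)) < 1e-6 and abs(s - math.sin(0.6)) < 1e-6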

    Dedicated Hardware for Complex Mathematical Operations

    New hardware FPGA implementations for the efficient computation of division, natural logarithm and the exponential function are proposed. The proposed implementations use a generic floating-point adder and multiplier, plus small additional resources, which are shared to compute the more frequently used multiply and accumulate operations. Hardware sharing improves resource utilization. The computation time has been reduced to only 6 clock cycles for the natural logarithm and exponential function, while division is calculated in 5 clock cycles. The designs are technology-independent, high-throughput computing cores with minimized memory requirements, which can be replicated to significantly increase calculation speed in spectral processing. A new universal arithmetic floating-point unit is also proposed.
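
    One standard way to build division out of a shared floating-point adder and multiplier, as these cores do, is Newton-Raphson iteration on the reciprocal; the Python sketch below shows that technique. It is a generic illustration under the stated normalization assumption, not necessarily the algorithm behind the paper's 5-cycle divider.

        def nr_divide(a, b, iters=4):
            """Compute a/b using only multiplies and adds, via Newton-Raphson
            iteration on the reciprocal: x_{k+1} = x_k * (2 - b * x_k)."""
            # A hardware core would normalize b into [0.5, 1] with exponent
            # arithmetic and seed from a small LUT; this sketch assumes the
            # caller has pre-normalized b and uses the classic linear seed.
            assert 0.5 <= b <= 1.0, "sketch assumes b pre-normalized into [0.5, 1]"
            x = 48.0 / 17.0 - 32.0 / 17.0 * b   # seed accurate to ~6% over the range
            for _ in range(iters):
                x = x * (2.0 - b * x)           # error roughly squares each step
            return a * x

        print(nr_divide(3.0, 0.85), 3.0 / 0.85)   # agree to double precision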