
    Design Space Exploration of Neural Network Activation Function Circuits

    The widespread application of artificial neural networks has prompted researchers to experiment with FPGA and customized ASIC designs to speed up their computation. These implementation efforts have generally focused on weight multiplication and signal summation operations, and less on the activation functions used in these applications. Yet efficient hardware implementations of nonlinear activation functions such as Exponential Linear Units (ELU), Scaled Exponential Linear Units (SELU), and Hyperbolic Tangent (tanh) are central to designing effective neural network accelerators, since these functions are resource-intensive. In this paper, we explore efficient hardware implementations of activation functions using purely combinational circuits, with a focus on two widely used nonlinear activation functions, SELU and tanh. Our experiments demonstrate that neural networks are generally insensitive to the precision of the activation function. The results also show that the proposed combinational circuit-based approach is very efficient in terms of speed and area, with negligible accuracy loss on the MNIST, CIFAR-10 and IMAGENET benchmarks. Synopsys Design Compiler synthesis results show that circuit designs for tanh and SELU can save between 3.13-7.69× and 4.45-8.45× area, respectively, compared to the LUT/memory-based implementations, and can operate at 5.14GHz and 4.52GHz, respectively, using the 28nm SVT library. The implementation is available at: https://github.com/ThomasMrY/ActivationFunctionDemo. Comment: 5 pages, 5 figures, 16 conferenc
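
    The claim that networks tolerate low-precision activations can be made concrete with a small software experiment. The sketch below is not the authors' circuit; it only rounds the outputs of tanh and SELU to a chosen number of fractional bits and measures the worst-case rounding error. The SELU constants are the standard published values, and the helper names, input range, and sample count are assumptions made for illustration.

```python
import math

# Standard SELU constants; they are not stated in the abstract above.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x: float) -> float:
    """Scaled Exponential Linear Unit."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def quantize(y: float, frac_bits: int) -> float:
    """Round y to the nearest multiple of 2**-frac_bits."""
    step = 2.0 ** -frac_bits
    return round(y / step) * step


def max_quantization_error(fn, frac_bits, lo=-4.0, hi=4.0, samples=2001):
    """Worst absolute error caused by rounding fn's output, over a uniform sweep."""
    worst = 0.0
    for i in range(samples):
        x = lo + (hi - lo) * i / (samples - 1)
        y = fn(x)
        worst = max(worst, abs(y - quantize(y, frac_bits)))
    return worst


if __name__ == "__main__":
    for bits in (4, 6, 8):
        print(bits,
              max_quantization_error(math.tanh, bits),
              max_quantization_error(selu, bits))
```

    The rounding error is bounded by half of one output step, so each extra fractional bit halves the worst case; whether a given network tolerates that error still has to be checked on the target benchmark.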

    NACU: A Non-Linear Arithmetic Unit for Neural Networks

    Reconfigurable architectures targeting neural networks are an attractive option. They allow multiple neural networks of different types to be hosted on the same hardware, in parallel or sequence. Reconfigurability also grants the ability to morph into different micro-architectures to meet varying power-performance constraints. In this context, the need for a reconfigurable non-linear computational unit has not been widely researched. In this work, we present a formal and comprehensive method to select the optimal fixed-point representation to achieve the highest accuracy against the floating-point implementation benchmark. We also present a novel design of an optimised reconfigurable arithmetic unit for calculating non-linear functions. The unit can be dynamically configured to calculate the sigmoid, hyperbolic tangent, and exponential function using the same underlying hardware. We compare our work with the state-of-the-art and show that our unit can calculate all three functions without loss of accuracy
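
    To illustrate what selecting a fixed-point representation involves, the sketch below sweeps every split of a word between integer and fractional bits and keeps the split with the smallest worst-case error against the floating-point result. This brute-force sweep is an assumption made for illustration and is not the formal method the paper presents; the function names, the sign-bit convention, and the sample count are hypothetical.

```python
import math


def to_fixed(value: float, int_bits: int, frac_bits: int) -> float:
    """Quantize to signed fixed point with saturation (sign bit not counted)."""
    scale = 1 << frac_bits
    max_code = (1 << (int_bits + frac_bits)) - 1
    min_code = -(1 << (int_bits + frac_bits))
    code = max(min_code, min(max_code, round(value * scale)))
    return code / scale


def best_split(fn, word_bits: int, lo: float, hi: float, samples: int = 1001):
    """Return (max_error, int_bits, frac_bits) for the best split of word_bits."""
    best = None
    for int_bits in range(word_bits + 1):
        frac_bits = word_bits - int_bits
        err = 0.0
        for i in range(samples):
            x = lo + (hi - lo) * i / (samples - 1)
            y = fn(x)
            err = max(err, abs(y - to_fixed(y, int_bits, frac_bits)))
        if best is None or err < best[0]:
            best = (err, int_bits, frac_bits)
    return best


if __name__ == "__main__":
    # exp on [-2, 2] peaks near 7.39, so at least 3 integer bits avoid saturation.
    print(best_split(math.exp, 8, -2.0, 2.0))
```

    Too few integer bits make the large values saturate, while too many waste resolution on the small ones, which is why the error has a clear minimum at one particular split.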

    Evaluation of flexible SPA based LDPC decoder using hardware friendly approximation methods

    Due to the computation-intensive nature of LDPC decoders, much research is directed at efficient implementations of their original algorithm, the sum-product algorithm (SPA). As the "Min-Sum" approximation is basically an overestimation of SPA, this thesis investigates more accurate, yet area-efficient, approximations of SPA in order to select an optimum one. In a general comparison between the main approximation methods (e.g. LUT, PWL, CRI), PWL showed the highest area efficiency. After studying different mathematical formats of SPA, the Soft-XOR based format with a forward-backward scheme was chosen for hardware implementation. Its core function (Soft-XOR) was implemented with the CRI approximation, which achieved the highest efficiency compared to the other approximations. Using this core function, a flexible, pipelined, Soft-XOR based CNU (the computational unit of LDPC decoders) with a forward-backward architecture was developed in 18nm CMOS. The implemented CNU's area and speed can easily be changed at instantiation. A SPA decoder based on the developed CNU was estimated to have an area of 1.6M equivalent gates and a throughput of 10Gb/s, at a frequency of 1.25GHz and for 10 iterations. The decoder uses the IEEE 802.11n Wi-Fi standard with a flooding schedule. The BER/SNR loss, compared to floating-point SPA, is 0.3dB for 10 iterations and less than 0.1dB for 20 iterations.

    "You have to get lost before you can be found," a quote by Jeff Rasley, fits Low Density Parity Check (LDPC) codes well. They were first invented by Gallager in 1962 but were largely lost during the evolution of telecommunication networks because of their high complexity and demanding computations, which the technology of the time could not handle. However, during the late 1990s, the success of turbo codes prompted the rediscovery of LDPC codes.
Recently they have attracted tremendous research interest in the scientific community, as today's technology is advanced enough to make LDPC decoders fully commercial. In a wireless network, information is not simply sent, but first encoded. In a sense, all the transmitted bits are tied together according to some mathematical rules. Therefore, if noise destroys parts of the information in transit, the LDPC decoder at the receiver side can automatically detect and recover those parts based on the other parts. Here, our main focus is on the decoder. For an actual hardware implementation of the decoder, some level of approximation of the ideal algorithm is always necessary, which reduces the accuracy depending on the approximation. Ericsson is developing the next-generation wireless network for 5G, and already possesses the "Min-Sum" approximation of the LDPC decoder. As the current requirements demand more accurate decoders, the goal of this thesis is to evaluate a more accurate but more costly version of the LDPC decoder, as well as its flexibility. Thus, several candidates were selected and evaluated based on their complexity, cost, and error-correction accuracy. After performing several trade-offs, an approximation method is chosen and the corresponding cost is derived. With this acquired data, a trade-off between accuracy and cost can be made, depending on the application
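
    The statement that Min-Sum overestimates SPA can be checked numerically on the two-input check-node operation (Soft-XOR, also written as box-plus) applied to log-likelihood ratios. The sketch below uses the standard textbook identities and is not the thesis's hardware design: it computes the correction terms exactly, whereas the thesis approximates them with LUT, PWL, or CRI methods. All function names are assumptions made for illustration.

```python
import math


def soft_xor_exact(a: float, b: float) -> float:
    """Exact SPA combine of two LLRs: 2*atanh(tanh(a/2)*tanh(b/2))."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))


def soft_xor_min_sum(a: float, b: float) -> float:
    """Min-Sum approximation: product of signs times the smaller magnitude."""
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return sign * min(abs(a), abs(b))


def soft_xor_corrected(a: float, b: float) -> float:
    """Min-Sum plus the two correction terms; algebraically equal to the exact form."""
    return (soft_xor_min_sum(a, b)
            + math.log1p(math.exp(-abs(a + b)))
            - math.log1p(math.exp(-abs(a - b))))


if __name__ == "__main__":
    for a, b in [(1.5, -2.0), (0.4, 0.7), (3.0, 3.0)]:
        print(a, b,
              soft_xor_exact(a, b),
              soft_xor_min_sum(a, b),
              soft_xor_corrected(a, b))
```

    In every case the Min-Sum magnitude is at least as large as the exact one, and the gap is exactly the pair of logarithmic correction terms that a hardware approximation has to reproduce.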

    AHEAD: Automatic Holistic Energy-Aware Design Methodology for MLP Neural Network Hardware Generation in Proactive BMI Edge Devices

    The prediction of a high-level cognitive function based on a proactive brain–machine interface (BMI) control edge device is an emerging technology for improving the quality of life for disabled people. However, maintaining the stability of multiunit neural recordings is made difficult by the nonstationary nature of neurons, which can affect the overall performance of proactive BMI control. Thus, regular recalibration is required to retrain the neural network decoder for proactive control. However, retraining may lead to changes in the network parameters, such as the network topology. For a hardware implementation of the neural decoder that supports real-time and low-power processing, it takes time to modify or redesign the hardware accelerator. Consequently, handling engineering changes to the low-power hardware design requires substantial human resources and time. To address this design challenge, this work proposes AHEAD: an automatic holistic energy-aware design methodology for multilayer perceptron (MLP) neural network hardware generation in proactive BMI edge devices. By taking a holistic view of the proactive BMI design flow, the approach makes judicious use of intelligent bit-width identification (BWID) and configurable hardware generation, which are integrated to generate the low-power hardware decoder automatically. The proposed AHEAD methodology begins with the trained MLP parameters and golden datasets and produces an efficient hardware design in terms of performance, power, and area (PPA) with the least loss of accuracy. The results show that the proposed methodology is up to 4X faster, consumes 3X less power, and achieves a 5X reduction in area resources, with exact accuracy, compared to floating-point and half-floating-point designs on a field-programmable gate array (FPGA), which makes it a promising design methodology for proactive BMI edge devices

    Mixed-Signal VLSI Implementation of CVNS Artificial Neural Networks

    In this work, a mixed-signal implementation of a Continuous Valued Number System (CVNS) neural network is proposed. The proposed network resolves the limited signal processing precision issue present in mixed-signal neural networks. This is realized by the CVNS addition, CVNS multiplication, and CVNS sigmoid function evaluation algorithms proposed in this dissertation. The proposed algorithms provide accurate results in a low-resolution environment. In addition, an area-efficient, low-sensitivity CVNS Madaline is proposed. The proposed Madaline is more robust to input and weight errors than previously developed structures, and its area consumption is lower. Furthermore, a new approximation scheme for the hyperbolic tangent activation function is proposed. Using the proposed approximation scheme results in an efficient implementation of digital ASIC neural networks in terms of area, delay and power consumption

    Long Short Term Based Memory Hardware Prefetcher

    Hardware prefetching is an efficient way to hide the cache miss penalty caused by long memory access latency. Accuracy, coverage, and timeliness are the three primary metrics for evaluating a hardware prefetcher design. Highly accurate hardware prefetchers are required to predict complex memory access patterns in multicore systems. In this paper, we propose a long short term memory (LSTM) prefetcher, a neural network based hardware prefetcher that achieves high prefetch accuracy and coverage while improving prefetch timeliness. The proposed LSTM prefetcher achieves higher accuracy and coverage by training neural networks to predict long memory access patterns. It improves timeliness in two ways. First, multiple prefetches can be issued on a single cache access. Second, a simple Next-N-Line prefetcher is integrated with the LSTM prefetcher to accelerate predictions when good spatial locality exists. The proposed LSTM prefetcher is the first prefetcher design that uses a recurrent neural network. Three case studies are presented, which show that the proposed LSTM prefetcher achieves 98.6%, 83.5%, and 61% accuracy, respectively, while the state-of-the-art variable length delta prefetcher (VLDP) achieves 0%, 75%, and 26.6% accuracy in predicting the sequences in the case studies
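
    The Next-N-Line component mentioned in this abstract is simple enough to sketch in a few lines: on an access, it requests the N cache lines that follow the accessed one. The sketch below covers only that baseline, not the LSTM predictor, and the function name and the 64-byte default line size are assumptions rather than details from the paper.

```python
def next_n_line_prefetch(address: int, n: int, line_size: int = 64) -> list[int]:
    """Return the base addresses of the n cache lines after the accessed line."""
    # Align the access down to the start of its cache line.
    line_base = (address // line_size) * line_size
    return [line_base + k * line_size for k in range(1, n + 1)]


if __name__ == "__main__":
    # An access inside the line at 0x1000 triggers prefetches of the next two lines.
    print([hex(a) for a in next_n_line_prefetch(0x1004, 2)])
```

    Such a prefetcher needs no training and responds immediately, which is why it pays off mainly when accesses show good spatial locality.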

    A versatile, scalable, and open memory architecture in CMOS 0.18 μm

    A lookup table is a permanent memory storage element in which every stored value corresponds to a unique address. Range addressable lookup tables differ in that every stored value corresponds to a range of addresses. This type of memory has important applications in a recently proposed central processing unit which employs a multi-digit logarithmic number system that is well suited for digital signal processing applications. This thesis details the work done to improve range addressable lookup tables in terms of operating speed and area utilization. Two range addressable lookup table designs are proposed. Ideal design parameters are determined. An integrated circuit test platform is proposed to determine the real-world ability of these lookup tables. A case study exploring how non-linear functions can be approximated with range addressable lookup tables is presented
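
    The difference between an ordinary and a range addressable lookup table can be shown with a small software model. The sketch below stores one value per address range and finds the matching range with a binary search; it models the behaviour only and says nothing about the CMOS designs in the thesis. The class name `RangeLUT` and its interface are hypothetical.

```python
from bisect import bisect_right


class RangeLUT:
    """Software model of a range addressable lookup table."""

    def __init__(self, starts, values):
        # starts[i] is the first address of range i; range i ends just
        # before starts[i + 1], and the last range is open-ended.
        starts = list(starts)
        values = list(values)
        if len(starts) != len(values) or starts != sorted(set(starts)):
            raise ValueError("starts must be strictly increasing and match values")
        self.starts = starts
        self.values = values

    def lookup(self, address):
        """Return the value stored for the range that contains address."""
        i = bisect_right(self.starts, address) - 1
        if i < 0:
            raise KeyError(address)
        return self.values[i]


if __name__ == "__main__":
    lut = RangeLUT([0, 16, 64], ["a", "b", "c"])
    print(lut.lookup(15), lut.lookup(16), lut.lookup(1000))
```

    Three entries cover the whole address space here, whereas a conventional table would need one entry per address; for a non-linear function, narrow ranges can be placed where the function changes quickly and wide ones where it is nearly flat.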