52 research outputs found

    Revisiting the high-performance reconfigurable computing for future datacenters

    Get PDF
    Modern datacenters are reinforcing the computational power and energy efficiency by assimilating field programmable gate arrays (FPGAs). The sustainability of this large-scale integration depends on enabling multi-tenant FPGAs. This requisite amplifies the importance of communication architecture and virtualization method with the required features in order to meet the high-end objective. Consequently, in the last decade, academia and industry proposed several virtualization techniques and hardware architectures for addressing resource management, scheduling, adoptability, segregation, scalability, performance-overhead, availability, programmability, time-to-market, security, and mainly, multitenancy. This paper provides an extensive survey covering three important aspects-discussion on non-standard terms used in existing literature, network-on-chip evaluation choices as a mean to explore the communication architecture, and virtualization methods under latest classification. The purpose is to emphasize the importance of choosing appropriate communication architecture, virtualization technique and standard language to evolve the multi-tenant FPGAs in datacenters. None of the previous surveys encapsulated these aspects in one writing. Open problems are indicated for scientific community as well

    Improving Performance Estimation for Design Space Exploration for Convolutional Neural Network Accelerators

    Get PDF
    Contemporary advances in neural networks (NNs) have demonstrated their potential in different applications such as in image classification, object detection or natural language processing. In particular, reconfigurable accelerators have been widely used for the acceleration of NNs due to their reconfigurability and efficiency in specific application instances. To determine the configuration of the accelerator, it is necessary to conduct design space exploration to optimize the performance. However, the process of design space exploration is time consuming because of the slow performance evaluation for different configurations. Therefore, there is a demand for an accurate and fast performance prediction method to speed up design space exploration. This work introduces a novel method for fast and accurate estimation of different metrics that are of importance when performing design space exploration. The method is based on a Gaussian process regression model parametrised by the features of the accelerator and the target NN to be accelerated. We evaluate the proposed method together with other popular machine learning based methods in estimating the latency and energy consumption of our implemented accelerator on two different hardware platforms targeting convolutional neural networks. We demonstrate improvements in estimation accuracy, without the need for significant implementation effort or tuning

    An Scalable matrix computing unit architecture for FPGA and SCUMO user design interface

    Get PDF
    High dimensional matrix algebra is essential in numerous signal processing and machine learning algorithms. This work describes a scalable square matrix-computing unit designed on the basis of circulant matrices. It optimizes data flow for the computation of any sequence of matrix operations removing the need for data movement for intermediate results, together with the individual matrix operations' performance in direct or transposed form (the transpose matrix operation only requires a data addressing modification). The allowed matrix operations are: matrix-by-matrix addition, subtraction, dot product and multiplication, matrix-by-vector multiplication, and matrix by scalar multiplication. The proposed architecture is fully scalable with the maximum matrix dimension limited by the available resources. In addition, a design environment is also developed, permitting assistance, through a friendly interface, from the customization of the hardware computing unit to the generation of the final synthesizable IP core. For N x N matrices, the architecture requires N ALU-RAM blocks and performs O(N*N), requiring N*N +7 and N +7 clock cycles for matrix-matrix and matrix-vector operations, respectively. For the tested Virtex7 FPGA device, the computation for 500 x 500 matrices allows a maximum clock frequency of 346 MHz, achieving an overall performance of 173 GOPS. This architecture shows higher performance than other state-of-the-art matrix computing units

    FPGA-Based Processor Acceleration for Image Processing Applications

    Get PDF
    FPGA-based embedded image processing systems offer considerable computing resources but present programming challenges when compared to software systems. The paper describes an approach based on an FPGA-based soft processor called Image Processing Processor (IPPro) which can operate up to 337 MHz on a high-end Xilinx FPGA family and gives details of the dataflow-based programming environment. The approach is demonstrated for a k-means clustering operation and a traffic sign recognition application, both of which have been prototyped on an Avnet Zedboard that has Xilinx Zynq-7000 system-on-chip (SoC). A number of parallel dataflow mapping options were explored giving a speed-up of 8 times for the k-means clustering using 16 IPPro cores, and a speed-up of 9.6 times for the morphology filter operation of the traffic sign recognition using 16 IPPro cores compared to their equivalent ARM-based software implementations. We show that for k-means clustering, the 16 IPPro cores implementation is 57, 28 and 1.7 times more power efficient (fps/W) than ARM Cortex-A7 CPU, nVIDIA GeForce GTX980 GPU and ARM Mali-T628 embedded GPU respectively

    Recent Advances in Embedded Computing, Intelligence and Applications

    Get PDF
    The latest proliferation of Internet of Things deployments and edge computing combined with artificial intelligence has led to new exciting application scenarios, where embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately contribute to the fostering of the offloading processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems

    Fault tolerance in WBAN applications

    Get PDF
    One of the most promising applications of IoT is Wireless Body Area Net-works (WBANs) in medical applications. They allow physiological signals monitoring of patients without the presence of nearby medical personnel. Furthermore, WBANs enable feedback action to be taken either periodically or event-based following the Networked Control Systems (NCSs) techniques. This thesis first presents the architecture of a fault tolerant WBAN. Sensors data are sent over two redundant paths to be processed, analyzed and monitored. The two main communication protocols utilized in this system are Low power Wi-Fi (IEEE 802.11n) and Long Term Evolution (LTE). Riverbed Modeler is used to study the system’s behavior. Simulation results are collected with 95% confidence analysis on 33 runs on different initial seeds. It is proven that the system is fully operational. It is then shown that the system can withstand interference and system’s performance is quantified. Results indicate that the system succeeds in meeting all required control criteria in the presence of two different interference models. The second contribution of this thesis is the design of an FPGA-based smart band for health monitoring applications in WBANs. This FPGA-based smart band has a softcore processor and its allocated SRAM block as well as auxiliary modules. A novel scheme for full initial configuration and Dynamic Partial Reconfiguration through the WLAN network is integrated into this design. Fault tolerance techniques are used to mitigate transient faults such as Single Event Upsets (SEUs) and Multiple Event Upsets (MEUs). The system is studied in a normal environment as well as in a harsh environment. System availability is then obtained using Markov Models and a case study is presented

    Image Processing Using FPGAs

    Get PDF
    This book presents a selection of papers representing current research on using field programmable gate arrays (FPGAs) for realising image processing algorithms. These papers are reprints of papers selected for a Special Issue of the Journal of Imaging on image processing using FPGAs. A diverse range of topics is covered, including parallel soft processors, memory management, image filters, segmentation, clustering, image analysis, and image compression. Applications include traffic sign recognition for autonomous driving, cell detection for histopathology, and video compression. Collectively, they represent the current state-of-the-art on image processing using FPGAs

    Efficient Procedure Improving Precision of High Conditioned Matrices in Electronic Circuits Analysis

    Get PDF
    In this article, we propose several improvements that could be done to SPICE simulator. The first is based on a functional implementation of device models. The advantages of functional implementation are demonstrated on basic Shichman-Hodges model of MOS transistor. It starts with a description of primary algorithms used in SPICE simulator for the solution of circuits with nonlinear devices and identify the problems that can occur during simulations.Main part of the article is devoted to improved factorization procedure for simulation of the nonlinear electronic circuits. The primary intention of the proposed method is to increase final precision of the result in a case of high condition linear systems. The procedure is based on a use of the iterative methods for solution of nonlinear and linear equations. Utilizing those methods for one iterative process helps to reduce memory consumption during simulation computation, and it can significantly improve simulation precision. The procedure allows to use enumeration with definable precision in a very efficient way
    corecore