444 research outputs found

    Evolutionary Cell Aided Design for Neural Network Architectures

    Full text link
    Mathematical theory shows that multilayer feedforward Artificial Neural Networks (ANNs) are universal function approximators, capable of approximating any measurable function to any desired degree of accuracy. In practice, designing practical and efficient neural network architectures requires significant effort and expertise. We present a novel software framework called Evolutionary Cell Aided Design (ECAD), meant to aid in the exploration and design of efficient Neural Network Architectures (NNAs) for reconfigurable hardware. Given a general neural network structure and a set of constraints and fitness functions, the framework explores both the space of possible NNAs and the space of possible hardware designs using evolutionary algorithms, and attempts to find the fittest co-design solutions according to a predefined set of goals. We test the framework on an image classification task, using the MNIST data set of handwritten digits with an Intel Arria 10 GX 1150 device as our target platform. We design and implement a modular and scalable 2D systolic array with enhancements for machine learning that can be used by the framework for the hardware search space. Our results demonstrate the ability to pair neural network design and hardware development using an evolutionary algorithm, removing traditional human-in-the-loop development tasks. By running various experiments on the fittest solutions for neural network and hardware searches, we demonstrate the full end-to-end capabilities of the ECAD framework.
    Comment: Text and image edit
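
    A minimal sketch of the kind of evolutionary co-design loop the abstract describes, in Python. The candidate encoding (layer widths plus systolic-array dimensions) and the toy fitness function are illustrative assumptions, not the ECAD implementation:

        # Minimal sketch of an evolutionary co-design search loop in the spirit
        # of ECAD; candidate encoding and fitness are illustrative assumptions.
        import random

        def fitness(cand):
            # Toy stand-in: reward wider layers (proxy for accuracy) and
            # penalize large systolic arrays (proxy for FPGA area/power).
            accuracy_proxy = sum(cand["layers"]) / 1000.0
            hw_cost = cand["rows"] * cand["cols"] / 256.0
            return accuracy_proxy - 0.5 * hw_cost

        def mutate(cand):
            child = dict(cand)
            key = random.choice(["layers", "rows", "cols"])
            if key == "layers":
                child["layers"] = list(child["layers"])
                i = random.randrange(len(child["layers"]))
                child["layers"][i] = max(8, child["layers"][i] + random.choice([-16, 16]))
            else:
                child[key] = max(2, child[key] + random.choice([-2, 2]))
            return child

        def evolve(generations=50, pop_size=20):
            pop = [{"layers": [64, 64], "rows": 8, "cols": 8} for _ in range(pop_size)]
            for _ in range(generations):
                pop.sort(key=fitness, reverse=True)
                survivors = pop[: pop_size // 2]  # truncation selection
                pop = survivors + [mutate(random.choice(survivors))
                                   for _ in range(pop_size - len(survivors))]
            return max(pop, key=fitness)

        print(evolve())

    A real co-design flow would replace the fitness proxies with actual training accuracy and synthesized hardware metrics, which is what makes the search expensive in practice.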

    PIRT: A Runtime Framework to Enable Energy-Efficient Real-Time Robotic Applications on Heterogeneous Architectures

    Full text link
    Enabling full robotic workloads with diverse behaviors on mobile systems with stringent resource and energy constraints remains a challenge. In recent years, attempts have been made to deploy single-accelerator-based computing platforms (such as GPU, DSP, or FPGA) to address this challenge, but with little success. The core problem is two-fold: first, different robotic tasks require different accelerators, and second, managing multiple accelerators simultaneously is overwhelming for developers. In this paper, we propose PIRT, the first robotic runtime framework to efficiently manage dynamic task executions on mobile systems with multiple accelerators, as well as on the cloud, to achieve better performance and energy savings. With PIRT, we enable a robot to simultaneously perform autonomous navigation with localization at 25 FPS, obstacle detection at 3 FPS, route planning, large map generation, and scene understanding, traveling at a maximum speed of 5 miles per hour, all within an 11 W computing power envelope.
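
    A hypothetical sketch of how a runtime like PIRT might map tasks to accelerators under a power budget. The task table, device names, profiled numbers, and greedy policy are all invented for illustration; this is not PIRT's actual scheduler:

        # Per-device estimated (latency_ms, power_w) for each task kind;
        # the numbers below are made up for the sketch.
        PROFILE = {
            "localization": {"fpga": (40, 3.0), "gpu": (25, 8.0), "dsp": (60, 1.5)},
            "detection":    {"fpga": (300, 2.5), "gpu": (150, 9.0)},
            "planning":     {"cpu": (50, 2.0), "cloud": (20, 0.5)},
        }

        def assign(tasks, power_budget_w):
            """Greedy policy: pick the lowest-power device that meets each
            task's deadline, tracking the total power envelope."""
            used = 0.0
            plan = {}
            for name, deadline_ms in tasks:
                options = sorted(PROFILE[name].items(), key=lambda kv: kv[1][1])
                for dev, (lat, pw) in options:
                    if lat <= deadline_ms and used + pw <= power_budget_w:
                        plan[name] = dev
                        used += pw
                        break
                else:
                    raise RuntimeError(f"no feasible device for {name}")
            return plan, used

        # 25 FPS localization implies a 40 ms deadline; 3 FPS implies ~333 ms.
        print(assign([("localization", 40), ("detection", 333), ("planning", 100)], 11.0))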

    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    Get PDF
    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated in the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.
    Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

    Parallel Programming Models for Heterogeneous Many-Cores: A Survey

    Full text link
    Heterogeneous many-cores are now an integral part of modern computing systems ranging from embedded systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient, high-performance computing, that potential can only be unlocked if the application programs are suitably parallel and can be made to match the underlying heterogeneous platform. In this article, we provide a comprehensive survey of parallel programming models for heterogeneous many-core architectures and review compilation techniques for improving programmability and portability. We examine various software optimization techniques for minimizing the communication overhead between heterogeneous computing devices. We provide a road map across a wide variety of research areas. We conclude with a discussion of open issues in the area and potential research directions. This article provides both an accessible introduction to the fast-moving area of heterogeneous programming and a detailed bibliography of its main achievements.
    Comment: Accepted to be published at CCF Transactions on High Performance Computin

    From DNNs to GANs: Review of efficient hardware architectures for deep learning

    Full text link
    In recent times, the trend in the very large scale integration (VLSI) industry is multi-dimensional: reduced energy consumption, smaller footprint, precise results, lower power dissipation, and faster response. To meet these needs, hardware architectures must be reliable and robust against these problems. Recently, neural networks and deep learning have begun to significantly impact the research paradigm; these models involve parameters on the order of millions, nonlinear activation functions, convolutional operations for feature extraction, regression for classification, and generative adversarial networks. These operations entail huge computation and memory overhead. Presently available DSP processors are incapable of performing these operations efficiently and mostly suffer from memory overhead, performance drops, and compromised accuracy. Moreover, if a large silicon area is powered to accelerate the operations using parallel computation, the ICs face a significant risk of burning out due to the considerable heat generated. Hence, the dark silicon constraint has been developed to reduce heat dissipation without sacrificing accuracy. Similarly, different algorithms have been adapted to design DSP processors suited to fast performance on neural networks, activation functions, convolutional neural networks, and generative adversarial networks. In this review, we illustrate the recent developments in hardware for accelerating the efficient implementation of deep learning networks with enhanced performance. The techniques investigated in this review are expected to direct future research challenges of hardware optimization for high-performance computation.
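
    A minimal numpy sketch of the two core operations the review discusses, a convolution for feature extraction followed by a nonlinear activation. The shapes and the ReLU choice are illustrative assumptions:

        # 2D convolution (valid padding) followed by ReLU; the naive double
        # loop makes the per-output-pixel weighted sum explicit.
        import numpy as np

        def conv2d_relu(image, kernel):
            kh, kw = kernel.shape
            oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
            out = np.empty((oh, ow))
            for i in range(oh):          # each output pixel is a weighted sum
                for j in range(ow):      # of its kh x kw input neighborhood
                    out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
            return np.maximum(out, 0.0)  # ReLU activation

        img = np.random.rand(28, 28)
        k = np.random.randn(3, 3)
        print(conv2d_relu(img, k).shape)  # (26, 26)

    The multiply-accumulate pattern inside the inner loop is exactly the workload that the surveyed hardware accelerators parallelize.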

    FeCaffe: FPGA-enabled Caffe with OpenCL for Deep Learning Training and Inference on Intel Stratix 10

    Full text link
    Deep learning and Convolutional Neural Networks (CNNs) have become increasingly popular and important in both academia and industry in recent years, because they provide better accuracy and results in classification, detection, and recognition tasks compared to traditional approaches. Currently, there are many popular frameworks for deep learning development, such as Caffe, TensorFlow, and PyTorch, and most frameworks natively support the CPU and consider the GPU the mainline accelerator by default. The FPGA, viewed as a potential heterogeneous platform, still lacks comprehensive support for CNN development in popular frameworks, particularly for the training phase. In this paper, we first propose FeCaffe, i.e., FPGA-enabled Caffe, a hierarchical software and hardware design methodology based on Caffe that enables FPGAs to support mainline deep learning development features, e.g., training and inference with Caffe. Furthermore, we provide benchmarks with FeCaffe, taking some classical CNN networks as examples, and analyze kernel execution times in detail. Finally, optimization directions spanning FPGA kernel design, system pipelining, network architecture, user applications, and the heterogeneous platform level are proposed to improve FeCaffe performance and efficiency. The results demonstrate that the proposed FeCaffe is capable of supporting almost all features during CNN network training and inference, with a high degree of design flexibility, extensibility, and reusability for deep learning development. Compared to prior studies, our architecture can support more networks and training settings, and the current configuration achieves 6.4x and 8.4x average execution time improvements for the forward and backward passes, respectively, for LeNet.
    Comment: 11 pages, 7 figures and 4 table
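
    A hypothetical sketch of how an FPGA-backed layer could plug into a Caffe-style forward/backward interface. The device object, kernel names, and layer class are invented for illustration and are not FeCaffe's actual API; a numpy stub stands in for the OpenCL-programmed FPGA:

        import numpy as np

        class StubFpgaDevice:
            """Stands in for an OpenCL FPGA kernel runner; uses numpy here."""
            def run(self, kernel, *args):
                if kernel == "fc_forward":
                    x, w = args
                    return x @ w
                if kernel == "fc_backward_w":
                    grad, x = args
                    return x.T @ grad
                if kernel == "fc_backward_data":
                    grad, w = args
                    return grad @ w.T
                raise KeyError(kernel)

        class FpgaFcLayer:
            """Fully connected layer that offloads both training passes."""
            def __init__(self, device, in_dim, out_dim):
                self.device = device
                self.w = np.random.randn(in_dim, out_dim) * 0.01

            def forward(self, x):
                self.x = x  # cache input for the backward pass
                return self.device.run("fc_forward", x, self.w)

            def backward(self, grad, lr=0.01):
                self.w -= lr * self.device.run("fc_backward_w", grad, self.x)
                return self.device.run("fc_backward_data", grad, self.w)

        layer = FpgaFcLayer(StubFpgaDevice(), 8, 4)
        out = layer.forward(np.random.randn(2, 8))
        _ = layer.backward(np.ones_like(out))

    The point of the structure is the one the abstract makes: supporting training, not just inference, means the backward kernels must be offloaded too.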

    CNN2Gate: Toward Designing a General Framework for Implementation of Convolutional Neural Networks on FPGA

    Full text link
    Convolutional Neural Networks (CNNs) have a major impact on our society because of the numerous services they provide. On the other hand, they require considerable computing power. To satisfy these requirements, it is possible to use graphics processing units (GPUs). However, high power consumption and limited external IOs constrain their usability and suitability in industrial and mission-critical scenarios. Recently, the number of studies that use FPGAs to implement CNNs has been increasing rapidly, owing to the lower power consumption and easy reconfigurability offered by these platforms. As a result of the research efforts put into topics such as architecture, synthesis, and optimization, new challenges are arising in integrating such hardware solutions with high-level machine learning software libraries. This paper introduces an integrated framework (CNN2Gate) that supports compilation of a CNN model for an FPGA target. CNN2Gate exploits the OpenCL synthesis workflow for FPGAs offered by commercial vendors. CNN2Gate is capable of parsing CNN models from several popular high-level machine learning libraries such as Keras, PyTorch, and Caffe2. CNN2Gate extracts the computation flow of layers, along with weights and biases, and applies a "given" fixed-point quantization. Furthermore, it writes this information in the proper format for the OpenCL synthesis tools that are then used to build and run the project on the FPGA. CNN2Gate performs design-space exploration using a reinforcement learning agent and automatically fits the design onto different FPGAs with limited logic resources. This paper reports results of automatic synthesis and design-space exploration of AlexNet and VGG-16 on various Intel FPGA platforms. CNN2Gate achieves a latency of 205 ms for VGG-16 and 18 ms for AlexNet on the FPGA.
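
    A minimal sketch of the kind of "given" fixed-point quantization a flow like this applies to weights before hardware synthesis. The Q-format choice (Q4.12 in 16 bits) is an illustrative assumption:

        import numpy as np

        def to_fixed_point(weights, int_bits=4, frac_bits=12):
            """Quantize float weights to signed fixed-point integers."""
            scale = 1 << frac_bits
            lo = -(1 << (int_bits + frac_bits - 1))
            hi = (1 << (int_bits + frac_bits - 1)) - 1
            return np.clip(np.round(weights * scale), lo, hi).astype(np.int32)

        def from_fixed_point(q, frac_bits=12):
            return q.astype(np.float64) / (1 << frac_bits)

        w = np.random.randn(3, 3)
        wq = to_fixed_point(w)
        # Absent clipping, the rounding error is at most 2**-(frac_bits+1).
        print(np.max(np.abs(w - from_fixed_point(wq))))

    On the FPGA side, the integer representation lets the multiply-accumulate datapath use narrow DSP blocks instead of floating-point units, which is where the area and power savings come from.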

    If-Conversion Optimization using Neuro Evolution of Augmenting Topologies

    Full text link
    Control-flow dependence is an intrinsic limiting factor for program acceleration. With the availability of instruction-level parallel architectures, if-conversion optimization has therefore become pivotal for extracting parallelism from serial programs. While many if-conversion optimization heuristics have been proposed in the literature, most of them consider rigid criteria regardless of the underlying hardware and input programs. In this paper, we propose a novel if-conversion scheme that performs an efficient if-conversion transformation using a machine learning technique (NEAT). This method enables if-conversion customization over all branches within a program, unlike prior work that considered individual branches. Our technique also provides the flexibility required when compiling for heterogeneous systems. The efficacy of our approach is shown by experiments whose results illustrate that programs can be accelerated on the same architecture without modifying the original code. Our technique applies to general-purpose programming languages (e.g., C/C++) and is transparent to the programmer. We implemented our technique in the LLVM 3.6.1 compilation infrastructure and experimented on the kernels of the SPEC CPU2006 v1.1 benchmark suite running on a multicore system of Intel(R) Xeon(R) 3.50 GHz processors. Our findings show a performance gain of up to 8.6% over the standard optimized code (LLVM -O2 with if-conversion included), indicating the need for an if-conversion compilation optimization that can adapt to the unique characteristics of every individual branch.
    Comment: Part of the Program Transformation for Programmability in Heterogeneous Architectures (PROHA) workshop, Barcelona, Spain, 12th March 2016, 6 pages, LaTeX, 2 PDF figure
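
    A toy illustration of what if-conversion itself does, separate from the paper's NEAT-driven policy for deciding where to apply it: a data-dependent branch is replaced by computing both arms and selecting with a predicate, which lets ILP/SIMD hardware proceed without branch misprediction:

        import numpy as np

        def branchy(xs):
            out = np.empty_like(xs)
            for i, x in enumerate(xs):
                if x > 0.0:          # unpredictable branch in the hot loop
                    out[i] = x * 2.0
                else:
                    out[i] = x - 1.0
            return out

        def if_converted(xs):
            # Both arms are evaluated; the predicate selects per element.
            return np.where(xs > 0.0, xs * 2.0, xs - 1.0)

        xs = np.random.randn(8)
        assert np.allclose(branchy(xs), if_converted(xs))

    The trade-off the heuristics weigh is visible even here: the converted form always pays for both arms, so it wins only when the branch is costly or poorly predicted, which is why a per-branch, hardware-aware policy is attractive.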

    Automatic Loop Tuning and Memory Management for Stencil Computations

    Get PDF
    The Texas Instruments C66x Digital Signal Processor (DSP) is an embedded processor technology targeted at real-time signal processing. It has also been developed with high potential to become the new generation of coprocessor technology for high-performance embedded computing. Of particular interest is its performance on stencil computations, such as those found in signal processing and computer vision tasks. A stencil is a loop in which the output value at each position of an array is updated with a weighted function of its neighbors. Efficiently mapping stencil-based kernels to the C66x device presents two challenges. The first is how to optimize loops to facilitate the use of Single Instruction Multiple Data (SIMD) instructions; on this architecture, like most others, SIMD instructions are not directly generated by the compiler. The second is how to manage on-chip memory in a way that minimizes off-chip memory access. Although this could theoretically be achieved with a highly associative cache, the high rate of data reuse in stencil loops causes a high conflict-miss rate. One way to solve this problem is to configure the on-chip memory as a program-controlled scratchpad, which allows the user to buffer a 2D block of data and minimizes off-chip data access. For this dissertation, we accomplished two goals: (1) developed a methodology for optimizing arbitrary 2D stencils that fully utilizes SIMD instructions through microarchitecture-aware loop unrolling, and (2) delivered an easy-to-use scratchpad buffer management system and used it to improve the memory efficiency of 2D stencils. We show in the results and analysis section that our stencil compiler achieves up to a 2x speedup over code generated by the industry-standard compiler developed by Texas Instruments, and our memory management system achieves up to a 10x speedup compared with using the cache.
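
    A minimal numpy sketch of a 2D stencil and of the scratchpad-style 2D blocking the dissertation describes; the 3x3 weights and block size are illustrative assumptions:

        import numpy as np

        W = np.array([[0.0, 0.2, 0.0],
                      [0.2, 0.2, 0.2],
                      [0.0, 0.2, 0.0]])  # example 5-point weighted stencil

        def stencil(block):
            """Each output point is a weighted sum of its 3x3 neighborhood."""
            h, w = block.shape[0] - 2, block.shape[1] - 2
            out = np.zeros((h, w))
            for di in range(3):
                for dj in range(3):
                    out += W[di, dj] * block[di:di+h, dj:dj+w]
            return out

        def blocked_stencil(img, bs=64):
            """Process the image in 2D tiles plus a 1-pixel halo, mimicking
            buffering each block in on-chip scratchpad memory."""
            h, w = img.shape
            out = np.zeros((h - 2, w - 2))
            for i in range(0, h - 2, bs):
                for j in range(0, w - 2, bs):
                    tile = img[i:min(i+bs, h-2)+2, j:min(j+bs, w-2)+2]
                    out[i:i+tile.shape[0]-2, j:j+tile.shape[1]-2] = stencil(tile)
            return out

        img = np.random.rand(200, 200)
        assert np.allclose(blocked_stencil(img), stencil(img))

    Because every input element is reused by up to nine neighboring outputs, keeping a whole tile resident on chip avoids the repeated off-chip fetches (and cache conflict misses) that the unblocked loop would incur.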

    Image Processing Using FPGAs

    Get PDF
    This book presents a selection of papers representing current research on using field programmable gate arrays (FPGAs) for realising image processing algorithms. These papers are reprints of papers selected for a Special Issue of the Journal of Imaging on image processing using FPGAs. A diverse range of topics is covered, including parallel soft processors, memory management, image filters, segmentation, clustering, image analysis, and image compression. Applications include traffic sign recognition for autonomous driving, cell detection for histopathology, and video compression. Collectively, they represent the current state-of-the-art on image processing using FPGAs.