Evolutionary Cell Aided Design for Neural Network Architectures
Mathematical theory shows us that multilayer feedforward Artificial Neural
Networks (ANNs) are universal function approximators, capable of approximating
any measurable function to any desired degree of accuracy. In practice,
designing practical and efficient neural network architectures requires
significant effort and expertise. We present a novel software framework called
Evolutionary Cell Aided Design (ECAD) meant to aid in the exploration and design
of efficient Neural Network Architectures (NNAs) for reconfigurable hardware.
Given a general neural network structure and a set of constraints and fitness
functions, the framework will explore both the space of possible NNAs and the
space of possible hardware designs, using evolutionary algorithms, and attempt
to find the fittest co-design solutions according to a predefined set of goals.
We test the framework on an image classification task and use the MNIST data
set of handwritten digits with an Intel Arria 10 GX 1150 device as our target
platform. We design and implement a modular and scalable 2D systolic array with
enhancements for machine learning that can be used by the framework for the
hardware search space. Our results demonstrate the ability to pair neural
network design and hardware development together using an evolutionary
algorithm while removing traditional human-in-the-loop development tasks. By
running various experiments of the fittest solutions for neural network and
hardware searches, we demonstrate the full end-to-end capabilities of the ECAD
framework.
Comment: Text and image edit
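The co-design search described in this abstract can be pictured with a minimal evolutionary loop. This is only a sketch under stated assumptions, not the ECAD implementation: the `Candidate` genome, the search-space bounds, the `max_pes` resource limit, and the toy `fitness` function are all hypothetical stand-ins for the framework's real constraints and fitness functions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    hidden_units: int  # hypothetical network-architecture gene
    rows: int          # hypothetical hardware gene: systolic array rows
    cols: int          # hypothetical hardware gene: systolic array columns


# Assumed search-space bounds, chosen only for illustration.
HIDDEN_RANGE = (8, 512)
DIM_RANGE = (1, 32)


def clamp(value, low, high):
    return max(low, min(high, value))


def random_candidate(rng):
    return Candidate(
        rng.randint(*HIDDEN_RANGE),
        rng.randint(*DIM_RANGE),
        rng.randint(*DIM_RANGE),
    )


def mutate(c, rng):
    # Perturb every gene a little and keep it inside the search space.
    return Candidate(
        clamp(c.hidden_units + rng.randint(-32, 32), *HIDDEN_RANGE),
        clamp(c.rows + rng.randint(-2, 2), *DIM_RANGE),
        clamp(c.cols + rng.randint(-2, 2), *DIM_RANGE),
    )


def fitness(c, max_pes=256):
    # Toy fitness: designs that exceed the assumed processing-element
    # budget are infeasible; otherwise trade an accuracy proxy against
    # a latency proxy. A real flow would train and synthesize instead.
    pes = c.rows * c.cols
    if pes > max_pes:
        return float("-inf")
    accuracy_proxy = 1.0 - 1.0 / (1.0 + c.hidden_units / 64.0)
    latency_proxy = c.hidden_units / pes
    return accuracy_proxy - 0.01 * latency_proxy


def evolve(generations=30, population_size=20, elite=5, seed=0):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    for _ in range(generations):
        # Elitist selection: the fittest survive unchanged, the rest of
        # the population is refilled with mutated copies of them.
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(population_size - elite)
        ]
        population = parents + children
    return max(population, key=fitness)
```

Calling `evolve()` returns the fittest joint network and hardware candidate found. Because the elite are carried over unchanged, the best fitness never decreases from one generation to the next.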
PIRT: A Runtime Framework to Enable Energy-Efficient Real-Time Robotic Applications on Heterogeneous Architectures
Enabling full robotic workloads with diverse behaviors on mobile systems with
stringent resource and energy constraints remains a challenge. In recent years,
attempts have been made to deploy single-accelerator-based computing platforms
(such as GPU, DSP, or FPGA) to address this challenge, but with little success.
The core problem is two-fold: firstly, different robotic tasks require
different accelerators, and secondly, managing multiple accelerators
simultaneously is overwhelming for developers. In this paper, we propose PIRT,
the first robotic runtime framework to efficiently manage dynamic task
executions on mobile systems with multiple accelerators as well as on the cloud
to achieve better performance and energy savings. With PIRT, we enable a robot
to simultaneously perform autonomous navigation with 25 FPS of localization,
obstacle detection with 3 FPS, route planning, large map generation, and scene
understanding, traveling at a max speed of 5 miles per hour, all within an 11W
computing power envelope.
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
Parallel Programming Models for Heterogeneous Many-Cores : A Survey
Heterogeneous many-cores are now an integral part of modern computing systems
ranging from embedded systems to supercomputers. While heterogeneous many-core
design offers the potential for energy-efficient high performance, such
potential can only be unlocked if the application programs are suitably
parallel and can be made to match the underlying heterogeneous platform. In
this article, we provide a comprehensive survey of parallel programming models
for heterogeneous many-core architectures and review the compilation techniques
for improving programmability and portability. We examine various software
optimization techniques for minimizing the communication overhead between
heterogeneous computing devices. We provide a road map for a wide variety of
different research areas. We conclude with a discussion on open issues in the
area and potential research directions. This article provides both an
accessible introduction to the fast-moving area of heterogeneous programming
and a detailed bibliography of its main achievements.
Comment: Accepted to be published at CCF Transactions on High Performance
Computing
From DNNs to GANs: Review of efficient hardware architectures for deep learning
In recent times, the very large scale integration (VLSI) industry has pursued
several goals at once, for example, lower energy consumption, smaller area,
more precise results, lower power dissipation, and faster response. To meet
these needs, the hardware architecture should be reliable and robust to these
problems. Recently, neural networks and deep learning have begun to
significantly impact the present research paradigm. These models involve
parameters on the order of millions, nonlinear activation functions,
convolutional operations for feature extraction, regression for
classification, and generative adversarial networks. These operations involve
substantial computation and memory overhead. Presently available DSP
processors are incapable of performing these operations and mostly face
problems such as memory overhead, performance drops, and compromised accuracy.
Moreover, if a large silicon area is powered to accelerate the operations
using parallel computation, the ICs have a significant chance of burning out
due to the considerable heat generated. Hence, a novel dark silicon constraint
has been developed to reduce heat dissipation without sacrificing accuracy.
Similarly, different algorithms have been adapted to design DSP processors
capable of fast performance for neural networks, activation functions,
convolutional neural networks, and generative adversarial networks. In this
review, we illustrate the recent developments in hardware for accelerating the
efficient implementation of deep learning networks with enhanced performance.
The techniques investigated in this review are expected to guide future
research on hardware optimization for high-performance computation.
FeCaffe: FPGA-enabled Caffe with OpenCL for Deep Learning Training and Inference on Intel Stratix 10
Deep learning and Convolutional Neural Networks (CNNs) have become
increasingly popular and important in both academic and industrial areas
in recent years because they provide better accuracy and results in
classification, detection and recognition, compared to traditional
approaches. Currently, there are many popular frameworks on the market for deep
learning development, such as Caffe, TensorFlow, and PyTorch, and most
frameworks natively support CPUs and consider GPUs the mainline accelerator by
default. FPGA devices, viewed as a potential heterogeneous platform, still
cannot provide comprehensive support for CNN development in popular
frameworks, in particular for the training phase. In this paper, we first
propose FeCaffe, i.e. FPGA-enabled Caffe, a hierarchical software and
hardware design methodology based on Caffe that enables FPGAs to support
mainline deep learning development features, e.g. training and inference with
Caffe. Furthermore, we provide benchmarks of FeCaffe using several
classical CNNs as examples, and analyze kernel execution time in detail.
Finally, optimization directions at the FPGA kernel design, system pipeline,
network architecture, use-case application and heterogeneous platform levels
are proposed to improve FeCaffe performance and efficiency. The results
demonstrate that the proposed FeCaffe is capable of supporting nearly all
features of CNN training and inference with a high degree of design
flexibility, expansibility and reusability for deep learning development.
Compared to prior studies, our architecture can support more network and
training settings, and the current configuration achieves 6.4x and 8.4x
average execution time improvements for the forward and backward passes,
respectively, for LeNet.
Comment: 11 pages, 7 figures and 4 tables
CNN2Gate: Toward Designing a General Framework for Implementation of Convolutional Neural Networks on FPGA
Convolutional Neural Networks (CNNs) have a major impact on our society
because of the numerous services they provide. On the other hand, they require
considerable computing power. To satisfy these requirements, it is possible to
use graphic processing units (GPUs). However, high power consumption and
limited external IOs constrain their usability and suitability in industrial
and mission-critical scenarios. Recently, the number of studies that use
FPGAs to implement CNNs has been increasing rapidly. This is due to the lower
power consumption and easy reconfigurability offered by these platforms.
Because of the research efforts put into topics such as architecture,
synthesis and optimization, new challenges are arising in integrating such
hardware solutions into high-level machine learning software libraries. This
paper introduces an integrated framework (CNN2Gate) that supports compilation
of a CNN model for an FPGA target. CNN2Gate exploits the OpenCL synthesis
workflow for FPGAs offered by commercial vendors. CNN2Gate is capable of
parsing CNN models from several popular high-level machine learning libraries
such as Keras, PyTorch, Caffe2, etc. CNN2Gate extracts the computation flow of
layers, in addition to weights and biases, and applies a "given" fixed-point
quantization. Furthermore, it writes this information in the proper format for
OpenCL synthesis tools that are then used to build and run the project on the
FPGA.
CNN2Gate performs design-space exploration using a reinforcement learning agent
and fits the design on different FPGAs with limited logic resources
automatically. This paper reports results of automatic synthesis and
design-space exploration of AlexNet and VGG-16 on various Intel FPGA platforms.
CNN2Gate achieves a latency of 205 ms for VGG-16 and 18 ms for AlexNet on the
FPGA.
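The "given" fixed-point quantization mentioned in this abstract can be illustrated with a generic signed fixed-point scheme. This is a sketch, not CNN2Gate's actual code: the function names and the default `total_bits` and `frac_bits` values are assumptions chosen for the example.

```python
def quantize_fixed_point(values, total_bits=8, frac_bits=4):
    """Map floats to signed fixed-point integer codes, saturating on overflow.

    total_bits and frac_bits are hypothetical defaults; a real flow would
    take them from the user-supplied ("given") quantization settings.
    """
    scale = 1 << frac_bits                  # 2 ** frac_bits
    q_min = -(1 << (total_bits - 1))        # most negative representable code
    q_max = (1 << (total_bits - 1)) - 1     # most positive representable code
    codes = []
    for v in values:
        code = round(v * scale)             # Python rounds ties to even
        codes.append(max(q_min, min(q_max, code)))
    return codes


def dequantize_fixed_point(codes, frac_bits=4):
    """Recover the real value that each integer code represents."""
    scale = 1 << frac_bits
    return [c / scale for c in codes]
```

With the default 8-bit format and 4 fractional bits, representable values lie between -8.0 and 7.9375 in steps of 0.0625, and weights outside that range saturate rather than wrap around.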
If-Conversion Optimization using Neuro Evolution of Augmenting Topologies
Control-flow dependence is an intrinsic limiting factor for program
acceleration. With the availability of instruction-level parallel
architectures, if-conversion optimization has, therefore, become pivotal for
extracting parallelism from serial programs. While many if-conversion
optimization heuristics have been proposed in the literature, most of them
consider rigid criteria regardless of the underlying hardware and input
programs. In this paper, we propose a novel if-conversion scheme that performs
an efficient if-conversion transformation using a machine learning technique
(NEAT). This method enables if-conversion customization over all branches
within a program, unlike prior work, which considered individual branches. Our
technique also provides the flexibility required when compiling for
heterogeneous systems. The efficacy of our approach is shown by experiments
and reported results, which illustrate that the programs can be accelerated on
the same architecture and without modifying the original code. Our technique
applies to general-purpose programming languages (e.g. C/C++) and is
transparent to the programmer. We implemented our technique in the LLVM 3.6.1
compilation infrastructure and experimented on the kernels of the SPEC-CPU2006
v1.1 benchmark suite running on a multicore system of Intel(R) Xeon(R) 3.50GHz
processors. Our findings show a performance gain of up to 8.6% over the
standard optimized code (LLVM -O2 with if-conversion included), indicating the
need for if-conversion compilation optimization that can adapt to the unique
characteristics of every individual branch.
Comment: Part of the Program Transformation for Programmability in
Heterogeneous Architectures (PROHA) workshop, Barcelona, Spain, 12th March
2016, 6 pages, LaTeX, 2 PDF figures
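For readers unfamiliar with if-conversion, the idea is to replace a control dependence (a branch) with a data dependence (a predicate that selects between values). The sketch below shows the transformation at source level only; it is not the NEAT-based scheme from the abstract, and the function names are hypothetical. A compiler such as LLVM applies the equivalent rewrite on its intermediate representation using select or predicated instructions, so Python itself gains nothing from it.

```python
def clip_with_branch(x, limit):
    # Control dependence: which return executes depends on the branch.
    if x > limit:
        return limit
    return x


def clip_if_converted(x, limit):
    # Data dependence: the condition becomes a 0/1 predicate, both
    # candidate values are available, and arithmetic selects one.
    predicate = int(x > limit)
    return predicate * limit + (1 - predicate) * x
```

Both functions return the same result for every integer input; only the second form has no branch in its body.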
Automatic Loop Tuning and Memory Management for Stencil Computations
The Texas Instruments C66x Digital Signal Processor (DSP) is an embedded processor technology that is targeted at real-time signal processing. It also has high potential to become the new generation of coprocessor technology for high-performance embedded computing. Of particular interest is its performance for stencil computations, such as those found in signal processing and computer vision tasks. A stencil is a loop in which the output value is updated at each position of an array by taking a weighted function of its neighbors. Efficiently mapping stencil-based kernels to the C66x device presents two challenges. The first is how to efficiently optimize loops in order to facilitate the usage of Single Instruction Multiple Data (SIMD) instructions. On this architecture, like most others, SIMD instructions are not directly generated by the compiler. The second is how to manage on-chip memory in a way that minimizes off-chip memory access. Although this could theoretically be achieved by using a highly associative cache, the high rate of data reuse in stencil loops causes a high conflict miss rate. One way to solve this problem is to configure the on-chip memory as a program-controlled scratchpad. This allows the user to buffer a 2D block of data and minimizes off-chip data access. For this dissertation, we have accomplished two goals: (1) Develop a methodology for optimization of arbitrary 2D stencils that fully utilizes SIMD instructions through microarchitecture-aware loop unrolling. (2) Deliver an easy-to-use scratchpad buffer management system and use it to improve the memory efficiency for 2D stencils. We show in the results and analysis section that our stencil compiler is able to achieve up to 2x speedup compared with the code generated by the industry-standard compiler developed by Texas Instruments, and our memory management system is able to achieve up to 10x speedup compared with cache.
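The stencil definition in this abstract can be made concrete with a plain 3x3 weighted stencil over a 2D array. This is a minimal reference sketch, not the dissertation's compiler output: the function name, the 3x3 window size, and the choice to copy border values unchanged are assumptions, and no SIMD or scratchpad optimization is attempted.

```python
def stencil_2d(grid, weights):
    """Apply a 3x3 weighted stencil to every interior point of grid.

    grid is a list of equal-length rows; weights is a 3x3 list of
    coefficients. Border points are copied through unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]          # separate output buffer
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            acc = 0.0
            # Weighted sum over the 3x3 neighborhood centered at (i, j).
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    acc += weights[di + 1][dj + 1] * grid[i + di][j + dj]
            out[i][j] = acc
    return out
```

Each output point reads nine inputs, and neighboring outputs share six of them. That overlap is the data reuse the abstract refers to when it discusses buffering 2D blocks on chip.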
Image Processing Using FPGAs
This book presents a selection of papers representing current research on using field programmable gate arrays (FPGAs) for realising image processing algorithms. These papers are reprints of papers selected for a Special Issue of the Journal of Imaging on image processing using FPGAs. A diverse range of topics is covered, including parallel soft processors, memory management, image filters, segmentation, clustering, image analysis, and image compression. Applications include traffic sign recognition for autonomous driving, cell detection for histopathology, and video compression. Collectively, they represent the current state of the art in image processing using FPGAs.