33 research outputs found
FeCaffe: FPGA-enabled Caffe with OpenCL for Deep Learning Training and Inference on Intel Stratix 10
Deep learning and Convolutional Neural Network (CNN) have becoming
increasingly more popular and important in both academic and industrial areas
in recent years cause they are able to provide better accuracy and result in
classification, detection and recognition areas, compared to traditional
approaches. Currently, there are many popular frameworks in the market for deep
learning development, such as Caffe, TensorFlow, Pytorch, and most of
frameworks natively support CPU and consider GPU as the mainline accelerator by
default. FPGA device, viewed as a potential heterogeneous platform, still
cannot provide a comprehensive support for CNN development in popular
frameworks, in particular to the training phase. In this paper, we firstly
propose the FeCaffe, i.e. FPGA-enabled Caffe, a hierarchical software and
hardware design methodology based on the Caffe to enable FPGA to support
mainline deep learning development features, e.g. training and inference with
Caffe. Furthermore, we provide some benchmarks with FeCaffe by taking some
classical CNN networks as examples, and further analysis of kernel execution
time in details accordingly. Finally, some optimization directions including
FPGA kernel design, system pipeline, network architecture, user case
application and heterogeneous platform levels, have been proposed gradually to
improve FeCaffe performance and efficiency. The result demonstrates the
proposed FeCaffe is capable of supporting almost full features during CNN
network training and inference respectively with high degree of design
flexibility, expansibility and reusability for deep learning development.
Compared to prior studies, our architecture can support more network and
training settings, and current configuration can achieve 6.4x and 8.4x average
execution time improvement for forward and backward respectively for LeNet.Comment: 11 pages, 7 figures and 4 table
Using Efficient Path Profiling to Optimize Memory Consumption of On-Chip Debugging for High-Level Synthesis
High-Level Synthesis (HLS) for FPGAs is attracting popularity and is increasingly used to handle complex systems with multiple integrated components. To increase performance and efficiency, HLS flows now adopt several advanced optimization techniques. Aggressive optimizations and system level integration can cause the introduction of bugs that are only observable on-chip. Debugging support for circuits generated with HLS is receiving a considerable attention. Among the data that can be collected on chip for debugging, one of the most important is the state of the Finite State Machines (FSM) controlling the components of the circuit.
However, this usually requires a large amount of memory to trace the behavior during the execution. This work proposes an approach that takes advantage of the HLS information and of the structure of the FSM to compress control flow traces and to integrate optimized components for on-chip debugging. The generated checkers analyze the FSM execution on-fly, automatically notifying when a bug is detected, localizing it and providing data about its cause. The traces are compressed using a software profiling technique, called Efficient Path Profiling (EPP), adapted for the debugging of hardware accelerators generated with HLS. With this technique, the size of the memory used to store control flow traces can be reduced up to 2 orders of magnitude, compared to state-of-the-art
CNN2Gate: an implementation of convolutional neural networks inference on FPGAs with automated design space exploration
ABSTRACT: Convolutional Neural Networks (CNNs) have a major impact on our society, because of the numerous services they provide. These services include, but are not limited to image classification, video analysis, and speech recognition. Recently, the number of researches that utilize FPGAs to implement CNNs are increasing rapidly. This is due to the lower power consumption and easy reconfigurability that are offered by these platforms. Because of the research efforts put into topics, such as architecture, synthesis, and optimization, some new challenges are arising for integrating suitable hardware solutions to high-level machine learning software libraries. This paper introduces an integrated framework (CNN2Gate), which supports compilation of a CNN model for an FPGA target. CNN2Gate is capable of parsing CNN models from several popular high-level machine learning libraries, such as Keras, Pytorch, Caffe2, etc. CNN2Gate extracts computation flow of layers, in addition to weights and biases, and applies a “given” fixed-point quantization. Furthermore, it writes this information in the proper format for the FPGA vendor’s OpenCL synthesis tools that are then used to build and run the project on FPGA. CNN2Gate performs design-space exploration and fits the design on different FPGAs with limited logic resources automatically. This paper reports results of automatic synthesis and design-space exploration of AlexNet and VGG-16 on various Intel FPGA platforms
Image Processing Using FPGAs
This book presents a selection of papers representing current research on using field programmable gate arrays (FPGAs) for realising image processing algorithms. These papers are reprints of papers selected for a Special Issue of the Journal of Imaging on image processing using FPGAs. A diverse range of topics is covered, including parallel soft processors, memory management, image filters, segmentation, clustering, image analysis, and image compression. Applications include traffic sign recognition for autonomous driving, cell detection for histopathology, and video compression. Collectively, they represent the current state-of-the-art on image processing using FPGAs
Speed reading in the dark : Accelerating functional encryption for quadratic functions with reprogrammable hardware
Functional encryption is a new paradigm for encryption where decryption does not give the entire plaintext but only some function of it. Functional encryption has great potential in privacy-enhancing technologies but suffers from excessive computational overheads. We introduce the first hardware accelerator that supports functional encryption for quadratic functions. Our accelerator is implemented on a reprogrammable system-on-chip following the hardware/software codesign methogol-ogy. We benchmark our implementation for two privacy-preserving machine learning applications: (1) classification of handwritten digits from the MNIST database and (2) classification of clothes images from the Fashion MNIST database. In both cases, classification is performed with encrypted images. We show that our implementation offers speedups of over 200 times compared to a published software implementation and permits applications which are unfeasible with software-only solutions.Peer reviewe
Speed reading in the dark : Accelerating functional encryption for quadratic functions with reprogrammable hardware
Functional encryption is a new paradigm for encryption where decryption does not give the entire plaintext but only some function of it. Functional encryption has great potential in privacy-enhancing technologies but suffers from excessive computational overheads. We introduce the first hardware accelerator that supports functional encryption for quadratic functions. Our accelerator is implemented on a reprogrammable system-on-chip following the hardware/software codesign methogol-ogy. We benchmark our implementation for two privacy-preserving machine learning applications: (1) classification of handwritten digits from the MNIST database and (2) classification of clothes images from the Fashion MNIST database. In both cases, classification is performed with encrypted images. We show that our implementation offers speedups of over 200 times compared to a published software implementation and permits applications which are unfeasible with software-only solutions.Peer reviewe
High-Speed Hardware Architectures and FPGA Benchmarking of CRYSTALS-Kyber, NTRU, and Saber
Performance in hardware has typically played a significant role in differentiating among leading candidates in cryptographic standardization efforts. Winners of two past NIST cryptographic contests (Rijndael in case of AES and Keccak in case of SHA-3) were ranked consistently among the two fastest candidates when implemented using FPGAs and ASICs. Hardware implementations of cryptographic operations may quite easily outperform software implementations for at least a subset of major performance metrics, such as latency, number of operations per second, power consumption, and energy usage, as well as in terms of security against physical attacks, including side-channel analysis. Using hardware also permits much higher flexibility in trading one subset of these properties for another. This paper presents high-speed hardware architectures for four lattice-based CCA-secure Key Encapsulation Mechanisms (KEMs), representing three NIST PQC finalists: CRYSTALS-Kyber, NTRU (with two distinct variants, NTRU-HPS and NTRU-HRSS), and Saber. We rank these candidates among each other and compare them with all other Round 3 KEMs based on the data from the previously reported work
Recent Advances in Embedded Computing, Intelligence and Applications
The latest proliferation of Internet of Things deployments and edge computing combined with artificial intelligence has led to new exciting application scenarios, where embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately contribute to the fostering of the offloading processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems
The impact of novel people, places, and activities, in tourism
As part of an undergraduate research design class, we measured tourism experiences of 617tourists, during a day, and their potential impact, in a quantitative, cross-sectional manner. In May2023, a total of 30 tourism and experience design students teamed up from Breda University ofApplied Sciences, Netherlands, and Brigham Young University students, United States, andapproached tourists at 45 various tourist hot spots in the Rotterdam and the Amsterdam are
The impact of novel people, places, and activities, in tourism
As part of an undergraduate research design class, we measured tourism experiences of 617tourists, during a day, and their potential impact, in a quantitative, cross-sectional manner. In May2023, a total of 30 tourism and experience design students teamed up from Breda University ofApplied Sciences, Netherlands, and Brigham Young University students, United States, andapproached tourists at 45 various tourist hot spots in the Rotterdam and the Amsterdam are