Energy Efficient Parallel K-Means Clustering for an Intel Hybrid Multi-Chip Package
FPGA devices have proven to be good candidates for accelerating applications across different research topics. For instance, machine learning applications such as K-Means clustering usually rely on large amounts of data, and, despite the performance offered by other architectures, FPGAs can offer better energy efficiency. With that in mind, Intel® has launched a platform that integrates a multicore processor and an FPGA in the same package, enabling low-latency, coherent, fine-grained data offload. In this paper, we present a parallel implementation of the K-Means clustering algorithm for this novel platform, written in OpenCL, and compare it against other platforms. We found that the CPU+FPGA platform was 70.71% to 85.92% more energy efficient than the CPU-only approach, with the Standard and Tiny input sizes respectively, and a performance improvement of up to 68.21% was obtained with the Tiny input size. Furthermore, it was up to 7.2× more energy efficient than an Intel® Xeon Phi™, 21.5× more than a cluster of Raspberry Pi boards, and 3.8× more than the low-power MPPA-256 architecture when the Standard input size was used.
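For readers unfamiliar with the algorithm named in this abstract, the following is a minimal sketch of textbook (Lloyd-style) K-Means in plain Python. It is not the paper's OpenCL implementation; the function names `assign`, `update`, and `kmeans` are hypothetical and chosen only for illustration. The assignment step treats every point independently, which is the part that parallel implementations typically distribute across compute units.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def assign(points: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Assignment step: label each point with the index of its nearest centroid.

    Each point is handled independently, so this loop is the data-parallel part.
    """
    labels = []
    for p in points:
        # Squared Euclidean distance to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points: Sequence[Point], labels: Sequence[int],
           centroids: Sequence[Point]) -> List[Point]:
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Assumption for this sketch: an empty cluster keeps its old centroid.
            new_centroids.append(tuple(old))
            continue
        dim = len(old)
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centroids


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           max_iters: int = 100) -> Tuple[List[Point], List[int]]:
    """Alternate update and assignment until the labels stop changing."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iters):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels
```

Calling `kmeans(points, initial_centroids)` returns the final centroids and one label per point; the initial centroids are supplied by the caller so the sketch stays deterministic.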
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows. Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
An FPGA-based controller for collaborative robotics
The use of robots is becoming more common in society. Industrial robots are being developed to work with people, and lower-force collaborative robots are being developed to help people in their everyday lives. These robots may need fast and sophisticated motion control and behavioral algorithms, but are expected to be more compact and lower in cost. This paper proposes a processor-plus-FPGA solution for the control systems of such robots, where the FPGA performs all real-time tasks, freeing the processor to run lower-frequency high-level control and to interface with other devices such as camera systems. A demonstrator robot is designed, combining multi-axis motion control with 3D robot vision.
BRAMAC: Compute-in-BRAM Architectures for Multiply-Accumulate on FPGAs
Deep neural network (DNN) inference using reduced integer precision has been
shown to achieve significant improvements in memory utilization and compute
throughput with little or no accuracy loss compared to full-precision
floating-point. Modern FPGA-based DNN inference relies heavily on the on-chip
block RAM (BRAM) for model storage and the digital signal processing (DSP) unit
for implementing the multiply-accumulate (MAC) operation, a fundamental DNN
primitive. In this paper, we enhance the existing BRAM to also compute MAC by
proposing BRAMAC (Compute-in-BRAM Architectures for Multiply-Accumulate). BRAMAC supports
2's complement 2- to 8-bit MAC in a small dummy BRAM array using a hybrid
bit-serial & bit-parallel data flow. Unlike previous compute-in-BRAM
architectures, BRAMAC allows read/write access to the main BRAM array while
computing in the dummy BRAM array, enabling both persistent and tiling-based
DNN inference. We explore two BRAMAC variants: BRAMAC-2SA (with 2 synchronous
dummy arrays) and BRAMAC-1DA (with 1 double-pumped dummy array).
BRAMAC-2SA/BRAMAC-1DA can boost the peak MAC throughput of a large Arria-10
FPGA by 2.6×/2.1×, 2.3×/2.0×, and 1.9×/1.7× for 2-bit, 4-bit, and 8-bit precisions, respectively, at
the cost of 6.8%/3.4% increase in the FPGA core area. By adding
BRAMAC-2SA/BRAMAC-1DA to a state-of-the-art tiling-based DNN accelerator, an
average speedup of 2.05×/1.7× and 1.33×/1.52× can
be achieved for AlexNet and ResNet-34, respectively, across different model
precisions. Comment: 11 pages, 13 figures, 3 tables, FCCM conference 202
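The abstract above mentions 2's complement MAC with a bit-serial component. As a rough illustration of that general idea only, the sketch below computes a multiply-accumulate by walking one operand bit by bit while the other operand is used whole. It does not model BRAMAC's dummy array or its actual hybrid dataflow; the function name `bit_serial_mac` and its parameters are hypothetical.

```python
from typing import Sequence


def bit_serial_mac(weights: Sequence[int], inputs: Sequence[int], bits: int) -> int:
    """Accumulate sum(w * x), consuming each input x one bit at a time.

    Inputs are interpreted as `bits`-wide 2's complement values, so the
    most significant bit carries a weight of -2**(bits - 1).
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    acc = 0
    for w, x in zip(weights, inputs):
        if not lo <= x <= hi:
            raise ValueError(f"{x} does not fit in {bits}-bit 2's complement")
        pattern = x & ((1 << bits) - 1)  # raw 2's complement bit pattern of x
        for i in range(bits):
            if (pattern >> i) & 1:
                partial = w << i  # the whole weight is used at once
                # The sign bit contributes negatively; all other bits add.
                acc += -partial if i == bits - 1 else partial
    return acc
```

For example, `bit_serial_mac([3, -2, 1], [-4, 5, 7], bits=4)` accumulates `3*(-4) + (-2)*5 + 1*7`, matching an ordinary dot product while touching only one input bit per inner step.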