Implementation of the K-Means Algorithm on Heterogeneous Devices: A Use Case Based on an Industrial Dataset
This paper presents and analyzes a heterogeneous implementation of an industrial use case based on K-means that targets symmetric multiprocessing (SMP), GPUs and FPGAs. We present how the application can be optimized from an algorithmic point of view and how this optimization performs on two heterogeneous platforms. The presented implementation relies on the OmpSs programming model, which introduces a simplified pragma-based syntax for the communication between the main processor and the accelerators. The programmer can improve performance by explicitly specifying the data memory accesses or copies. As expected, the newer SMP+GPU system studied is more powerful than the older SMP+FPGA system. However, the latter is sufficient to meet the requirements of our use case, and we show that it uses less energy when considering only the active power of the execution.
This work is partially supported by the European Union H2020 project AXIOM (grant
agreement n. 645496), HiPEAC (grant agreement n. 687698), and Mont-Blanc (grant
agreements n. 288777, 610402 and 671697), the Spanish Government Programa Severo
Ochoa (SEV-2015-0493), the Spanish Ministry of Science and Technology
(TIN2015-65316-P) and the Departament d’Innovació, Universitats i Empresa de la
Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns
d’Execució Paral·lels (2014-SGR-1051).
Peer Reviewed. Postprint (author's final draft)
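For readers unfamiliar with the algorithm named in the abstract above, the following is a minimal sketch of plain K-means (Lloyd's iteration) in Python. It is not the authors' OmpSs implementation, and the abstract does not say which steps they offload to the GPU or FPGA; the function names `assign`, `update` and `kmeans` are illustrative assumptions. The assignment step is independent per point, which is why K-means is often considered for accelerator offload in general.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    # Final assignment so the labels always match the returned centroids.
    return centroids, assign(points, centroids)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centroids, final_labels = kmeans(points, [(0, 0), (10, 10)])
```

Here `points` and the two initial centroids are made-up toy data, chosen only so the iteration converges in a couple of passes.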
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
Efficient Implementation on Low-Cost SoC-FPGAs of TLSv1.2 Protocol with ECC_AES Support for Secure IoT Coordinators
Security management for IoT applications is a critical research field, especially given how widely performance varies across IoT devices. In this paper, we present high-performance client/server coordinators on low-cost SoC-FPGA devices for secure IoT data collection. Security is ensured by using the Transport Layer Security (TLS) protocol based on the TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256 cipher suite. The hardware architecture of the proposed coordinators is based on SW/HW co-design: the hardware accelerator core implements Elliptic Curve Scalar Multiplication (ECSM), which is the core operation of Elliptic Curve Cryptosystems (ECC), while the control of the overall TLS scheme is performed in software by an ARM Cortex-A9 microprocessor. Building the ECC accelerator core around an ARM microprocessor improves not only ECSM execution but also the performance of the overall cryptosystem. Integrating the ARM processor also makes it possible to exploit embedded Linux features for high system flexibility. As a result, the proposed ECC accelerator requires limited area, with only 3395 LUTs on the Zynq device used to perform high-speed, 233-bit ECSMs in 413 µs, with a 50 MHz clock. Moreover, the generation of a 384-bit TLS handshake secret key between client and server coordinators requires 67.5 ms on a low-cost Zynq 7Z007S device.
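To illustrate what ECSM computes, here is a minimal double-and-add sketch in Python. It is not the paper's hardware design: the abstract reports 233-bit ECSMs but does not describe the curve arithmetic, so this sketch swaps in a tiny textbook prime-field curve, y^2 = x^3 + 2x + 2 over GF(17) with base point (5, 1), purely as an assumption for readability. The names `point_add` and `scalar_mult` are hypothetical, and the code is neither constant-time nor suitable for real cryptography.

```python
P = 17      # toy field prime (assumption, not from the paper)
A = 2       # coefficient a of the toy curve y^2 = x^3 + a*x + b
G = (5, 1)  # base point on the toy curve; None denotes the point at infinity

def point_add(p1, p2):
    """Add two points on the toy curve using the affine group law."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # a point plus its inverse is the point at infinity
    if p1 == p2:
        # Tangent slope for point doubling.
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P, -1, P) % P
    else:
        # Chord slope for adding two distinct points.
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)

def scalar_mult(k, point):
    """Compute k * point by scanning the bits of k from least significant."""
    result = None
    addend = point
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result

doubled = scalar_mult(2, G)
```

The loop runs once per bit of the scalar, so a 233-bit scalar needs on the order of 233 doublings plus an addition for each set bit. For scale, the reported 413 µs at a 50 MHz clock corresponds to about 20,650 clock cycles per ECSM.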
General purpose readout board πLUP: overview and results
This work gives an overview of the PCI-Express board πLUP, focusing on
the motivation that led to its development, the technological choices adopted
and its performance. The πLUP card was designed by INFN and University of
Bologna as a readout interface candidate to be used after the Phase-II upgrade
of the Pixel Detector of the ATLAS and CMS experiments at LHC. The same team in
Bologna is also responsible for the design and commissioning of the ReadOut
Driver (ROD) board, currently implemented in all four layers of the ATLAS
Pixel Detector (Insertable B-Layer, B-Layer, Layer-1 and Layer-2), and has
acquired in past years expertise on the ATLAS readout chain and the
problems arising in such experiments. Although the πLUP was designed to
fulfill a specific task, it is highly versatile and might fit a wide variety of
applications, some of which will be discussed in this work. Two
7th-generation Xilinx FPGAs are mounted on the board: a Zynq-7 with an
embedded dual core ARM Processor and a Kintex-7. The latter features sixteen
12.5Gbps transceivers, allowing the board to interface easily to any other
electronic board, electrically and/or optically, at the current
bandwidth of the experiments for LHC. Many data-transmission protocols have
been tested at different speeds; results will be discussed later in this work.
Two batches of πLUP boards have been fabricated and tested: two boards in
the first batch (version 1.0) and four boards in the second batch (version
1.1), which incorporates all the patches and improvements required by the first
version.
Comment: 6 pages, 10 figures, 21st Real Time Conference, winner of "2018 NPSS
Student Paper Award Second Prize"
Quantifying the latency benefits of near-edge and in-network FPGA acceleration
Transmitting data to cloud datacenters in distributed IoT applications introduces significant communication latency, but is often the only feasible solution when source nodes are computationally limited. To address latency concerns, cloudlets, in-network computing, and more capable edge nodes are all being explored as ways of moving processing capability towards the edge of the network. Hardware acceleration using Field Programmable Gate Arrays (FPGAs) is also seeing increased interest due to reduced computation latency and improved efficiency. This paper evaluates the implications of these offloading approaches using a case-study neural-network-based image classification application, quantifying both the computation and communication latency resulting from different platform choices. We consider communication latency including the ingestion of packets for processing on the target platform, showing that this varies significantly with the choice of platform. We demonstrate that emerging in-network accelerator approaches offer much improved and predictable performance as well as better scaling to support multiple data sources.