64 research outputs found
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
Designing Neural Networks for Real-Time Systems
Artificial Neural Networks (ANNs) are increasingly being used within
safety-critical Cyber-Physical Systems (CPSs). They are often co-located with
traditional embedded software, and may perform advisory or control-based roles.
It is important to validate both the timing and functional correctness of these
systems. However, most approaches in the literature consider guaranteeing only
the functionality of ANN-based controllers. This issue stems largely from the
implementation strategies used within common neural network frameworks -- their
underlying source code is often simply unsuitable for formal techniques such as
static timing analysis. As a result, developers of safety-critical CPS must
rely on informal techniques such as measurement-based approaches to prove
correctness, techniques that provide weak guarantees at best. In this work we
address this challenge. We propose a design pipeline whereby neural networks
trained using the popular deep learning framework Keras are compiled to
functionally equivalent C code. This C code is restricted to simple constructs
that may be analysed by existing static timing analysis tools. As a result, if
compiled to a suitable time-predictable platform, all execution bounds may be
statically derived. To demonstrate the benefits of our approach we execute an
ANN trained to drive an autonomous vehicle around a race track. We compile the
ANN to the Patmos time-predictable controller, and show that we can derive
worst-case execution timings.
Comment: 4 pages, 2 figures. IEEE Embedded Systems Letters, 202
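The abstract above describes compiling a trained network to C that uses only simple constructs. The sketch below illustrates that idea and is not the authors' compiler: a hypothetical Python helper, `emit_dense_layer_c`, emits C source for one fully connected layer with constant loop bounds and no dynamic allocation, which is the style of code static timing analysis tools can usually bound. The function name, the ReLU activation and the weight layout are all assumptions made for the example.

```python
def emit_dense_layer_c(name, weights, biases):
    """Emit C source for out = relu(W * in + b) with compile-time loop bounds.

    Hypothetical helper for illustration only. `weights` is a list of n_out
    rows, each holding n_in numbers; `biases` has n_out numbers.
    """
    n_out, n_in = len(weights), len(weights[0])
    if len(biases) != n_out or any(len(row) != n_in for row in weights):
        raise ValueError("weights must be n_out x n_in and biases length n_out")

    def lit(x):
        # Format a number as a C float literal, e.g. 1 -> "1.0f".
        return repr(float(x)) + "f"

    w_rows = ",\n".join(
        "    {" + ", ".join(lit(w) for w in row) + "}" for row in weights
    )
    b_vals = ", ".join(lit(b) for b in biases)
    lines = [
        f"static const float {name}_W[{n_out}][{n_in}] = {{",
        w_rows,
        "};",
        f"static const float {name}_B[{n_out}] = {{{b_vals}}};",
        "",
        f"void {name}(const float in[{n_in}], float out[{n_out}]) {{",
        # Both loop bounds are literal constants, so the iteration count is fixed.
        f"    for (int i = 0; i < {n_out}; i++) {{",
        f"        float acc = {name}_B[i];",
        f"        for (int j = 0; j < {n_in}; j++) {{",
        f"            acc += {name}_W[i][j] * in[j];",
        "        }",
        "        out[i] = acc > 0.0f ? acc : 0.0f;  /* ReLU (assumed activation) */",
        "    }",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(emit_dense_layer_c("dense0", [[1, -2], [0.5, 0.25], [0, 0]], [0, 1, 2]))
```

Because the weights and layer sizes are baked in as constants, the generated function has no pointers to heap memory and no data-dependent loop counts; only the activation contains a branch.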
Tools for efficient Deep Learning
In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges for designing DNNs that are efficient in time, in resources and in power consumption.
We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) more stable through layer-wise gradient scaling. Experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: it replaces standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel polynomial-time heuristic algorithm. Common DNNs pruned by SPGC achieve up to 1% higher accuracy than prior work.
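To make the effect of swapping standard convolution for group convolution concrete, the following sketch counts convolution weights for both cases. It uses the standard parameter-count formula for grouped convolution and says nothing about how SPGC itself chooses channel permutations; the function name `conv_weight_count` and the layer sizes are assumptions chosen for illustration.

```python
def conv_weight_count(c_in, c_out, kernel, groups=1):
    """Number of weights (bias excluded) in a 2D convolution layer.

    With `groups` groups, each output channel only sees c_in / groups
    input channels, so the weight count shrinks by a factor of `groups`.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both c_in and c_out")
    return (c_in // groups) * c_out * kernel * kernel


if __name__ == "__main__":
    standard = conv_weight_count(64, 128, 3)           # dense connectivity
    grouped = conv_weight_count(64, 128, 3, groups=4)  # block-structured
    print(standard, grouped, standard // grouped)
```

The reduction is regular: the removed weights form whole blocks between channel groups, which is why this kind of pruning avoids the irregular sparsity mentioned above.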
This thesis also addresses the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53 and 9.47 times, respectively, on Polybench/C. POLSCA achieves a 1.5 times speedup over hardware designs generated directly by high-level synthesis on Polybench/C.
Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures, using advanced pipelining techniques to address the challenges posed by heterogeneous convolution and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon improves resource/power consumption efficiency by 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets.
All these tools are open source, some of which have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
Real-time 3D object detection and SLAM fusion in a low-cost LiDAR test vehicle setup
Recent research on deep learning for perception in autonomous driving relies heavily on LiDAR point cloud data as input to neural networks, highlighting the importance of LiDAR technology in the field of Autonomous Driving (AD). Many of the vehicle platforms used to create the datasets released for developing these networks, as well as some commercial AD solutions on the market, invest heavily in large sensor arrays spanning several sensor modalities. These costs create a barrier to entry for low-cost solutions to critical perception tasks such as Object Detection and SLAM. This paper explores current vehicle platforms and proposes a low-cost, LiDAR-based test vehicle platform capable of running critical perception tasks (Object Detection and SLAM) in real time. Additionally, we propose a deep learning-based inference model for Object Detection deployed on a resource-constrained device, together with a graph-based SLAM implementation. We discuss important considerations arising from the real-time processing requirement and present results demonstrating the usability of the developed work on the proposed low-cost platform.
Deep Learning Algorithms on Embedded Devices
This work describes widely used Deep Learning architectures and models for object detection and classification in video, with an emphasis on their usability on embedded devices. We cover the steps and reasoning behind choosing the most appropriate embedded hardware for our application. Our test application, named Smart car park, consists mainly of vehicle detection and free parking space detection using Deep Learning methods. It monitors the number of vehicles present in the car park and determines whether each one occupies a parking spot, all running on an embedded device. We then cover the configuration steps for the embedded device, with emphasis on hardware optimization for speed. We compare the selected inference models, rated mainly on speed and F1 score, which have the biggest impact on our application. The best candidate is selected and used to implement and test our application.
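The abstract above rates detection models by F1 score. As a reminder of what that metric measures, here is a minimal sketch that computes it from detection counts using the standard definition (the harmonic mean of precision and recall). The function name and the example counts are hypothetical and are not results from the work.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 score from detection counts: harmonic mean of precision and recall."""
    if true_pos == 0:
        # No correct detections: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts: 6 vehicles detected correctly, 2 false alarms,
    # 4 vehicles missed.
    print(round(f1_score(6, 2, 4), 3))
```

A detector that misses vehicles and one that raises false alarms are both penalised, which makes F1 a reasonable single number to set against inference speed when picking a model for an embedded device.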