
    Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions

    In the past decade, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in various Artificial Intelligence tasks. To accelerate the experimentation and development of CNNs, several software frameworks have been released, primarily targeting power-hungry CPUs and GPUs. In this context, reconfigurable hardware in the form of FPGAs constitutes a potential alternative platform that can be integrated into the existing deep learning ecosystem to provide a tunable balance between performance, power consumption and programmability. In this paper, a survey of the existing CNN-to-FPGA toolflows is presented, comprising a comparative study of their key characteristics, which include the supported applications, architectural choices, design space exploration methods and achieved performance. Moreover, major challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented. Finally, a uniform evaluation methodology is proposed, aiming at the comprehensive, complete and in-depth evaluation of CNN-to-FPGA toolflows.
    Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201

    Designing Neural Networks for Real-Time Systems

    Artificial Neural Networks (ANNs) are increasingly being used within safety-critical Cyber-Physical Systems (CPSs). They are often co-located with traditional embedded software and may perform advisory or control-based roles. It is important to validate both the timing and functional correctness of these systems. However, most approaches in the literature consider guaranteeing only the functionality of ANN-based controllers. This issue stems largely from the implementation strategies used within common neural network frameworks -- their underlying source code is often simply unsuitable for formal techniques such as static timing analysis. As a result, developers of safety-critical CPSs must rely on informal techniques such as measurement-based approaches to prove correctness, techniques that provide weak guarantees at best. In this work we address this challenge. We propose a design pipeline whereby neural networks trained using the popular deep learning framework Keras are compiled to functionally equivalent C code. This C code is restricted to simple constructs that may be analysed by existing static timing analysis tools. As a result, if compiled for a suitable time-predictable platform, all execution bounds may be statically derived. To demonstrate the benefits of our approach, we execute an ANN trained to drive an autonomous vehicle around a race track. We compile the ANN to the Patmos time-predictable controller and show that we can derive worst-case execution timings.
    Comment: 4 pages, 2 figures. IEEE Embedded Systems Letters, 202
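
    The key enabler in such a pipeline is that the generated C uses only constructs whose execution time a static analyser can bound: fixed-size arrays, constant loop bounds and no data-dependent control flow. The sketch below illustrates that idea only; it is not the authors' tool, and the model file name and emitted function names are invented for the example. It walks the Dense layers of a trained Keras model and prints one C function per layer.

        import numpy as np
        from tensorflow import keras

        def c_array(values):
            """Render a (possibly nested) array as a C initializer list."""
            if np.ndim(values) == 0:
                return f"{float(values)}f"
            return "{" + ", ".join(c_array(v) for v in values) + "}"

        def emit_dense_layer_c(name, weights, bias):
            """Emit a C function computing y = x*W + b using only constant loop bounds."""
            n_in, n_out = weights.shape
            return "\n".join([
                f"void {name}(const float x[{n_in}], float y[{n_out}]) {{",
                f"    static const float W[{n_in}][{n_out}] = {c_array(weights)};",
                f"    static const float b[{n_out}] = {c_array(bias)};",
                f"    for (int j = 0; j < {n_out}; j++) {{",
                "        y[j] = b[j];",
                f"        for (int i = 0; i < {n_in}; i++)",
                "            y[j] += x[i] * W[i][j];",
                "    }",
                "}",
            ])

        model = keras.models.load_model("driver.h5")       # hypothetical trained model
        for k, layer in enumerate(model.layers):
            if isinstance(layer, keras.layers.Dense):
                W, b = layer.get_weights()                  # W: (n_in, n_out), b: (n_out,)
                print(emit_dense_layer_c(f"dense_{k}", W, b))

    Convolutional and activation layers would be emitted analogously as deeper, still constant-bound loop nests; the essential property for static timing analysis on platforms such as Patmos is that no control flow in the emitted C depends on run-time data.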

    Tools for efficient Deep Learning

    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges of designing DNNs that are efficient in time, resources and power consumption. We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) more stable through layer-wise gradient scaling; empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC achieve up to 1% higher accuracy than prior work. This thesis also addresses the gap between DNN descriptions and executables with Polygeist for software and POLSCA for hardware. Several novel techniques, e.g. statement splitting and memory partitioning, are explored and used to extend polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53x and 9.47x respectively on Polybench/C. POLSCA achieves a 1.5x speedup over hardware designs generated directly from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques, addressing the challenges posed by heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon improves resource/power efficiency by 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
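
    The abstract describes Aegis only as stabilising mixed precision training via layer-wise gradient scaling, without giving the algorithm; the PyTorch snippet below is therefore a generic, assumed illustration of that idea, with the class name, scale-update rule and hyperparameters invented for the example. Each layer's gradients are multiplied by a per-layer scale as they are produced in the backward pass, unscaled before the optimiser step, and each scale adapts independently.

        import torch

        class LayerwiseGradScaler:
            """One gradient scale per parameter group instead of a single global scale."""
            def __init__(self, model, init_scale=2.0**10, growth=2.0, backoff=0.5):
                self.model = model
                self.growth, self.backoff = growth, backoff
                self.scales = {name: init_scale for name, _ in model.named_parameters()}
                # Multiply each parameter's gradient by its layer scale during backward.
                for name, p in model.named_parameters():
                    p.register_hook(lambda g, n=name: g * self.scales[n])

            def step(self, optimizer):
                overflow = {}
                for name, p in self.model.named_parameters():
                    if p.grad is None:
                        continue
                    overflow[name] = not bool(torch.isfinite(p.grad).all())
                    p.grad.div_(self.scales[name])            # unscale before the update
                if not any(overflow.values()):
                    optimizer.step()                          # skip the step on any overflow
                for name, bad in overflow.items():            # adapt each layer's scale
                    self.scales[name] *= self.backoff if bad else self.growth

    Compared with the single global loss scale of standard mixed precision training, keeping one scale per layer lets layers with very different gradient magnitudes be handled independently, which is the general intuition behind layer-wise schemes; how Aegis actually chooses and updates its scales is described in the thesis itself.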

    Real-time 3D object detection and SLAM fusion in a low-cost LiDAR test vehicle setup

    Recent research on deep learning for autonomous driving perception focuses heavily on LiDAR point cloud data as input to neural networks, highlighting the importance of LiDAR technology in the field of Autonomous Driving (AD). Many of the vehicle platforms used to create the datasets behind these networks, as well as some commercial AD solutions on the market, invest heavily in large sensor arrays spanning several modalities. These costs create a barrier to entry for low-cost solutions to critical perception tasks such as Object Detection and SLAM. This paper reviews current vehicle platforms and proposes a low-cost, LiDAR-based test vehicle platform capable of running critical perception tasks (Object Detection and SLAM) in real time. We further present a deep learning-based Object Detection inference model deployed on a resource-constrained device, together with a graph-based SLAM implementation, discuss the design considerations imposed by the real-time processing requirement, and report results demonstrating the usability of the developed work on the proposed low-cost platform.
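
    As a rough illustration of what graph-based SLAM involves (a generic toy example, not this paper's implementation): poses become graph nodes, odometry and loop-closure measurements become edges encoding relative poses, and the trajectory is recovered by least-squares optimisation over the graph. A minimal 2D sketch using SciPy:

        import numpy as np
        from scipy.optimize import least_squares

        def rel_pose(xi, xj):
            """Pose of xj expressed in the frame of xi; each pose is (x, y, theta)."""
            dx, dy = xj[0] - xi[0], xj[1] - xi[1]
            c, s = np.cos(xi[2]), np.sin(xi[2])
            return np.array([c * dx + s * dy, -s * dx + c * dy, xj[2] - xi[2]])

        def residuals(flat, edges, n):
            poses = flat.reshape(n, 3)
            res = [poses[0]]                    # anchor the first pose at the origin
            for i, j, z in edges:
                r = rel_pose(poses[i], poses[j]) - z
                r[2] = (r[2] + np.pi) % (2 * np.pi) - np.pi   # wrap the angle residual
                res.append(r)
            return np.concatenate(res)

        # Toy graph: drive around a square (odometry edges) and close the loop.
        edges = [(0, 1, np.array([1.0, 0.0, np.pi / 2])),
                 (1, 2, np.array([1.0, 0.0, np.pi / 2])),
                 (2, 3, np.array([1.0, 0.0, np.pi / 2])),
                 (3, 0, np.array([1.0, 0.0, np.pi / 2]))]     # loop-closure edge
        init = 0.1 * np.random.default_rng(0).standard_normal(4 * 3)   # noisy guess
        sol = least_squares(residuals, init, args=(edges, 4))
        print(sol.x.reshape(4, 3))              # optimised poses of the square path

    A real LiDAR pipeline derives the edges from scan matching and loop-closure detection and optimises thousands of poses, but the underlying structure is the same.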

    Deep Learning Algorithms on Embedded Devices

    This work describes currently widely used Deep Learning architectures and methods for object detection and classification in video, with the intention of using them on embedded systems. We cover the steps and reasoning behind choosing the most appropriate embedded hardware for our application. Our test application, named Smart Car Park, consists of vehicle detection and free parking space detection using Deep Learning methods: it monitors the number of vehicles present in a car park and determines whether each occupies a parking spot, all running on an embedded device. We then cover the configuration steps for the chosen device, with emphasis on hardware optimisation for speed. Finally, we compare the available inference models, rated mainly on the categories with the biggest impact on our application, such as speed and F1 score; the best candidate is selected and used to evaluate the application.
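
    For reference, the F1 score used to rank the candidate detectors is the harmonic mean of precision and recall; the small function below (with purely illustrative numbers) shows the computation.

        def f1_score(tp: int, fp: int, fn: int) -> float:
            """F1 = harmonic mean of detection precision and recall."""
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            if precision + recall == 0:
                return 0.0
            return 2 * precision * recall / (precision + recall)

        # Illustrative only: 90 correct detections, 10 false alarms, 20 missed vehicles.
        print(f1_score(tp=90, fp=10, fn=20))    # ~0.857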