    Real-time implementation of 3D LiDAR point cloud semantic segmentation in an FPGA

    Master's dissertation in Informatics Engineering. In the last few years, the automotive industry has relied heavily on deep learning applications for perception solutions. With data-heavy sensors, such as LiDAR, becoming a standard, the task of developing low-power, real-time applications has become increasingly challenging. To obtain maximum computational efficiency, one can no longer focus solely on the software side of such applications while disregarding the underlying hardware. In this thesis, a hardware-software co-design approach is used to implement an inference application leveraging SqueezeSegV3, a LiDAR-based convolutional neural network, on the Versal ACAP VCK190 FPGA. Automotive requirements drive the development of the proposed solution, with real-time performance and low power consumption as the target metrics. A first experiment validates the suitability of Xilinx’s Vitis-AI tool for the deployment of deep convolutional neural networks on FPGAs. Both the ResNet-18 and SqueezeNet neural networks are deployed to the Zynq UltraScale+ MPSoC ZCU104 and Versal ACAP VCK190 FPGAs. The results show that both networks far exceed the real-time requirements while consuming little power. Compared to an NVIDIA RTX 3090 GPU, the performance per watt during inference of the two networks is 12x and 47.8x higher on the Zynq UltraScale+ MPSoC ZCU104 and 15.1x and 26.6x higher on the Versal ACAP VCK190 FPGA. These results are obtained with no drop in accuracy in the quantization step. A second experiment builds upon the results of the first by deploying a real-time application containing the SqueezeSegV3 model, using the Semantic-KITTI dataset. A frame rate of 11 Hz is achieved with a peak power consumption of 78 Watts. The quantization step results in a minimal accuracy and IoU degradation of 0.7 and 1.5 points respectively. A smaller version of the same model is also deployed, achieving a frame rate of 19 Hz and a peak power consumption of 76 Watts. The application performs semantic segmentation over the entire point cloud with a 360° field of view.
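
    To make the efficiency comparison concrete, the short sketch below shows how a performance-per-watt ratio of the kind reported above is computed from throughput and power measurements. All numbers in it are made-up placeholders, not the thesis' measured values.

    # Illustrative only: computing a performance-per-watt ratio of the kind
    # quoted in the abstract. The throughput and power figures are placeholders.
    def perf_per_watt(frames_per_second: float, power_watts: float) -> float:
        """Throughput normalised by power draw (frames per second per watt)."""
        return frames_per_second / power_watts

    # Hypothetical measurements for one network on two platforms.
    fpga_fps, fpga_power = 110.0, 20.0   # e.g. an FPGA platform
    gpu_fps, gpu_power = 800.0, 350.0    # e.g. a desktop GPU

    ratio = perf_per_watt(fpga_fps, fpga_power) / perf_per_watt(gpu_fps, gpu_power)
    print(f"The FPGA is {ratio:.1f}x more efficient per watt than the GPU")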

    Low-Precision Floating-Point for Efficient On-Board Deep Neural Network Processing

    One of the major bottlenecks in high-resolution Earth Observation (EO) space systems is the downlink between the satellite and the ground. Due to hardware limitations, on-board power limitations or ground-station operation costs, there is a strong need to reduce the amount of data transmitted. Various processing methods can be used to compress the data. One of them is the use of on-board deep learning to extract relevant information in the data. However, most ground-based deep neural network parameters and computations are performed using single-precision floating-point arithmetic, which is not adapted to the context of on-board processing. We propose to rely on quantized neural networks and study how to combine low precision (mini) floating-point arithmetic with a Quantization-Aware Training methodology. We evaluate our approach with a semantic segmentation task for ship detection using satellite images from the Airbus Ship dataset. Our results show that 6-bit floating-point quantization for both weights and activations can compete with single-precision without significant accuracy degradation. Using a Thin U-Net 32 model, only a 0.3% accuracy degradation is observed with 6-bit minifloat quantization (a 6-bit equivalent integer-based approach leads to a 0.5% degradation). An initial hardware study also confirms the potential impact of such low-precision floating-point designs, but further investigation at the scale of a full inference accelerator is needed before concluding whether they are relevant in a practical on-board scenario.
    Comment: 7 pages, 7 figures, EDHPC 202
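
    As a rough illustration of the low-precision floating-point arithmetic discussed above, the sketch below fake-quantizes a tensor to an assumed 6-bit minifloat layout (1 sign, 3 exponent, 2 mantissa bits); the paper's exact format and rounding rules may differ, and in quantization-aware training this forward pass would be paired with a straight-through gradient estimator.

    import numpy as np

    def quantize_minifloat(x, exp_bits=3, man_bits=2):
        """Round values to the nearest representable minifloat value.
        Assumed layout: 1 sign bit, exp_bits exponent, man_bits mantissa,
        IEEE-like bias, no infinities or NaNs (an illustrative assumption)."""
        bias = 2 ** (exp_bits - 1) - 1
        max_exp = (2 ** exp_bits - 1) - bias      # largest unbiased exponent
        min_exp = 1 - bias                        # smallest normal exponent
        sign = np.sign(x)
        mag = np.abs(x)
        # Exponent of each value, clamped to the normal range (values below
        # 2**min_exp fall through to the subnormal quantization step).
        exp = np.clip(np.floor(np.log2(np.maximum(mag, 2.0 ** min_exp))),
                      min_exp, max_exp)
        step = 2.0 ** (exp - man_bits)            # quantization step at that exponent
        q = np.round(mag / step) * step           # round the mantissa
        max_val = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
        return sign * np.minimum(q, max_val)      # saturate instead of overflowing

    w = np.array([0.013, -0.4, 1.7, 25.0])
    print(quantize_minifloat(w))                  # weights snapped to 6-bit float values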

    Quantization-Aware NN Layers with High-throughput FPGA Implementation for Edge AI

    Over the past few years, several applications have been extensively exploiting the advantages of deep learning, in particular when using convolutional neural networks (CNNs). The intrinsic flexibility of such models makes them widely adopted in a variety of practical applications, from medical to industrial. In this latter scenario, however, consumer Personal Computer (PC) hardware is not always suitable for the potentially harsh conditions of the working environment and the strict timing requirements that industrial applications typically have. Therefore, the design of custom FPGA (Field Programmable Gate Array) solutions for network inference is gaining massive attention from researchers and companies alike. In this paper, we propose a family of network architectures composed of three kinds of custom layers working with integer arithmetic with customizable precision (down to just two bits). Such layers are designed to be effectively trained on classical GPUs (Graphics Processing Units) and then synthesized to FPGA hardware for real-time inference. The idea is to provide a trainable quantization layer, called Requantizer, acting both as a non-linear activation for neurons and a value rescaler to match the desired bit precision. This way, the training is not only quantization-aware, but also capable of estimating the optimal scaling coefficients to accommodate both the non-linear nature of the activations and the constraints imposed by the limited precision. In the experimental section, we test the performance of this kind of model both on classical PC hardware and in a case-study implementation of a signal peak detection device running on a real FPGA. We employ TensorFlow Lite for training and comparison, and use Xilinx FPGAs and Vivado for synthesis and implementation. The results show that the accuracy of the quantized networks is close to that of the floating-point version, without the need for representative data for calibration as in other approaches, and that the performance is better than that of dedicated peak detection algorithms. The FPGA implementation is able to run in real time at a rate of four gigapixels per second with moderate hardware resources, while achieving a sustained efficiency of 0.5 TOPS/W (tera operations per second per watt), in line with custom integrated hardware accelerators.
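
    A hedged sketch of what such a trainable quantization layer could look like is given below, written as a Keras layer with a learned clipping scale and a straight-through estimator for the rounding step. The name, interface and details here are assumptions for illustration, not the authors' exact Requantizer formulation.

    import tensorflow as tf

    class Requantizer(tf.keras.layers.Layer):
        """Sketch of a trainable quantization layer: a clipped activation with a
        learned scale, rounded to a small number of bits (illustrative only)."""

        def __init__(self, bits=2, **kwargs):
            super().__init__(**kwargs)
            self.levels = 2 ** bits - 1           # number of uniform quantization steps

        def build(self, input_shape):
            # Learned per-tensor scale; training adapts it to the activation range.
            self.scale = self.add_weight(name="scale", shape=(),
                                         initializer="ones", trainable=True)

        def call(self, x):
            # Clipped ReLU acts as the non-linearity, bounded by the learned scale.
            y = tf.clip_by_value(x, 0.0, self.scale)
            # Uniform quantization to `levels` steps within [0, scale].
            q = tf.round(y / self.scale * self.levels) / self.levels * self.scale
            # Straight-through estimator: the forward pass uses q, while gradients
            # flow as if the rounding were the identity.
            return y + tf.stop_gradient(q - y)

    In a model, a layer of this kind would simply follow each convolutional or dense layer in place of its usual activation function.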

    Remote Sensing Data Compression

    A huge amount of data is acquired nowadays by different remote sensing systems installed on satellites, aircraft, and UAVs. The acquired data then have to be transferred to image processing centres, stored and/or delivered to customers. In restricted scenarios, data compression is strongly desired or necessary. A wide diversity of coding methods can be used, depending on the requirements and their priority. In addition, the types and properties of images differ a lot; thus, practical implementation aspects have to be taken into account. The Special Issue paper collection taken as the basis of this book touches on all of the aforementioned items to some degree, giving the reader an opportunity to learn about recent developments and research directions in the field of image compression. In particular, lossless and near-lossless compression of multi- and hyperspectral images remains a current topic, since such images constitute data arrays of extremely large size with rich information that can be retrieved from them for various applications. Another important aspect is the impact of lossless compression on image classification and segmentation, where a reasonable compromise between the characteristics of compression and the final tasks of data processing has to be achieved. The problems of data transmission from UAV-based acquisition platforms, as well as the use of FPGAs and neural networks, have become very important. Finally, attempts to apply compressive sensing approaches in remote sensing image processing with positive outcomes are observed. We hope that readers will find our book useful and interesting.

    Data Collection and Utilization Framework for Edge AI Applications

    As the data being produced by IoT applications continues to explode, there is a growing need to bring computing power closer to the source of the data to meet the response time, power dissipation and cost goals of performance-critical applications in various domains such as the Industrial Internet of Things (IIoT), automated driving, medical imaging or surveillance, among others. This paper proposes a data collection and utilization framework that allows runtime platform and application data to be sent to an edge and cloud system via data collection agents running close to the platform. The agents are connected to a cloud system able to train AI models to improve the overall energy efficiency of an AI application executed on an edge platform. In the implementation part, we show the benefits of an FPGA-based platform for the task of object detection. Furthermore, we show that it is feasible to collect relevant data from an FPGA platform, transmit the data to a cloud system for processing, and receive feedback actions to execute an edge AI application energy-efficiently. As future work, we foresee the possibility to train, deploy and continuously improve a base model able to efficiently adapt the execution of edge applications.
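
    As a minimal illustration of the data collection loop described above, the sketch below defines an agent that samples platform metrics and exchanges them with a cloud endpoint for a feedback action; the endpoint URL, metric names and feedback format are all hypothetical, not part of the paper's framework.

    import json
    import time
    import urllib.request

    CLOUD_ENDPOINT = "http://edge-cloud.example.com/metrics"  # hypothetical URL

    def sample_platform_metrics():
        """Stand-in for reading real counters (power rails, frame rate, load)."""
        return {"timestamp": time.time(), "power_w": 12.4, "fps": 27.0}

    def send_and_get_feedback(metrics):
        """POST one metrics sample and return the cloud's feedback action."""
        req = urllib.request.Request(
            CLOUD_ENDPOINT,
            data=json.dumps(metrics).encode(),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())        # e.g. {"action": "lower_clock"}

    # Usage (requires a reachable endpoint):
    # feedback = send_and_get_feedback(sample_platform_metrics())
    # print("feedback action:", feedback.get("action"))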

    Fast fluorescence lifetime imaging and sensing via deep learning

    Error on title page – year of award is 2023. Fluorescence lifetime imaging microscopy (FLIM) has become a valuable tool in diverse disciplines. This thesis presents deep learning (DL) approaches to addressing two major challenges in FLIM: slow and complex data analysis and the high photon budget for precisely quantifying the fluorescence lifetimes. DL's ability to extract high-dimensional features from data has revolutionized optical and biomedical imaging analysis. This thesis contributes several novel DL FLIM algorithms that significantly expand FLIM's scope. Firstly, a hardware-friendly pixel-wise DL algorithm is proposed for fast FLIM data analysis. The algorithm has a simple architecture yet can effectively resolve multi-exponential decay models. The calculation speed and accuracy significantly outperform conventional methods. Secondly, a DL algorithm is proposed to improve FLIM image spatial resolution, obtaining high-resolution (HR) fluorescence lifetime images from low-resolution (LR) images. A computational framework is developed to generate large-scale semi-synthetic FLIM datasets to address the lack of sufficient high-quality FLIM datasets. This algorithm offers a practical approach to obtaining HR FLIM images quickly for FLIM systems. Thirdly, a DL algorithm is developed to analyze FLIM images with only a few photons per pixel, named the Few-Photon Fluorescence Lifetime Imaging (FPFLI) algorithm. FPFLI uses spatial correlation and intensity information to robustly estimate the fluorescence lifetime images, pushing the photon budget to a record-low level of only a few photons per pixel. Finally, a time-resolved flow cytometry (TRFC) system is developed by integrating an advanced CMOS single-photon avalanche diode (SPAD) array and a DL processor. The SPAD array, using a parallel light detection scheme, shows an excellent photon-counting throughput. A quantized convolutional neural network (QCNN) algorithm is designed and implemented on a field-programmable gate array as an embedded processor. The processor resolves fluorescence lifetimes against disturbing noise, showing unparalleled high accuracy, fast analysis speed, and low power consumption.
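
    For context on what per-pixel lifetime estimation involves, the sketch below simulates a mono-exponential decay histogram and applies the classical centre-of-mass estimator; this is a conventional baseline for illustration, not any of the thesis' DL algorithms, and the numbers are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    true_tau_ns = 2.5        # assumed fluorescence lifetime
    n_photons = 50           # a "few photons per pixel" regime
    bin_width_ns = 0.05
    n_bins = 256

    # Simulate photon arrival times for one pixel and build a TCSPC-like histogram.
    arrivals = rng.exponential(true_tau_ns, n_photons)
    hist, edges = np.histogram(arrivals, bins=n_bins, range=(0, n_bins * bin_width_ns))
    centres = edges[:-1] + bin_width_ns / 2

    # Centre-of-mass estimator: for an ideal exponential decay the mean arrival
    # time equals the lifetime (bias appears with truncation, noise and the IRF).
    tau_estimate = np.sum(hist * centres) / np.sum(hist)
    print(f"true tau = {true_tau_ns} ns, estimated tau = {tau_estimate:.2f} ns")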

    The CCSDS 123.0-B-2 Low-Complexity Lossless and Near-Lossless Multispectral and Hyperspectral Image Compression Standard: A comprehensive review

    The Consultative Committee for Space Data Systems (CCSDS) published the CCSDS 123.0-B-2, “Low-Complexity Lossless and Near-Lossless Multispectral and Hyperspectral Image Compression” standard. This standard extends the previous issue, CCSDS 123.0-B-1, which supported only lossless compression, while maintaining backward compatibility. The main novelty of the new issue is support for near-lossless compression, i.e., lossy compression with user-defined absolute and/or relative error limits in the reconstructed images. This new feature is achieved via closed-loop quantization of prediction errors. Two further additions arise from the new near-lossless support: first, the calculation of predicted sample values using sample representatives that may not be equal to the reconstructed sample values, and, second, a new hybrid entropy coder designed to provide enhanced compression performance for low-entropy data, prevalent when non-lossless compression is used. These new features enable significantly smaller compressed data volumes than those achievable with CCSDS 123.0-B-1 while controlling the quality of the decompressed images. As a result, larger amounts of valuable information can be retrieved given a set of bandwidth and energy consumption constraints.
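
    A simplified, illustrative view of the closed-loop quantization mechanism is sketched below: prediction residuals are quantized with a step derived from a user-defined absolute error limit, and the predictor is fed reconstructed samples so the error never accumulates. The trivial previous-sample predictor stands in for the standard's adaptive 3-D predictor, and the entropy coding stage is omitted.

    import numpy as np

    def near_lossless_encode_decode(samples, max_error):
        """Closed-loop quantization of prediction residuals with a bounded
        absolute reconstruction error (simplified illustration)."""
        step = 2 * max_error + 1                   # uniform quantizer step size
        recon = np.empty_like(samples)
        prev = 0                                   # predictor state: reconstructed value
        for i, s in enumerate(samples):
            residual = int(s) - prev
            q = int(np.round(residual / step))     # quantizer index (would be entropy coded)
            recon[i] = prev + q * step             # decoder-side reconstruction
            prev = int(recon[i])                   # close the loop on reconstructed data
        return recon

    data = np.array([100, 102, 107, 106, 140, 141, 139], dtype=np.int64)
    rec = near_lossless_encode_decode(data, max_error=2)
    assert np.max(np.abs(rec - data)) <= 2         # user-defined error limit holds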

    Evaluation and implementation of an auto-encoder for compression of satellite images in the ScOSA project

    The thesis evaluates the efficiency of various autoencoder neural networks for the compression of satellite imagery. The results highlight the evaluation and implementation of autoencoder architectures and the procedures required to deploy neural networks to reliable embedded devices. The developed autoencoders were evaluated targeting a ZYNQ 7020 FPGA (Field Programmable Gate Array) and a ZU7EV FPGA.
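
    A minimal sketch of what such a convolutional autoencoder could look like is shown below; the layer sizes and the 64x64 single-channel input are placeholders for illustration and do not reflect the architecture actually evaluated in the thesis.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Encoder: two strided convolutions compress the image into a small bottleneck.
    inputs = layers.Input(shape=(64, 64, 1))
    x = layers.Conv2D(16, 3, strides=2, padding="same", activation="relu")(inputs)
    code = layers.Conv2D(8, 3, strides=2, padding="same", activation="relu")(x)  # 16x16x8

    # Decoder: transposed convolutions reconstruct the image from the bottleneck.
    x = layers.Conv2DTranspose(8, 3, strides=2, padding="same", activation="relu")(code)
    x = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)

    autoencoder = tf.keras.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")  # trained to reproduce its input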

    Tiny Machine Learning Environment: Enabling Intelligence on Constrained Devices

    Running machine learning (ML) algorithms on constrained devices at the extreme edge of the network is problematic due to the computational overhead of ML algorithms, the resources available on the embedded platform, and the application budget (i.e., real-time requirements, power constraints, etc.). This has required the development of specific solutions and development tools for what is now referred to as TinyML. In this dissertation, we focus on improving the deployment and performance of TinyML applications, taking into consideration the aforementioned challenges, especially memory requirements. This dissertation contributed to the construction of the Edge Learning Machine (ELM) environment, a platform-independent open-source framework that provides three main TinyML services, namely shallow ML, self-supervised ML, and binary deep learning on constrained devices. In this context, this work includes the following steps, which are reflected in the thesis structure. First, we present a performance analysis of state-of-the-art shallow ML algorithms, including dense neural networks, implemented on mainstream microcontrollers. The comprehensive analysis in terms of algorithms, hardware platforms, datasets, preprocessing techniques, and configurations shows performance similar to that of a desktop machine and highlights the impact of these factors on overall performance. Second, despite the assumption that, given the scarcity of resources, TinyML only permits model inference, we have gone a step further and enabled self-supervised on-device training on microcontrollers and tiny IoT devices by developing the Autonomous Edge Pipeline (AEP) system. AEP achieves accuracy comparable to the typical TinyML paradigm, i.e., models trained on resource-abundant devices and then deployed on microcontrollers. Next, we present the development of a memory allocation strategy for convolutional neural network (CNN) layers that optimizes memory requirements. This approach reduces the memory footprint without affecting accuracy or latency. Moreover, e-skin systems share the main requirements of the TinyML field: enabling intelligence with low memory, low power consumption, and low latency. Therefore, we designed an efficient Tiny CNN architecture for e-skin applications. The architecture leverages the memory allocation strategy presented earlier and provides better performance than existing solutions. A major contribution of the thesis is CBin-NN, a library of functions for implementing extremely efficient binary neural networks on constrained devices. The library outperforms state-of-the-art NN deployment solutions by drastically reducing memory footprint and inference latency. All the solutions proposed in this thesis have been implemented on representative devices and tested in relevant applications, whose results are reported and discussed. The ELM framework is open source, and it is clearly becoming a useful, versatile toolkit for the IoT and TinyML research and development community.
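
    To illustrate why binary neural networks map so well to constrained hardware, the sketch below shows the standard XNOR/popcount formulation of a binarized dot product; it is a generic illustration and does not use or reproduce the CBin-NN API.

    import numpy as np

    def pack_bits(v):
        """Encode a {-1, +1} vector as packed bits (+1 -> 1, -1 -> 0)."""
        return np.packbits((v > 0).astype(np.uint8))

    def binary_dot(a_packed, b_packed, n):
        """Dot product of two {-1, +1} vectors of length n from packed bits."""
        xnor = ~(a_packed ^ b_packed)                 # bit set where the signs agree
        matches = int(np.unpackbits(xnor)[:n].sum())  # popcount over the valid bits
        return 2 * matches - n                        # agreements minus disagreements

    rng = np.random.default_rng(1)
    a = rng.choice([-1, 1], size=100)
    b = rng.choice([-1, 1], size=100)
    assert binary_dot(pack_bits(a), pack_bits(b), 100) == int(np.dot(a, b))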