
    Real-Time Unsupervised Object Localization on the Edge for Airport Video Surveillance

    Object localization is vital in computer vision for object detection and classification. Typically, this task is performed on expensive GPU devices, but edge computing is gaining importance in real-time applications. In this work, we propose a real-time implementation of unsupervised object localization on a low-power device for airport video surveillance. We automatically find object regions in video using a region proposal network (RPN) together with an optical flow region proposal (OFRP) based on optical flow maps between frames. In addition, we study the deployment of our solution on an embedded architecture, a Jetson AGX Xavier, simultaneously using the CPU, GPU, and dedicated hardware accelerators. Three data representations (FP32, FP16 and INT8) are evaluated for the RPN. The results show that these optimizations reduce energy consumption by up to 4.1× and execution time by up to 2.2× while maintaining accuracy close to the baseline model. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
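    The optical flow region proposal (OFRP) idea lends itself to a short sketch: pixels with significant motion between consecutive frames are grouped into candidate object boxes. The following is a minimal approximation assuming OpenCV; the function name, thresholds, and morphology step are illustrative choices, not the paper's implementation.

```python
# Minimal sketch of an optical-flow region proposal (OFRP): candidate object
# regions are derived from motion between consecutive frames. Names and
# thresholds are illustrative, not taken from the paper.
import cv2
import numpy as np

def optical_flow_proposals(prev_gray, curr_gray, mag_thresh=2.0, min_area=100):
    # Dense optical flow between consecutive grayscale frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Per-pixel motion magnitude.
    mag = np.linalg.norm(flow, axis=2)
    # Binary mask of "moving" pixels, cleaned up with a morphological close.
    mask = (mag > mag_thresh).astype(np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))
    # Connected components become candidate object regions.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]
```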

    QR code detection under ROS implemented on the GPU

    This master's thesis deals with the design and implementation of a QR code detection algorithm under the ROS platform, with computations running on a graphics processing unit. Based on a survey of currently available tools and techniques, a suitable approach is chosen and the algorithm is written as a Python module that integrates easily into ROS. The OpenCL parallel computing platform is used to run the computation on multi-core hardware such as graphics cards or multi-core CPUs.
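    The overall shape of such a node can be sketched as below. Since the thesis's own OpenCL module is not reproduced here, this stand-in uses OpenCV's built-in QRCodeDetector on the CPU; the node and topic names are assumptions, not the thesis's configuration.

```python
# Simplified stand-in for the detector described above: OpenCV's built-in
# QRCodeDetector wrapped in a ROS subscriber. The actual work offloads the
# detection kernels via OpenCL; topic and node names here are assumptions.
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge
import cv2

bridge = CvBridge()
detector = cv2.QRCodeDetector()

def on_image(msg):
    # Convert the ROS image message to an OpenCV BGR frame.
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding='bgr8')
    data, points, _ = detector.detectAndDecode(frame)
    if data:
        rospy.loginfo("QR payload: %s at %s", data,
                      points.reshape(-1, 2).tolist())

rospy.init_node('qr_detector')
rospy.Subscriber('/camera/image_raw', Image, on_image, queue_size=1)
rospy.spin()
```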

    Latency and accuracy optimized mobile face detection

    Face detection is a preprocessing step in many computer vision applications. Important factors are the accuracy, inference latency, and energy efficiency of the detection framework. Computationally light detectors that run in real time are a requirement for many application areas, such as face tracking and recognition. Typical operating platforms in everyday use are smartphones and embedded devices, which have limited computation capacity. In easy detection tasks, the capability of face detectors is comparable to that of a human, but the challenges change with the conditions: atypically posed and tiny faces, partially occluded faces, and dim or bright environments all make detection harder. State-of-the-art face detection employs deep neural networks, which loosely imitate the mammalian brain, and in particular convolutional neural networks, which are designed for local feature description. In this thesis, the main computational optimization approach is neural network quantization. The network models were delegated to digital signal processors and graphics processing units. Quantization was shown to reduce computation latency substantially, and the most energy-efficient inference was achieved through digital signal processor delegation. Multithreading was used to accelerate inference and reduced the energy consumed per algorithm run.
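    The two main optimizations, post-training quantization and multithreaded inference, can be sketched with TensorFlow Lite as below. The model path and thread count are placeholders; the thesis's exact converter settings and delegate choices are not reproduced here.

```python
# Hedged sketch of the thesis's main optimizations: post-training
# quantization and multithreaded TFLite inference. 'face_detector/' is a
# placeholder path, not the thesis's model.
import tensorflow as tf

# Post-training dynamic-range quantization of a saved face detector.
converter = tf.lite.TFLiteConverter.from_saved_model('face_detector/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Multithreaded CPU inference; on platforms that provide one, a GPU or DSP
# delegate would be passed via `experimental_delegates` instead.
interpreter = tf.lite.Interpreter(model_content=tflite_model, num_threads=4)
interpreter.allocate_tensors()
```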

    High performance video processing in cloud data centres

    Mobile phones and affordable cameras are generating large amounts of video data. This data holds information about many activities and incidents, and video analytics systems have been introduced to extract valuable information from it. However, most of these systems are expensive, require human supervision, and are time consuming, and human involvement also raises the probability of extracting inaccurate information. We address these challenges by proposing a cloud-based high-performance video analytics platform that minimizes human intervention, reduces computation time, and enables the processing of a large number of video streams. It achieves high performance by optimizing the occupancy of GPU resources in the cloud and minimizing data transfer through the concurrent processing of many video streams. The platform is evaluated in three stages. The first evaluation was performed at the cloud level to assess scalability, covering the fetching and distribution of video streams and the efficient utilization of available resources within the cloud. The second evaluation was performed at the individual cloud nodes, measuring the occupancy level, the effect of data transfer, and the extent of concurrency achieved at each node. The third evaluation was performed at the frame level to determine the performance of object recognition algorithms; for this, the compute-intensive tasks of the Local Binary Pattern (LBP) algorithm were ported to the GPU resources. The platform proved to be highly scalable, with high throughput and performance when tested on a large number of video streams and an increasing number of nodes.
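    For reference, the basic 3x3 LBP operator that the platform offloads looks as follows. This is a NumPy reference formulation, not the paper's GPU port; replacing `np` with CuPy would give one possible drop-in GPU version.

```python
# Reference implementation of the basic 3x3 LBP operator described above.
# Each pixel's code is built by comparing it to its eight neighbours.
import numpy as np

def lbp_3x3(img):
    c = img[1:-1, 1:-1]                       # centre value of each pixel
    # Eight neighbours in a fixed clockwise order.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        # Shifted view of the image aligned with the centre window.
        nb = img[1 + dy:img.shape[0] - 1 + dy,
                 1 + dx:img.shape[1] - 1 + dx]
        code |= ((nb >= c).astype(np.uint8) << bit)
    return code
```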

    Navigating the Landscape for Real-time Localisation and Mapping for Robotics, Virtual and Augmented Reality

    Visual understanding of 3D environments in real time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, and virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM by supporting application specialists in selecting and configuring the appropriate algorithm, hardware, and compilation pathway to meet their performance, accuracy, and energy consumption goals. The major contributions we present are (1) tools and a methodology for the systematic quantitative evaluation of SLAM algorithms, (2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives, (3) end-to-end simulation tools enabling the optimisation of heterogeneous, accelerated architectures for the specific requirements of the various SLAM algorithmic approaches, and (4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context. Comment: Proceedings of the IEEE 201
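    The multi-objective flavour of contribution (2) can be illustrated with a toy Pareto filter over candidate SLAM configurations, keeping only those not dominated on all objectives at once. The configuration names and numbers below are invented for illustration; the project's real tooling measures error, runtime, and energy empirically.

```python
# Toy illustration of multi-objective design-space exploration: keep the
# Pareto-optimal configurations over (error, runtime_s, energy_j), all to be
# minimised. Tuples are invented example data.
def pareto_front(configs):
    # configs: list of (name, (error, runtime_s, energy_j)) pairs.
    front = []
    for name, m in configs:
        # m is dominated if some other point is <= in every objective
        # and differs in at least one.
        dominated = any(all(o <= v for o, v in zip(om, m)) and om != m
                        for _, om in configs)
        if not dominated:
            front.append((name, m))
    return front

print(pareto_front([('fastA', (0.09, 0.012, 1.1)),
                    ('accB',  (0.03, 0.050, 2.4)),
                    ('badC',  (0.10, 0.060, 3.0))]))
# -> fastA and accB survive; badC is dominated by fastA on all objectives.
```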

    A framework for efficient execution of applications on GPU and CPU+GPU

    Technological limitations faced by semiconductor manufacturers in the early 2000s halted the growth in performance of sequential computation units. The current trend is to increase the number of processor cores per socket and to progressively use GPU cards for highly parallel computations. The complexity of recent architectures makes it difficult to statically predict the performance of a program. We describe a reliable and accurate method for predicting the execution time of parallel loop nests on GPUs, based on three stages: static code generation, offline profiling, and online prediction. In addition, we present two techniques to fully exploit the computing resources available on a system. The first consists in jointly using the CPU and GPU to execute a code; to achieve high performance, load balance must be considered, in particular by predicting execution times. The runtime uses the profiling results, and the scheduler computes execution times and adjusts the load distributed to the processors. The second technique puts the CPU and GPU in competition: instances of the considered code are executed simultaneously on the CPU and GPU, and the winner notifies the other instance of its completion, causing the latter to terminate.
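    The competition scheme can be sketched with two racing processes, as below. This is a minimal sketch under our own structure, not the thesis's runtime: the kernel functions are placeholders standing in for real CPU and GPU implementations, and they are expected to poll the stop event and abort early when the rival wins.

```python
# Hedged sketch of the CPU/GPU "competition": the same kernel instance runs
# on both backends at once; the first to finish signals the other to stop.
# Kernels must be picklable top-level functions that poll `stop_event`.
import multiprocessing as mp

def run_backend(name, kernel, stop_event, result_q):
    out = kernel(stop_event)          # aborts early (returns None) if signalled
    if out is not None and not stop_event.is_set():
        stop_event.set()              # declare victory, ask the rival to stop
        result_q.put((name, out))

def race(cpu_kernel, gpu_kernel):
    stop, results = mp.Event(), mp.Queue()
    procs = [mp.Process(target=run_backend, args=(n, k, stop, results))
             for n, k in (('cpu', cpu_kernel), ('gpu', gpu_kernel))]
    for p in procs:
        p.start()
    winner = results.get()            # blocks until one backend finishes
    for p in procs:
        p.join()
    return winner                     # ('cpu' or 'gpu', result)
```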

    Parallel programming models and tools for many-core platforms

    The trade-off between power consumption, performance, programmability, and portability drives all computing industry designs, particularly in the mobile and embedded systems domains. Two design paradigms have proven especially promising in this context: architectural heterogeneity and many-core processors. Parallel programming models are key to effectively harnessing the computational power of heterogeneous many-core SoCs. This thesis presents a set of techniques and HW/SW extensions that improve performance and simplify programmability for heterogeneous many-core platforms. Its contributions cover the entire software stack for many-core platforms, from hardware abstraction layers running on top of bare metal to programming models, and from hardware extensions for efficient parallelism support to middleware enabling optimized resource management within many-core platforms. First, we present mechanisms to decrease parallelism overheads in parallel programming runtimes for many-core platforms, targeting fine-grain parallelism. Second, we present programming model support that enables the offload of computational kernels within heterogeneous many-core systems. Third, we present a novel approach to dynamically sharing and managing many-core platforms when multiple applications coded with different programming models execute concurrently. All these contributions were validated on STMicroelectronics STHORM, a real embodiment of a state-of-the-art many-core system. Hardware extensions and architectural options were explored using VirtualSoC, a SystemC-based cycle-accurate simulator of many-core platforms.
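    The offload pattern from the second contribution can be illustrated in miniature: the host hands a computational kernel to the accelerator asynchronously and overlaps other work until the result is needed. In this toy sketch a process pool stands in for the many-core fabric; nothing here is the thesis's STHORM programming-model API.

```python
# Toy illustration of kernel offload: the host submits a kernel to an
# "accelerator" (a process pool standing in for the many-core fabric) and
# collects the result later via a future.
from concurrent.futures import ProcessPoolExecutor

def saxpy(a, x, y):
    # The computational kernel the host wants to offload.
    return [a * xi + yi for xi, yi in zip(x, y)]

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=16) as fabric:
        fut = fabric.submit(saxpy, 2.0, [1.0, 2.0], [3.0, 4.0])  # async offload
        # The host is free to do other work here, then joins on the result.
        print(fut.result())   # -> [5.0, 8.0]
```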

    Smile Recognition Implementation on Embedded Platforms

    In this work, we focus on the real-time development of a smile recognition system on low-resource computational devices using deep learning algorithms, which could readily be extended to related applications. We primarily used the Looking at People (LAP) dataset for training and testing various neural network architectures. Images in this dataset were first pre-processed by cropping around the facial area and aligning the face. Six pre-trained deep learning network architectures were then fine-tuned for this task. The fine-tuned models were deployed on Nvidia's embedded platform, using an asynchronous design to provide a smoother frame rate through parallelization and multithreading. The accuracy and speed of these models were measured, letting us compare them and choose the most suitable ones for this task. Our research shows that modern low-complexity architectures can almost match the performance of older or bulkier ones.
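    The asynchronous capture/inference split can be sketched as below: one thread grabs frames while the main thread runs the classifier, so camera I/O and inference overlap. `classify_smile` is a placeholder for the fine-tuned network, not the thesis's model, and the camera index is an assumption.

```python
# Sketch of the asynchronous design: a capture thread feeds a small bounded
# queue, and the main thread consumes frames for classification, keeping
# camera I/O and inference overlapped.
import queue
import threading
import cv2

frames = queue.Queue(maxsize=2)   # small buffer keeps latency bounded

def capture(cam_index=0):
    cap = cv2.VideoCapture(cam_index)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        try:
            frames.put_nowait(frame)   # drop frames rather than lag behind
        except queue.Full:
            pass

def classify_smile(frame):
    return 0.0                    # placeholder score from the fine-tuned CNN

threading.Thread(target=capture, daemon=True).start()
while True:
    print('smile score:', classify_smile(frames.get()))
```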