1,711 research outputs found

    Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO

    Full text link
    The recent introduction of powerful embedded graphics processing units (GPUs) has allowed for unforeseen improvements in real-time computer vision applications. It has enabled algorithms to run onboard, well above the standard video rates, yielding not only higher information processing capability, but also reduced latency. This work focuses on the applicability of efficient low-level, GPU hardware-specific instructions to improve on existing computer vision algorithms in the field of visual-inertial odometry (VIO). While most steps of a VIO pipeline work on visual features, they rely on image data for detection and tracking, of which both steps are well suited for parallelization. Especially non-maxima suppression and the subsequent feature selection are prominent contributors to the overall image processing latency. Our work first revisits the problem of non-maxima suppression for feature detection specifically on GPUs, and proposes a solution that selects local response maxima, imposes spatial feature distribution, and extracts features simultaneously. Our second contribution introduces an enhanced FAST feature detector that applies the aforementioned non-maxima suppression method. Finally, we compare our method to other state-of-the-art CPU and GPU implementations, where we always outperform all of them in feature tracking and detection, resulting in over 1000fps throughput on an embedded Jetson TX2 platform. Additionally, we demonstrate our work integrated in a VIO pipeline achieving a metric state estimation at ~200fps.Comment: IEEE International Conference on Intelligent Robots and Systems (IROS), 2020. Open-source implementation available at https://github.com/uzh-rpg/vili

    A model-based design flow for embedded vision applications on heterogeneous architectures

    Get PDF
    The ability to gather information from images is straightforward to human, and one of the principal input to understand external world. Computer vision (CV) is the process to extract such knowledge from the visual domain in an algorithmic fashion. The requested computational power to process these information is very high. Until recently, the only feasible way to meet non-functional requirements like performance was to develop custom hardware, which is costly, time-consuming and can not be reused in a general purpose. The recent introduction of low-power and low-cost heterogeneous embedded boards, in which CPUs are combine with heterogeneous accelerators like GPUs, DSPs and FPGAs, can combine the hardware efficiency needed for non-functional requirements with the flexibility of software development. Embedded vision is the term used to identify the application of the aforementioned CV algorithms applied in the embedded field, which usually requires to satisfy, other than functional requirements, also non-functional requirements such as real-time performance, power, and energy efficiency. Rapid prototyping, early algorithm parametrization, testing, and validation of complex embedded video applications for such heterogeneous architectures is a very challenging task. This thesis presents a comprehensive framework that: 1) Is based on a model-based paradigm. Differently from the standard approaches at the state of the art that require designers to manually model the algorithm in any programming language, the proposed approach allows for a rapid prototyping, algorithm validation and parametrization in a model-based design environment (i.e., Matlab/Simulink). The framework relies on a multi-level design and verification flow by which the high-level model is then semi-automatically refined towards the final automatic synthesis into the target hardware device. 2) Relies on a polyglot parallel programming model. The proposed model combines different programming languages and environments such as C/C++, OpenMP, PThreads, OpenVX, OpenCV, and CUDA to best exploit different levels of parallelism while guaranteeing a semi-automatic customization. 3) Optimizes the application performance and energy efficiency through a novel algorithm for the mapping and scheduling of the application 3 tasks on the heterogeneous computing elements of the device. Such an algorithm, called exclusive earliest finish time (XEFT), takes into consideration the possible multiple implementation of tasks for different computing elements (e.g., a task primitive for CPU and an equivalent parallel implementation for GPU). It introduces and takes advantage of the notion of exclusive overlap between primitives to improve the load balancing. This thesis is the result of three years of research activity, during which all the incremental steps made to compose the framework have been tested on real case studie

    Latency and accuracy optimized mobile face detection

    Get PDF
    Abstract. Face detection is a preprocessing step in many computer vision applications. Important factors are accuracy, inference duration, and energy efficiency of the detection framework. Computationally light detectors that execute in real-time are a requirement for many application areas, such as face tracking and recognition. Typical operating platforms in everyday use are smartphones and embedded devices, which have limited computation capacity. The capability of face detectors is comparable to the ability of a human in easy detection tasks. When the conditions change, the challenges become different. Current challenges in face detection include atypically posed and tiny faces. Partially occluded faces and dim or bright environments pose challenges for detection systems. State-of-the-art performance in face detection research employs deep learning methods called neural networks, which loosely imitate the mammalian brain system. The most relevant technologies are convolutional neural networks, which are designed for local feature description. In this thesis, the main computational optimization approach is neural network quantization. The network models were delegated to digital signal processors and graphics processing units. Quantization was shown to reduce the latency of computation substantially. The most energy-efficient inference was achieved through digital signal processor delegation. Multithreading was used for inference acceleration. It reduced the amount of energy consumption per algorithm run.Latenssi- ja tarkkuusoptimoitu kasvontunnistus mobiililaitteilla. Tiivistelmä. Kasvojen ilmaisu on esikäsittelyvaihe monelle konenäön sovellukselle. Tärkeitä kasvoilmaisimen ominaisuuksia ovat tarkkuus, energiatehokkuus ja suoritusnopeus. Monet sovellukset vaativat laskennallisesti kevyitä ilmaisimia, jotka toimivat reaaliajassa. Esimerkkejä sovelluksista ovat kasvojen seuranta- ja tunnistusjärjestelmät. Yleisiä käyttöalustoja ovat älypuhelimet ja sulautetut järjestelmät, joiden laskentakapasiteetti on rajallinen. Kasvonilmaisimien tarkkuus vastaa ihmisen kykyä helpoissa ilmaisuissa. Nykyiset ongelmat kasvojen ilmaisussa liittyvät epätyypillisiin asentoihin ja erityisen pieniin kasvokokoihin. Myös kasvojen osittainen peittyminen, ja pimeät ja kirkkaat ympäristöt, vaikeuttavat ilmaisua. Neuroverkkoja käytetään tekoälyjärjestelmissä, joiden lähtökohtana on ollut mallintaa nisäkkäiden aivojen toimintaa. Konvoluutiopohjaiset neuroverkot ovat erikoistuneet paikallisten piirteiden analysointiin. Tässä opinnäytetyössä käytetty laskennallisen optimoinnin menetelmä on neuroverkkojen kvantisointi. Neuroverkkojen ajo delegoitiin digitaalisille signaalinkäsittely- ja grafiikkasuorittimille. Kvantisoinnin osoitettiin vähentävän laskenta-aikaa huomattavasti ja suurin energiatehokkuus saavutettiin digitaalisen signaaliprosessorin avulla. Suoritusnopeutta lisättiin monisäikeistyksellä, jonka havaittiin vähentävän energiankulutusta

    Parallelization Strategies for Modern Computing Platforms: Application to Illustrative Image Processing and Computer Vision Applications

    Get PDF
    RÉSUMÉ L’évolution spectaculaire des technologies dans le domaine du matériel et du logiciel a permis l’émergence des nouvelles plateformes parallèles très performantes. Ces plateformes ont marqué le début d’une nouvelle ère de la computation et il est préconisé qu’elles vont rester dans le domaine pour une bonne période de temps. Elles sont présentes déjà dans le domaine du calcul de haute performance (en anglais HPC, High Performance Computer) ainsi que dans le domaine des systèmes embarqués. Récemment, dans ces domaines le concept de calcul hétérogène a été adopté pour atteindre des performances élevées. Ainsi, plusieurs types de processeurs sont utilisés, dont les plus populaires sont les unités centrales de traitement ou CPU (de l’anglais Central Processing Unit) et les processeurs graphiques ou GPU (de l’anglais Graphics Processing Units). La programmation efficace pour ces nouvelles plateformes parallèles amène actuellement non seulement des opportunités mais aussi des défis importants pour les concepteurs. Par conséquent, l’industrie a besoin de l’appui de la communauté de recherche pour assurer le succès de ce nouveau changement de paradigme vers le calcul parallèle. Trois défis principaux présents pour les processeurs GPU massivement parallèles (ou “many-cores”) ainsi que pour les processeurs CPU multi-coeurs sont: (1) la sélection de la meilleure plateforme parallèle pour une application donnée, (2) la sélection de la meilleure stratégie de parallèlisation et (3) le réglage minutieux des performances (ou en anglais performance tuning) pour mieux exploiter les plateformes existantes. Dans ce contexte, l’objectif global de notre projet de recherche est de définir de nouvelles solutions pour aider à la programmation efficace des applications complexes sur les plateformes parallèles modernes. Les principales contributions à la recherche sont: 1. L’évaluation de l’efficacité d’accélération pour plusieurs plateformes parallèles, dans le cas des applications de calcul intensif. 2. Une analyse quantitative des stratégies de parallèlisation et implantation sur les plateformes à base de processeurs CPU multi-cœur ainsi que pour les plateformes à base de processeurs GPU massivement parallèles. 3. La définition et la mise en place d’une approche de réglage de performances (en Anglais performance tuning) pour les plateformes parallèles. Les contributions proposées ont été validées en utilisant des applications réelles illustratives et un ensemble varié de plateformes parallèles modernes.----------ABSTRACT With the technology improvement for both hardware and software, parallel platforms started a new computing era and they are here to stay. Parallel platforms may be found in High Performance Computers (HPC) or embedded computers. Recently, both HPC and embedded computers are moving toward heterogeneous computing platforms. They are employing both Central Processing Units (CPUs) and Graphics Processing Units (GPUs) to achieve the highest performance. Programming efficiently for parallel platforms brings new opportunities but also several challenges. Therefore, industry needs help from the research community to succeed in its recent dramatic shift to parallel computing. Parallel programing presents several major challenges. These challenges are equally present whether one programs on a many-core GPU or on a multi-core CPU. Three of the main challenges are: (1) Finding the best platform providing the required acceleration (2) Select the best parallelization strategy (3) Performance tuning to efficiently leverage the parallel platforms. In this context, the overall objective of our research is to propose a new solution helping designers to efficiently program complex applications on modern parallel architectures. The contributions of this thesis are: 1. The evaluation of the efficiency of several target parallel platforms to speedup compute-intensive applications. 2. The quantitative analysis for parallelization and implementation strategies on multicore CPUs and many-core GPUs. 3. The definition and implementation of a new performance tuning framework for heterogeneous parallel platforms. The contributions were validated using real computation intensive applications and modern parallel platform based on multi-core CPU and many-core GPU

    Status and Future Perspectives for Lattice Gauge Theory Calculations to the Exascale and Beyond

    Full text link
    In this and a set of companion whitepapers, the USQCD Collaboration lays out a program of science and computing for lattice gauge theory. These whitepapers describe how calculation using lattice QCD (and other gauge theories) can aid the interpretation of ongoing and upcoming experiments in particle and nuclear physics, as well as inspire new ones.Comment: 44 pages. 1 of USQCD whitepapers

    Relation Networks for Object Detection

    Full text link
    Although it is well believed for years that modeling relations between objects would help object recognition, there has not been evidence that the idea is working in the deep learning era. All state-of-the-art object detection systems still rely on recognizing object instances individually, without exploiting their relations during learning. This work proposes an object relation module. It processes a set of objects simultaneously through interaction between their appearance feature and geometry, thus allowing modeling of their relations. It is lightweight and in-place. It does not require additional supervision and is easy to embed in existing networks. It is shown effective on improving object recognition and duplicate removal steps in the modern object detection pipeline. It verifies the efficacy of modeling object relations in CNN based detection. It gives rise to the first fully end-to-end object detector
    • …
    corecore