    Event Transformer+. A multi-purpose solution for efficient event data processing

    Event cameras record sparse illumination changes with high temporal resolution and high dynamic range. Thanks to their sparse recording and low power consumption, they are increasingly used in applications such as AR/VR and autonomous driving. Current top-performing methods often ignore specific event-data properties, leading to generic but computationally expensive algorithms, while event-aware methods do not perform as well. We propose Event Transformer+, which improves our seminal work EvT with a refined patch-based event representation and a more robust backbone to achieve more accurate results, while still benefiting from event-data sparsity to increase its efficiency. Additionally, we show how our system can work with different data modalities and propose specific output heads for event-stream predictions (i.e., action recognition) and per-pixel predictions (dense depth estimation). Evaluation results show better performance than the state of the art while requiring minimal computation resources, both on GPU and CPU.
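
    As a concrete illustration of the patch-based event representation mentioned above, the sketch below accumulates raw (x, y, t, polarity) events into sparse patch tokens. The patch size, number of temporal bins, sensor resolution, and 0/1 polarity encoding are assumptions made for the example, not details taken from the paper.

        import numpy as np

        def events_to_patch_tokens(events, sensor_hw=(180, 240), patch=12, bins=4):
            """Accumulate (x, y, t, polarity) events into per-patch polarity/time histograms.

            Only patches that received at least one event become tokens, so the token
            count scales with event sparsity rather than with image size. Illustrative
            sketch only; not the Event Transformer+ implementation.
            """
            H, W = sensor_hw
            x = events[:, 0].astype(int)
            y = events[:, 1].astype(int)
            t = events[:, 2].astype(float)
            p = events[:, 3].astype(int)            # polarity assumed encoded as 0/1
            # Assign each event to one of `bins` temporal slices of the window.
            t_bin = np.minimum((bins * (t - t.min()) / (np.ptp(t) + 1e-9)).astype(int), bins - 1)
            # Patch index of each event on a (H/patch) x (W/patch) grid.
            patch_id = (y // patch) * (W // patch) + (x // patch)
            tokens = {}
            for pid, tb, pol in zip(patch_id, t_bin, p):
                feat = tokens.setdefault(pid, np.zeros((2, bins), dtype=np.float32))
                feat[pol, tb] += 1.0                # count events per polarity and time bin
            ids = np.array(sorted(tokens))
            feats = np.stack([tokens[i].ravel() for i in ids])   # (active_patches, 2 * bins)
            return ids, feats

    The resulting token set, one feature vector per active patch, is the kind of sparse input a transformer backbone can attend over without processing empty regions of the sensor.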

    CarSNN: An Efficient Spiking Neural Network for Event-Based Autonomous Cars on the Loihi Neuromorphic Research Processor

    Autonomous Driving (AD) features provide new forms of mobility that also benefit other kinds of intelligent and autonomous systems, such as robots, smart transportation, and smart industries. For these applications, decisions need to be made quickly and in real time. Moreover, in the quest for electric mobility, this task must follow a low-power policy without significantly reducing the autonomy of the vehicle or robot. These two challenges can be tackled with emerging Spiking Neural Networks (SNNs). When deployed on specialized neuromorphic hardware, SNNs can achieve high performance with low latency and low power consumption. In this paper, we use an SNN connected to an event-based camera to address one of the key problems in AD, i.e., distinguishing cars from other objects. To consume less power than traditional frame-based cameras, we use a Dynamic Vision Sensor (DVS). The experiments follow an offline supervised learning rule, after which the learnt SNN model is mapped onto the Intel Loihi Neuromorphic Research Chip. Our best experiment achieves 86% accuracy in the offline implementation, which drops to 83% when ported onto the Loihi Chip. The neuromorphic hardware implementation has a maximum latency of 0.72 ms per sample and consumes only 310 mW. To the best of our knowledge, this work is the first implementation of an event-based car classifier on a neuromorphic chip.
    Comment: Accepted for publication at IJCNN 202
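
    To make the event-camera-to-SNN pipeline concrete, here is a minimal, generic sketch (plain NumPy, not the paper's architecture or the Loihi toolchain) of binning DVS events into two-channel frames and passing them through a single leaky integrate-and-fire layer. The sensor size, time window, decay factor, and threshold are illustrative assumptions.

        import numpy as np

        def events_to_frames(events, hw=(128, 128), window_us=1000):
            """Bin (x, y, t_us, polarity in {0, 1}) DVS events into 2-channel frames,
            one frame per `window_us` microseconds."""
            H, W = hw
            t0 = events[:, 2].min()
            idx = ((events[:, 2] - t0) // window_us).astype(int)
            frames = np.zeros((idx.max() + 1, 2, H, W), dtype=np.float32)
            for (x, y, _, p), f in zip(events, idx):
                frames[f, int(p), int(y), int(x)] += 1.0
            return frames

        def lif_layer(frames, weights, beta=0.9, threshold=1.0):
            """Run a leaky integrate-and-fire layer over the frame sequence.
            Membrane potentials decay by `beta`, spike at `threshold`, then reset."""
            mem = np.zeros(weights.shape[0])
            spikes = []
            for frame in frames:
                cur = weights @ frame.ravel()       # synaptic input current
                mem = beta * mem + cur              # leaky integration
                out = (mem >= threshold).astype(np.float32)
                mem = mem * (1.0 - out)             # reset neurons that fired
                spikes.append(out)
            return np.stack(spikes)                 # (T, num_neurons) output spike trains

    A rate-coded classifier would then count output spikes per class neuron over the sequence and pick the argmax; the actual CarSNN topology, training procedure, and Loihi mapping are more involved than this sketch.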

    Graph-Based Spatio-Temporal Feature Learning for Neuromorphic Vision Sensing

    Neuromorphic vision sensing (NVS) devices represent visual information as sequences of asynchronous discrete events (a.k.a. “spikes”) in response to changes in scene reflectance. Unlike conventional active pixel sensing (APS), NVS allows for significantly higher event sampling rates at substantially increased energy efficiency and robustness to illumination changes. However, feature representation for NVS is far behind its APS-based counterparts, resulting in lower performance on high-level computer vision tasks. To fully utilize its sparse and asynchronous nature, we propose a compact graph representation for NVS, which allows for end-to-end learning with graph convolution neural networks. We couple this with a novel end-to-end feature learning framework that accommodates both appearance-based and motion-based tasks. The core of our framework comprises a spatial feature learning module, which utilizes residual-graph convolutional neural networks (RG-CNN), for end-to-end learning of appearance-based features directly from graphs. We extend this with our proposed Graph2Grid block and temporal feature learning module for efficiently modelling temporal dependencies over multiple graphs and a long temporal extent. We show how our framework can be configured for object classification, action recognition and action similarity labeling. Importantly, our approach preserves the spatial and temporal coherence of spike events, while requiring less computation and memory. The experimental validation shows that our proposed framework outperforms all recent methods on standard datasets. Finally, to address the absence of large real-world NVS datasets for complex recognition tasks, we introduce, evaluate and make available an American Sign Language letters dataset (ASL-DVS), as well as human action datasets (UCF101-DVS, HMDB51-DVS and ASLAN-DVS).
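
    The compact graph representation described above can be illustrated with a small sketch that turns an event stream into node positions, node features, and an edge list usable by graph convolution libraries. The radius, time scaling, and sub-sampling limit below are assumed values for the example, not the paper's settings.

        import numpy as np

        def events_to_graph(events, radius=3.0, time_scale=1e-4, max_events=512):
            """Build a spatio-temporal radius graph from (x, y, t_us, polarity) events.

            Nodes are (sub-sampled) events, node features are polarities, and an edge
            links two events whose normalised spatio-temporal distance is below
            `radius`. Sketch only; the paper's exact construction may differ.
            """
            if len(events) > max_events:
                keep = np.random.choice(len(events), max_events, replace=False)
                events = events[keep]
            # Scale timestamps so that time is commensurate with pixel coordinates.
            pos = np.stack([events[:, 0], events[:, 1], events[:, 2] * time_scale], axis=1)
            feat = events[:, 3:4].astype(np.float32)              # polarity as node feature
            dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
            src, dst = np.nonzero((dist < radius) & (dist > 0))   # undirected edges, no self-loops
            edge_index = np.stack([src, dst])                     # (2, num_edges), PyG-style layout
            return pos, feat, edge_index

    Because only events become nodes and only nearby events are connected, the graph stays small for sparse scenes, which is what lets graph convolutions operate without rasterising the whole sensor.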

    Optical flow estimation using the Fisher-Rao metric

    The optical flow in an event camera is estimated using measurements in the Address Event Representation (AER). Each measurement consists of a pixel address and the time at which a change in the pixel value equalled a given fixed threshold. The measurements in a small region of the pixel array and within a given window in time are approximated by a probability distribution defined on a finite set. The distributions obtained in this way form a three-dimensional family parameterized by the pixel addresses and by time. Each parameter value has an associated Fisher-Rao matrix obtained from the Fisher-Rao metric for the parameterized family of distributions. The optical flow vector at a given pixel and at a given time is obtained from the eigenvector of the associated Fisher-Rao matrix with the least eigenvalue. The Fisher-Rao algorithm for estimating optical flow is tested on eight datasets, of which six have ground-truth optical flow. It is shown that the Fisher-Rao algorithm performs well in comparison with two state-of-the-art algorithms for estimating optical flow from AER measurements.
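
    In symbols (notation introduced here for illustration, not quoted from the paper): let p(z; \theta) with \theta = (x, y, t) be the local distribution over the finite event set, so that the Fisher-Rao matrix is the 3x3 Fisher information matrix

        G_{ij}(\theta) \;=\; \sum_{z} p(z;\theta)\,
            \frac{\partial \log p(z;\theta)}{\partial \theta_i}\,
            \frac{\partial \log p(z;\theta)}{\partial \theta_j},
        \qquad \theta = (x, y, t),\quad i, j \in \{1, 2, 3\}.

    If v = (v_x, v_y, v_t) is the eigenvector of G(\theta) with the smallest eigenvalue, i.e. the space-time direction along which the local distributions change least, then reading the flow estimate at (x, y, t) as (v_x / v_t, v_y / v_t), assuming v_t \neq 0, is our gloss on the abstract rather than a formula taken from the paper.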

    Gesture recognition with an event camera using Deep Learning techniques

    Event cameras are bio-inspired vision sensors whose output reports changes in the light intensity of the scene instead of the standard RGB images of traditional cameras. These cameras offer major advantages: a very high dynamic range, no motion blur, and a processing latency of microseconds. These characteristics make them a very promising technology for applications in fields such as robotics or video surveillance. In particular, this work studies how to use this type of camera for action recognition, since these cameras can potentially capture motion at high speed and under low illumination. There is still little work on this application of event technology, so the project focuses on the following points:
    - Verifying how standard image-based action recognition methods behave on event-camera data.
    - Proposing methods or improvements over these traditional methods, both in the event representation and in the event processing.
    - Testing action recognition in scenarios where traditional cameras struggle to obtain meaningful information.

    The project first studies how this technology works and the existing data and systems for action recognition, using either event data or conventional images. A system for action recognition from event information is then designed, implemented and evaluated. Its main stages are the following. Encoding of the events into frames, where two representations proposed in the recent literature are evaluated: event representation by fixed time windows and by fixed event counts. A classifier that predicts the action given one or more event frames; two variants of the system are implemented, one whose classifier evaluates frames individually and another that classifies groups of frames. Pre-processing and post-processing of the frames: two pre-processing strategies are proposed before the classifier, together with two post-processing methods for a more robust final prediction, a simple consensus across frames and a weighted consensus intended to adapt better to the evolution of the motion.

    Different configurations of these stages are evaluated on a simple neural network model, established as a baseline architecture to obtain results quickly and draw conclusions. The best-performing configuration is then evaluated more exhaustively with a much more complex architecture, a ResNet50V2, to better gauge the potential of the results. As the main result of this work, a system adapted to event data is proposed that improves performance over standard procedures for conventional image processing. In particular, it is concluded that the event-count representation is more robust and performs better than the time-based one, because it does not exhibit inconsistencies in the execution of movements, and that it is essential to incorporate information about the evolution of the motion.
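
    As a rough illustration of the two event-frame encodings and the weighted-consensus post-processing described above, the following NumPy sketch may help; the window sizes, event counts, and linear weight ramp are illustrative assumptions, not the thesis' exact parameters.

        import numpy as np

        def frames_by_time(events, hw=(128, 128), window_us=10000):
            """Time-based encoding: one frame per fixed time window."""
            H, W = hw
            t0 = events[:, 2].min()
            idx = ((events[:, 2] - t0) // window_us).astype(int)
            frames = np.zeros((idx.max() + 1, H, W), dtype=np.float32)
            for (x, y, _, _), f in zip(events, idx):
                frames[f, int(y), int(x)] += 1.0
            return frames

        def frames_by_count(events, hw=(128, 128), events_per_frame=2000):
            """Event-count encoding: one frame per fixed number of events."""
            H, W = hw
            n = int(np.ceil(len(events) / events_per_frame))
            frames = np.zeros((n, H, W), dtype=np.float32)
            for i, (x, y, _, _) in enumerate(events):
                frames[i // events_per_frame, int(y), int(x)] += 1.0
            return frames

        def weighted_consensus(per_frame_probs, weights=None):
            """Fuse per-frame class probabilities; a linear ramp gives later frames
            (more developed motion) a larger say in the final prediction."""
            probs = np.asarray(per_frame_probs, dtype=np.float32)
            if weights is None:
                weights = np.linspace(0.5, 1.0, len(probs))
            fused = (weights[:, None] * probs).sum(axis=0)
            return int(np.argmax(fused))

    The count-based encoding produces frames with a comparable amount of information regardless of how fast the motion is, which matches the thesis' conclusion that it is the more robust of the two representations.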

    Human Action Recognition from Various Data Modalities: A Review

    Human Action Recognition (HAR), aiming to understand human behaviors and then assign category labels, has a wide range of applications and has thus been attracting increasing attention in the field of computer vision. Generally, human actions can be represented using various data modalities, such as RGB, skeleton, depth, infrared sequences, point clouds, event streams, audio, acceleration, radar, and WiFi, which encode different sources of useful yet distinct information and have various advantages and application scenarios. Consequently, many existing works have investigated different types of approaches for HAR using various modalities. In this paper, we give a comprehensive survey of HAR from the perspective of the input data modalities. Specifically, we review both the hand-crafted feature-based and deep learning-based methods for single data modalities, and also review the methods based on multiple modalities, including the fusion-based frameworks and the co-learning-based approaches. The current benchmark datasets for HAR are also introduced. Finally, we discuss some potentially important research directions in this area.