
    Multi-Cue Event Information Fusion for Pedestrian Detection With Neuromorphic Vision Sensors

    Get PDF
    Neuromorphic vision sensors are bio-inspired cameras that naturally capture the dynamics of a scene with ultra-low latency, filtering out redundant information with low power consumption. Few works address object detection with this sensor. In this work, we develop pedestrian detectors that unlock the potential of event data by leveraging multi-cue information and different fusion strategies. To make the best of the event data, we introduce three event-stream encoding methods based on Frequency, Surface of Active Events (SAE) and Leaky Integrate-and-Fire (LIF). We further integrate them into state-of-the-art neural network architectures with two fusion approaches: channel-level fusion of the raw feature space and decision-level fusion of the probability assignments. We present a qualitative and quantitative explanation of why the different encoding methods are chosen for evaluating pedestrian detection and which method performs best. We demonstrate the advantages of decision-level fusion that leverages multi-cue event information and show that our approach performs well on a self-annotated event-based pedestrian dataset with 8,736 event frames. This work paves the way for more fascinating perception applications with neuromorphic vision sensors.
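
    The abstract names SAE as one of the encoding methods but does not spell out the computation. Below is a minimal sketch of a common Surface of Active Events encoding, assuming an event stream given as (x, y, t, polarity) rows with timestamps in seconds; the function name, the exponential decay and the time constant `tau` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sae_encode(events, height, width, tau=50e-3):
    """Encode an event stream into a Surface of Active Events (SAE) frame.

    `events` is assumed to be a NumPy array of (x, y, t, polarity) rows with
    positive timestamps `t` in seconds; the encoding used in the paper may differ.
    Each pixel stores an exponentially decayed "freshness" of its most recent
    event, so recently active pixels appear bright and stale pixels fade out.
    """
    sae = np.zeros((height, width), dtype=np.float64)   # last-event timestamp per pixel
    t_ref = events[-1, 2]                               # reference time = latest event
    for x, y, t, _p in events:
        sae[int(y), int(x)] = t                         # keep only the most recent timestamp
    mask = sae > 0                                      # pixels that received at least one event
    frame = np.zeros_like(sae)
    frame[mask] = np.exp(-(t_ref - sae[mask]) / tau)    # exponential temporal decay
    return frame                                        # values in (0, 1], 0 = no event
```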

    SODFormer: Streaming Object Detection with Transformer Using Events and Frames

    Full text link
    The DAVIS camera, which streams two complementary sensing modalities, asynchronous events and frames, has gradually been used to address major object detection challenges (e.g., fast motion blur and low light). However, how to effectively leverage rich temporal cues and fuse the two heterogeneous visual streams remains a challenging endeavor. To address this challenge, we propose a novel streaming object detector with Transformer, namely SODFormer, which first integrates events and frames to continuously detect objects in an asynchronous manner. Technically, we first build a large-scale multimodal neuromorphic object detection dataset (i.e., PKU-DAVIS-SOD) with over 1080.1k manual labels. Then, we design a spatiotemporal Transformer architecture that detects objects via an end-to-end sequence prediction problem, where the novel temporal Transformer module leverages rich temporal cues from the two visual streams to improve detection performance. Finally, an asynchronous attention-based fusion module is proposed to integrate the two heterogeneous sensing modalities and exploit their complementary advantages; it can be queried at any time to locate objects and breaks through the limited output frequency of synchronized frame-based fusion strategies. The results show that the proposed SODFormer outperforms four state-of-the-art methods and our eight baselines by a significant margin. We also show that our unifying framework works well even in cases where the conventional frame-based camera fails, e.g., high-speed motion and low-light conditions. Our dataset and code are available at https://github.com/dianzl/SODFormer. (18 pages, 15 figures, in IEEE Transactions on Pattern Analysis and Machine Intelligence)
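
    The asynchronous attention-based fusion module is described only at a high level. The sketch below is not the SODFormer module itself; it is a minimal cross-attention fusion of event and frame feature tokens in PyTorch, assuming frame tokens act as queries over event tokens, with hypothetical class and parameter names.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Illustrative cross-attention fusion of frame and event feature tokens.

    A simplified stand-in for an attention-based multimodal fusion block:
    the paper's module can additionally be queried at arbitrary times,
    which this sketch does not model.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens, event_tokens):
        # frame_tokens: (B, N_f, dim), event_tokens: (B, N_e, dim)
        fused, _ = self.attn(query=frame_tokens,
                             key=event_tokens,
                             value=event_tokens)
        return self.norm(frame_tokens + fused)   # residual connection + layer norm

# Example: fuse 100 frame tokens with 300 event tokens per sample
fusion = CrossModalFusion()
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 300, 256))
```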

    Event-based pedestrian detection using dynamic vision sensors

    Get PDF
    Pedestrian detection has attracted great research attention in video surveillance, traffic statistics, and especially autonomous driving. To date, almost all pedestrian detection solutions are derived from conventional frame-based image sensors, which have limited reaction speed and high data redundancy. The dynamic vision sensor (DVS), inspired by biological retinas, efficiently captures visual information with sparse, asynchronous events rather than dense, synchronous frames. It can eliminate redundant data transmission and avoid motion blur or data leakage in high-speed imaging applications. However, it is usually impractical to feed event streams directly into conventional object detection algorithms. To address this issue, we first propose a novel event-to-frame conversion method that integrates the inherent characteristics of events more efficiently. Moreover, we design an improved feature extraction network that can reuse intermediate features to further reduce the computational effort. We evaluate the performance of our proposed method on a custom dataset containing multiple real-world pedestrian scenes. The results indicate that our method improves pedestrian detection accuracy by about 5.6–10.8% and runs nearly 20% faster than previously reported methods. Furthermore, it achieves a processing speed of about 26 FPS and an AP of 87.43% when deployed on a single CPU, fully meeting the requirement of real-time detection.
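
    The abstract does not detail the proposed event-to-frame conversion, so as a point of reference, here is a minimal sketch of a simple baseline conversion: a polarity-separated event-count image over a fixed time window. The function name and the 0/1 polarity encoding are assumptions for illustration only.

```python
import numpy as np

def events_to_count_frame(events, height, width, t_start, t_end):
    """Accumulate events in [t_start, t_end) into a 2-channel count image.

    A simple baseline conversion (one channel per polarity), not the exact
    method of the paper. Polarity is assumed to be encoded as 0 (OFF) / 1 (ON).
    """
    frame = np.zeros((2, height, width), dtype=np.float32)
    for x, y, t, p in events:
        if t_start <= t < t_end:
            frame[int(p), int(y), int(x)] += 1.0   # count events per pixel and polarity
    peak = frame.max()
    return frame / peak if peak > 0 else frame     # normalise counts to [0, 1]
```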

    NU-AIR -- A Neuromorphic Urban Aerial Dataset for Detection and Localization of Pedestrians and Vehicles

    Full text link
    This paper presents an open-source aerial neuromorphic dataset that captures pedestrians and vehicles moving in an urban environment. The dataset, titled NU-AIR, features 70.75 minutes of event footage acquired with a 640 x 480 resolution neuromorphic sensor mounted on a quadrotor operating in an urban environment. Crowds of pedestrians, different types of vehicles, and street scenes featuring busy urban environments are captured at different elevations and illumination conditions. Manual bounding box annotations of the vehicles and pedestrians contained in the recordings are provided at a frequency of 30 Hz, yielding 93,204 labels in total. The dataset's fidelity is evaluated through a comprehensive ablation study of three Spiking Neural Networks (SNNs) and the training of ten Deep Neural Networks (DNNs), validating the quality and reliability of both the dataset and the corresponding annotations. All data and the Python code to voxelize the data and subsequently train SNNs/DNNs have been open-sourced. (20 pages, 5 figures)
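
    The released tooling voxelizes the events before SNN/DNN training; its exact implementation is not given in the abstract. A common voxel-grid representation, shown as a minimal sketch below, splits each event's polarity contribution linearly between the two nearest temporal bins. The function name, bin count and linear weighting are assumptions and may differ from the NU-AIR code.

```python
import numpy as np

def voxelize(events, height, width, num_bins=5):
    """Build a (num_bins, H, W) voxel grid from (x, y, t, polarity) events.

    Each event's polarity (+1/-1) is shared linearly between the two temporal
    bins nearest to its normalised timestamp, a common input format for
    training SNNs/DNNs on event data.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 2]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    for (x, y, _t, p), tb in zip(events, t_norm):
        lo = int(np.floor(tb))
        hi = min(lo + 1, num_bins - 1)
        w_hi = tb - lo                                   # weight of the upper bin
        pol = 1.0 if p > 0 else -1.0
        grid[lo, int(y), int(x)] += pol * (1.0 - w_hi)   # share with the lower bin
        grid[hi, int(y), int(x)] += pol * w_hi           # share with the upper bin
    return grid
```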

    Making sense of neuromorphic event data for human action recognition

    Get PDF
    Neuromorphic vision sensors provide low-power sensing and capture salient spatial-temporal events. The majority of existing neuromorphic sensing work focuses on object detection. However, since these sensors only record events, they provide an efficient signal domain for privacy-aware surveillance tasks. This paper explores how neuromorphic vision sensor data streams can be analysed for human action recognition, which is a challenging application. The proposed method is based on handcrafted features. It consists of a pre-processing step that removes noisy events, followed by the extraction of handcrafted local and global feature vectors corresponding to the underlying human action. The local features are extracted by considering a set of high-order descriptive statistics of the spatio-temporal events in a time-window slice, while the global features are extracted by considering the frequencies of occurrence of the temporal event sequences. Low-complexity classifiers, such as support vector machines (SVMs) and K-Nearest Neighbours (KNNs), are then trained on these feature vectors. The method is evaluated on three groups of datasets: emulator-based, re-recording-based and native NVS-based. It outperforms existing methods in human action recognition accuracy by 0.54%, 19.3%, and 25.61% on the E-KTH, E-UCF11 and E-HMDB51 datasets, respectively. This paper also reports results for three further datasets, E-UCF50, R-UCF50, and N-Actions, which are reported for the first time for human action recognition in the neuromorphic vision sensor domain.
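
    The abstract describes high-order statistics over time-window slices feeding a low-complexity classifier, but not the specific descriptors. The sketch below is a minimal illustration of that pipeline: the chosen statistics (mean, standard deviation, skewness, kurtosis of the x, y, t coordinates) and the function names are assumptions, not necessarily the paper's feature set.

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.svm import SVC

def window_features(events):
    """High-order descriptive statistics of one time-window slice of events.

    `events` is assumed to be an (N, 4) array of (x, y, t, polarity) rows;
    the statistics below are an illustrative choice of local descriptors.
    """
    feats = []
    for col in range(3):                       # x, y, t columns
        v = events[:, col].astype(np.float64)
        feats += [v.mean(), v.std(), skew(v), kurtosis(v)]
    return np.array(feats)

def train_action_classifier(windows, labels):
    """Train a low-complexity classifier (here an SVM) on per-window features."""
    X = np.stack([window_features(w) for w in windows])
    clf = SVC(kernel="rbf")
    clf.fit(X, labels)
    return clf
```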

    Event-based neuromorphic stereo vision

    Full text link

    A multimodal interface for interactive collaboration with quadruped robots

    Get PDF
    A variety of approaches for hand gesture recognition have been proposed, with most recent interest directed towards different deep learning methods. The modalities on which these approaches are based most commonly range from imaging sensors to inertial measurement units (IMUs) and electromyography (EMG) sensors. EMG and IMUs allow detection of gestures without being affected by the line of sight or lighting conditions. The detection algorithms are fairly well established, but their application to real-world use cases is limited, apart from prostheses and exoskeletons. In this thesis, a multimodal interface for human-robot interaction (HRI) is developed for quadruped robots. The interface is based on a combination of two detection algorithms: one for detecting gestures based on surface electromyography (sEMG) and IMU signals, and the other for detecting the operator using visible-light and depth cameras. Multiple architectures for gesture detection are compared, where the best regression performance on offline multi-user data was achieved by a hybrid of a convolutional neural network (CNN) and a long short-term memory (LSTM) network, with a mean squared error (MSE) of 4.7 · 10⁻³ on the normalised gestures. A person-following behaviour is implemented for a quadruped robot, which is controlled using the predefined gestures. The complete interface is evaluated online by one expert user two days after recording the last samples of the training data. The gesture detection system achieved an F-score of 0.95 for the gestures alone, and 0.90 when unrecognised attempts due to other technological aspects, such as disturbances in Bluetooth data transmission, are included. The system reached online performance levels comparable to those reported for offline sessions and online sessions with real-time visual feedback. While the current interface was successfully deployed to the robot, further work should aim at improving inter-subject performance and the reliability of wireless communication between the devices.
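
    The thesis reports a CNN + LSTM hybrid as the best gesture regressor; the abstract does not fix the layer sizes, channel count or window length. The following is a minimal PyTorch sketch of such a hybrid, with all hyperparameters chosen purely for illustration.

```python
import torch
import torch.nn as nn

class CNNLSTMRegressor(nn.Module):
    """Minimal CNN + LSTM hybrid for gesture regression from sEMG/IMU windows.

    Channel count, window length and layer sizes are illustrative assumptions.
    Input: (batch, channels, time); output: one value per normalised gesture
    dimension, trained with a mean-squared-error loss.
    """
    def __init__(self, in_channels=14, hidden=64, out_dim=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x):                       # x: (B, C, T)
        feats = self.cnn(x).transpose(1, 2)     # (B, T, 64) for the LSTM
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])               # last hidden state -> gesture values

# Example: 8 windows of 14 channels and 200 samples each;
# training would minimise nn.MSELoss() against the normalised gesture targets.
model = CNNLSTMRegressor()
pred = model(torch.randn(8, 14, 200))
```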