End-to-End Learning of Representations for Asynchronous Event-Based Data
Event cameras are vision sensors that record asynchronous streams of
per-pixel brightness changes, referred to as "events". They have appealing
advantages over frame-based cameras for computer vision, including high
temporal resolution, high dynamic range, and no motion blur. Due to the sparse,
non-uniform spatiotemporal layout of the event signal, pattern recognition
algorithms typically aggregate events into a grid-based representation and
subsequently process it by a standard vision pipeline, e.g., Convolutional
Neural Network (CNN). In this work, we introduce a general framework to convert
event streams into grid-based representations through a sequence of
differentiable operations. Our framework has two main advantages: (i) it
allows learning the input event representation together with the task-dedicated
network in an end-to-end manner, and (ii) it lays out a taxonomy that unifies
the majority of extant event representations in the literature and identifies
novel ones. Empirically, we show that our approach to learning the event
representation end-to-end yields an improvement of approximately 12% on optical
flow estimation and object recognition over state-of-the-art methods. Comment: To appear at ICCV 201
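To make the grid-based conversion concrete, the sketch below accumulates events into a voxel grid using a bilinear (triangular) temporal kernel. This is one fixed instance of the differentiable operations the framework's taxonomy covers; in the paper the kernel itself can be learned end-to-end. The function name and array layout are illustrative, not the authors' API.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events (rows of x, y, t, polarity) into a voxel grid.

    A bilinear (triangular) temporal kernel splits each event's polarity
    between its two nearest temporal bins, so every contribution is a
    differentiable function of the event timestamp.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # Normalize timestamps to the continuous bin coordinate [0, num_bins - 1]
    t_norm = (num_bins - 1) * (t - t.min()) / max(t.max() - t.min(), 1e-9)
    left = np.floor(t_norm).astype(int)
    frac = t_norm - left
    right = np.clip(left + 1, 0, num_bins - 1)
    # np.add.at performs unbuffered accumulation, so repeated pixels add up
    np.add.at(grid, (left, y, x), p * (1.0 - frac))
    np.add.at(grid, (right, y, x), p * frac)
    return grid
```

Because the bilinear weights of each event sum to one, the total mass of the grid equals the sum of event polarities, which is a convenient sanity check.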
Event Encryption: Rethinking Privacy Exposure for Neuromorphic Imaging
Bio-inspired neuromorphic cameras sense illumination changes on a per-pixel
basis and generate spatiotemporal streaming events within microseconds in
response, offering visual information with high temporal resolution over a high
dynamic range. Such devices often serve in surveillance systems due to their
applicability and robustness in environments with high dynamics and strong or
weak lighting, where they can still supply clearer recordings than traditional
imaging. In other words, when it comes to privacy-relevant cases, neuromorphic
cameras also expose more sensitive data and thus pose serious security threats.
Therefore, asynchronous event streams also necessitate careful encryption
before transmission and usage. This letter discusses several potential attack
scenarios and approaches event encryption from the perspective of neuromorphic
noise removal, in which we inversely introduce well-crafted noise into raw
events until they are obfuscated. Evaluations show that the encrypted events
effectively protect information from attacks based on low-level visual
reconstruction and high-level neuromorphic reasoning, and thus offer dependable
privacy protection. Our solution advances the security of event data and paves
the way toward strongly encrypted, privacy-protective neuromorphic imaging.
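The letter's method crafts noise from a neuromorphic denoising perspective; as a much simpler illustration of the same keyed-noise idea, the sketch below injects synthetic events generated from a secret seed, so only a key holder can regenerate and strip them. All names, parameters, and the exact noise model here are hypothetical, not the paper's scheme.

```python
import numpy as np

def encrypt_events(events, key, noise_ratio=2.0, height=128, width=128):
    """Obfuscate an event stream (rows of x, y, t, polarity) by mixing in
    keyed synthetic noise events drawn from a seeded PRNG."""
    rng = np.random.default_rng(key)
    n_noise = int(len(events) * noise_ratio)
    t_lo, t_hi = events[:, 2].min(), events[:, 2].max()
    noise = np.column_stack([
        rng.integers(0, width, n_noise),           # x
        rng.integers(0, height, n_noise),          # y
        rng.uniform(t_lo, t_hi, n_noise),          # timestamp
        rng.choice([-1.0, 1.0], n_noise),          # polarity
    ])
    mixed = np.vstack([events, noise])
    return mixed[np.argsort(mixed[:, 2])]          # re-sort by timestamp

def decrypt_events(mixed, key, noise_ratio=2.0, height=128, width=128):
    """Regenerate the keyed noise and filter it out of the mixed stream."""
    rng = np.random.default_rng(key)
    n_real = int(round(len(mixed) / (1.0 + noise_ratio)))
    n_noise = len(mixed) - n_real
    t_lo, t_hi = mixed[:, 2].min(), mixed[:, 2].max()
    noise = np.column_stack([
        rng.integers(0, width, n_noise),
        rng.integers(0, height, n_noise),
        rng.uniform(t_lo, t_hi, n_noise),
        rng.choice([-1.0, 1.0], n_noise),
    ])
    # Exact float timestamps make accidental collisions with real events
    # vanishingly unlikely in this toy setting.
    noise_set = {tuple(row) for row in noise}
    return np.array([row for row in mixed if tuple(row) not in noise_set])
```

Without the key, the injected events are statistically indistinguishable from real activity, which is the intuition behind obfuscating raw streams before transmission.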
E-CLIP: Towards Label-efficient Event-based Open-world Understanding by CLIP
Contrastive Language-Image Pre-training (CLIP) has recently shown promising
open-world and few-shot performance on 2D image-based recognition tasks.
However, the transferability of CLIP to novel event camera data
remains under-explored. In particular, due to the modality gap with the
image-text data and the lack of large-scale datasets, achieving this goal is
non-trivial and thus requires significant research innovation. In this paper,
we propose E-CLIP, a novel and effective framework that unleashes the potential
of CLIP for event-based recognition to compensate for the lack of large-scale
event-based datasets. Our work addresses two crucial challenges: 1) how to
generalize CLIP's visual encoder to event data while fully leveraging events'
unique properties, e.g., sparsity and high temporal resolution; 2) how to
effectively align the multi-modal embeddings, i.e., image, text, and events. To
this end, we first introduce a novel event encoder that subtly models the
temporal information from events and meanwhile generates event prompts to
promote the modality bridging. We then design a text encoder that generates
content prompts and utilizes hybrid text prompts to enhance E-CLIP's
generalization ability across diverse datasets. With the proposed event
encoder, text encoder, and original image encoder, a novel Hierarchical Triple
Contrastive Alignment (HTCA) module is introduced to jointly optimize the
correlation and enable efficient knowledge transfer among the three modalities.
We conduct extensive experiments on two recognition benchmarks, and the results
demonstrate that our E-CLIP outperforms existing methods by large margins of
+3.94% and +4.62% on the N-Caltech dataset in the fine-tuning and few-shot
settings, respectively. Moreover, our E-CLIP can be flexibly extended to the
event retrieval task using either text or image queries, showing plausible
performance. Comment: Journal version with supplementary material
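The abstract does not spell out the HTCA module, but its core ingredient, contrastive alignment across modality pairs, can be sketched as symmetric InfoNCE losses between image, text, and event embeddings. The sketch below is a flat (non-hierarchical) simplification in plain NumPy; function names and the pairwise sum are assumptions, not the paper's exact formulation.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings, where
    row i of `a` and row i of `b` are the matching (positive) pair."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def xent(l):
        # Cross-entropy with the diagonal entries as the correct classes
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))

def triple_contrastive_loss(img, txt, evt):
    """Jointly align three modalities by summing the pairwise losses."""
    return info_nce(img, txt) + info_nce(img, evt) + info_nce(txt, evt)
```

Minimizing this sum pulls matching image, text, and event embeddings together while pushing mismatched batch entries apart, which is the knowledge-transfer mechanism the abstract describes.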
Evaluating Spiking Neural Network On Neuromorphic Platform For Human Activity Recognition
Energy efficiency and low latency are crucial requirements for designing
wearable AI-empowered human activity recognition systems, due to the hard
constraints of battery operations and closed-loop feedback. While neural
network models have been extensively compressed to match the stringent edge
requirements, spiking neural networks and event-based sensing are recently
emerging as promising solutions to further improve performance due to their
inherent energy efficiency and capacity to process spatiotemporal data in very
low latency. This work aims to evaluate the effectiveness of spiking neural
networks on neuromorphic processors in human activity recognition for wearable
applications. The case of workout recognition with wrist-worn wearable motion
sensors is used as a case study. A multi-threshold delta modulation approach is
used to encode the input sensor data into spike trains, moving the pipeline
into the event-based domain. The spike trains are then fed to a
spiking neural network with direct-event training, and the trained model is
deployed on the research neuromorphic platform from Intel, Loihi, to evaluate
energy and latency efficiency. Test results show that the spike-based workout
recognition system achieves accuracy (87.5%) comparable to a traditional neural
network running on the popular milliwatt RISC-V-based multi-core processor GAP8
(88.1%), while achieving a two times better energy-delay product
(0.66 µJ·s vs. 1.32 µJ·s).
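The multi-threshold delta modulation encoding can be sketched as follows: each threshold gets an UP and a DOWN spike channel that fires whenever the signal drifts that far from the channel's last reconstruction level. This is an illustrative simplification (at most one spike per channel per sample), with names and default thresholds assumed rather than taken from the paper.

```python
import numpy as np

def delta_modulation_encode(signal, thresholds=(0.05, 0.1, 0.2)):
    """Encode a 1-D sensor signal into spike trains via multi-threshold
    delta modulation. Output columns alternate UP/DOWN per threshold."""
    n_thr = len(thresholds)
    spikes = np.zeros((len(signal), 2 * n_thr), dtype=np.uint8)
    level = np.full(n_thr, float(signal[0]))  # per-threshold tracking level
    for t, x in enumerate(signal):
        for i, thr in enumerate(thresholds):
            if x - level[i] >= thr:        # signal rose past threshold
                spikes[t, 2 * i] = 1       # UP spike
                level[i] += thr
            elif level[i] - x >= thr:      # signal fell past threshold
                spikes[t, 2 * i + 1] = 1   # DOWN spike
                level[i] -= thr
    return spikes
```

Smaller thresholds yield denser, more precise spike trains; larger ones yield sparser, cheaper channels, which is the trade-off that makes this encoding attractive for low-power neuromorphic processing.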
Self-supervised Event-based Monocular Depth Estimation using Cross-modal Consistency
An event camera is a novel vision sensor that can capture per-pixel
brightness changes and output a stream of asynchronous ``events''. It has
advantages over conventional cameras in scenes with high-speed motion and
challenging lighting conditions because of its high temporal resolution,
high dynamic range, low bandwidth, low power consumption, and lack of motion
blur. Therefore, several supervised approaches to monocular depth estimation
from events have been proposed to address scenes that are difficult for
conventional cameras. However, depth
annotation is costly and time-consuming. In this paper, to lower the annotation
cost, we propose a self-supervised event-based monocular depth estimation
framework named EMoDepth. EMoDepth constrains the training process using the
cross-modal consistency from intensity frames that are aligned with events in
the pixel coordinate. Moreover, in inference, only events are used for
monocular depth prediction. Additionally, we design a multi-scale
skip-connection architecture to effectively fuse features for depth estimation
while maintaining high inference speed. Experiments on MVSEC and DSEC datasets
demonstrate that our contributions are effective and that the accuracy can
outperform existing supervised event-based and unsupervised frame-based
methods. Comment: Accepted by IROS202
Event-driven Vision and Control for UAVs on a Neuromorphic Chip
Event-based vision sensors achieve up to three orders of magnitude better speed vs. power consumption trade-off in high-speed control of UAVs compared to conventional image sensors. Event-based cameras produce a sparse stream of events that can be processed more efficiently and with lower latency than images, enabling ultra-fast vision-driven control. Here, we explore how an event-based vision algorithm can be implemented as a spiking neural network on a neuromorphic chip and used in a drone controller. We show how seamless integration of event-based perception on chip leads to even faster control rates and lower latency. In addition, we demonstrate how online adaptation of the SNN controller can be realised using on-chip learning. Our spiking neural network on chip is the first example of a neuromorphic vision-based controller on chip solving a high-speed UAV control task. The excellent scalability of processing in neuromorphic hardware opens the possibility to solve more challenging visual tasks in the future and integrate visual perception into fast control loops.
DH-PTAM: A Deep Hybrid Stereo Events-Frames Parallel Tracking And Mapping System
This paper presents a robust approach for a visual parallel tracking and
mapping (PTAM) system that excels in challenging environments. Our proposed
method combines the strengths of heterogeneous multi-modal visual sensors,
including stereo event-based and frame-based sensors, in a unified reference
frame through a novel spatio-temporal synchronization of stereo visual frames
and stereo event streams. We employ deep learning-based feature extraction and
description to further enhance estimation robustness. We also introduce an
end-to-end parallel tracking and mapping optimization layer complemented by a
simple loop-closure algorithm for efficient SLAM behavior. Through
comprehensive experiments on both small-scale and large-scale real-world
sequences of VECtor and TUM-VIE benchmarks, our proposed method (DH-PTAM)
demonstrates superior performance compared to state-of-the-art methods in terms
of robustness and accuracy in adverse conditions. Our implementation's
research-based Python API is publicly available on GitHub for further research
and development: https://github.com/AbanobSoliman/DH-PTAM. Comment: Submitted for publication in IEEE RA-