6 research outputs found
Learning Spatial-Temporal Implicit Neural Representations for Event-Guided Video Super-Resolution
Event cameras sense the intensity changes asynchronously and produce event
streams with high dynamic range and low latency. This has inspired research
endeavors utilizing events to guide the challenging video super-resolution (VSR)
task. In this paper, we make the first attempt to address a novel problem of
achieving VSR at random scales by taking advantage of the high temporal
resolution property of events. This is hampered by the difficulties of
representing the spatial-temporal information of events when guiding VSR. To
this end, we propose a novel framework that incorporates the spatial-temporal
interpolation of events into VSR in a unified manner. Our key idea is to learn
implicit neural representations from queried spatial-temporal coordinates and
features from both RGB frames and events. Our method contains three parts.
Specifically, the Spatial-Temporal Fusion (STF) module first learns the 3D
features from events and RGB frames. Then, the Temporal Filter (TF) module
unlocks more explicit motion information from the events near the queried
timestamp and generates the 2D features. Lastly, the Spatial-Temporal Implicit
Representation (STIR) module recovers the SR frame at arbitrary resolutions
from the outputs of these two modules. In addition, we collect a real-world
dataset with spatially aligned events and RGB frames. Extensive experiments
show that our method significantly surpasses the prior arts and achieves VSR
with random scales, e.g., 6.5. Code and dataset are available at
https://vlis2022.github.io/cvpr23/egvsr. Comment: Accepted by CVPR 2023
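To make the coordinate-querying idea concrete, below is a minimal Python sketch of decoding an RGB value from a fused event/frame feature and a queried (x, y, t) coordinate; the class name, feature dimension, and MLP depth are illustrative assumptions, not the paper's actual STIR architecture.

import torch
import torch.nn as nn

class ImplicitSTDecoder(nn.Module):
    # Hypothetical sketch: map a fused event/frame feature plus a normalized
    # (x, y, t) query coordinate to an RGB value; decoding a denser coordinate
    # grid than the input yields super-resolution at arbitrary scales.
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, feats, coords):
        # feats: (N, feat_dim) features sampled at the queries
        # coords: (N, 3) normalized (x, y, t) query coordinates
        return self.mlp(torch.cat([feats, coords], dim=-1))

Querying such a decoder on, say, a 6.5x denser grid than the input is what realizes VSR at a random, non-integer scale.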
Data-efficient Event Camera Pre-training via Disentangled Masked Modeling
In this paper, we present a new data-efficient voxel-based self-supervised
learning method for event cameras. Our pre-training overcomes the limitations
of previous methods, which either sacrifice temporal information by converting
event sequences into 2D images for utilizing pre-trained image models or
directly employ paired image data for knowledge distillation to enhance the
learning of event streams. In order to make our pre-training data-efficient, we
first design a semantic-uniform masking method to address the learning
imbalance caused by the varying reconstruction difficulties of different
regions in non-uniform data when using random masking. In addition, we ease the
traditional hybrid masked modeling process by explicitly decomposing it into
two branches, namely local spatio-temporal reconstruction and global semantic
reconstruction, to encourage the encoder to capture local correlations and
global semantics, respectively. This decomposition allows our self-supervised
learning method to converge faster with minimal pre-training data. Compared to
previous approaches, our self-supervised learning method does not rely on
paired RGB images, yet enables simultaneous exploration of spatial and temporal
cues at multiple scales. It exhibits excellent generalization performance and
demonstrates significant improvements across various tasks with fewer
parameters and lower computational costs.
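As a rough illustration of the decomposed objective, the sketch below combines a local masked-reconstruction loss with a global semantic-regression loss; the module interfaces and the MSE targets are assumptions made for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def disentangled_loss(encoder, local_head, global_head,
                      voxels, mask, semantic_target):
    # voxels: (B, N, D) event voxel tokens; mask: (B, N), True where masked
    visible = voxels * (~mask).unsqueeze(-1).float()
    latent = encoder(visible)                       # (B, N, C) token features
    # Branch 1: local spatio-temporal reconstruction of the masked tokens
    local_loss = F.mse_loss(local_head(latent)[mask], voxels[mask])
    # Branch 2: regress a global semantic summary from pooled features
    global_loss = F.mse_loss(global_head(latent.mean(dim=1)), semantic_target)
    return local_loss + global_loss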
Time Lens: Event-based Video Frame Interpolation
State-of-the-art frame interpolation methods generate intermediate frames by inferring object motions in the image from consecutive key-frames. In the absence of additional information, first-order approximations, i.e., optical flow, must be used, but this choice restricts the types of motions that can be modeled, leading to errors in highly dynamic scenarios. Event cameras are novel sensors that address this limitation by providing auxiliary visual information in the blind-time between frames. They asynchronously measure per-pixel brightness changes and do this with high temporal resolution and low latency. Event-based frame interpolation methods typically adopt a synthesis-based approach, where predicted frame residuals are directly applied to the key-frames. However, while these approaches can capture non-linear motions, they suffer from ghosting and perform poorly in low-texture regions with few events. Thus, synthesis-based and flow-based approaches are complementary. In this work, we introduce Time Lens, a novel method that leverages the advantages of both. We extensively evaluate our method on three synthetic and two real benchmarks where we show an up to 5.21 dB improvement in terms of PSNR over state-of-the-art frame-based and event-based methods. Finally, we release a new large-scale dataset in highly dynamic scenarios, aimed at pushing the limits of existing methods.
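The complementarity of the two approaches can be illustrated with a toy fusion head that blends a flow-based (warped) candidate and a synthesis-based candidate with a learned per-pixel mask; this is a hypothetical sketch, not the actual Time Lens network.

import torch
import torch.nn as nn

class BlendFusion(nn.Module):
    # Hypothetical sketch: predict a per-pixel weight so the network can
    # lean on the warped candidate in low-texture regions with few events
    # and on the synthesized candidate under non-linear motion.
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, warped, synthesized):
        # warped, synthesized: (B, 3, H, W) candidate intermediate frames
        alpha = self.net(torch.cat([warped, synthesized], dim=1))
        return alpha * warped + (1 - alpha) * synthesized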
Aggregating Long-term Sharp Features via Hybrid Transformers for Video Deblurring
Video deblurring methods, aiming at recovering consecutive sharp frames from
a given blurry video, usually assume that the input video suffers from
consecutively blurry frames. However, in real-world blurry videos taken by
modern imaging devices, sharp frames usually appear in the given video, thus
making temporal long-term sharp features available for facilitating the
restoration of a blurry frame. In this work, we propose a video deblurring
method that leverages both neighboring frames and sharp frames present in the video, using
hybrid Transformers for feature aggregation. Specifically, we first train a
blur-aware detector to distinguish between sharp and blurry frames. Then, a
window-based local Transformer is employed for exploiting features from
neighboring frames, where cross attention is beneficial for aggregating
features from neighboring frames without explicit spatial alignment. To
aggregate long-term sharp features from detected sharp frames, we utilize a
global Transformer with multi-scale matching capability. Moreover, our method
can easily be extended to event-driven video deblurring by incorporating an
event fusion module into the global Transformer. Extensive experiments on
benchmark datasets demonstrate that our proposed method outperforms
state-of-the-art video deblurring methods as well as event-driven video
deblurring methods in terms of quantitative metrics and visual quality. The
source code and trained models are available at
https://github.com/shangwei5/STGTN. Comment: 13 pages, 11 figures
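For intuition, the snippet below shows cross-attention pulling features from neighboring frames into the blurry frame's tokens without explicit spatial alignment; the token shapes and dimensions are assumptions, and the paper's hybrid local/global Transformers are considerably richer.

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4)

def aggregate(center_tokens, neighbor_tokens):
    # center_tokens: (L, B, 64) tokens of the blurry frame (queries)
    # neighbor_tokens: (Ln, B, 64) tokens of neighboring frames (keys/values)
    out, _ = attn(center_tokens, neighbor_tokens, neighbor_tokens)
    return center_tokens + out  # residual aggregation

center = torch.randn(16, 1, 64)
neighbors = torch.randn(48, 1, 64)  # e.g., tokens gathered from three frames
fused = aggregate(center, neighbors)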
Neuromorphic Sampling of Signals in Shift-Invariant Spaces
Neuromorphic sampling is a paradigm shift in analog-to-digital conversion
where the acquisition strategy is opportunistic and measurements are recorded
only when there is a significant change in the signal. Neuromorphic sampling
has given rise to a new class of event-based sensors called dynamic vision
sensors or neuromorphic cameras. The neuromorphic sampling mechanism utilizes
low power and provides high-dynamic range sensing with low latency and high
temporal resolution. The measurements are sparse and have low redundancy, making
them convenient for downstream tasks. In this paper, we present a
sampling-theoretic perspective on neuromorphic sensing of continuous-time
signals. We establish a close connection between neuromorphic sampling and
time-based sampling, where signals are encoded temporally. We analyse
neuromorphic sampling of signals in shift-invariant spaces, in particular,
bandlimited signals and polynomial splines. We present an iterative technique
for perfect reconstruction subject to the events satisfying a density
criterion. We also provide necessary and sufficient conditions for perfect
reconstruction. Owing to practical limitations in meeting the sufficient
conditions for perfect reconstruction, we extend the analysis to approximate
reconstruction from sparse events. In the latter setting, we pose signal
reconstruction as a continuous-domain linear inverse problem whose solution can
be obtained by solving an equivalent finite-dimensional convex optimization
program using a variable-splitting approach. We demonstrate the performance of
the proposed algorithm and validate our claims via experiments on synthetic
signals.
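The link to time-based sampling can be made concrete with a toy send-on-delta sampler: an event is recorded only when the signal deviates from the last recorded value by a contrast threshold. This sketch only illustrates the acquisition model; the paper's reconstruction machinery for shift-invariant spaces is not shown, and the test signal and threshold are arbitrary choices.

import numpy as np

def send_on_delta(t, x, delta):
    # Record (time, value) events whenever the signal moves by at least
    # delta from the last recorded value; the measurements are sparse and
    # concentrate where the signal changes.
    events, ref = [(t[0], x[0])], x[0]
    for ti, xi in zip(t[1:], x[1:]):
        if abs(xi - ref) >= delta:
            ref = xi
            events.append((ti, xi))
    return events

t = np.linspace(0.0, 1.0, 1000)
x = np.sin(2 * np.pi * 3 * t)  # a bandlimited test signal
ev = send_on_delta(t, x, delta=0.1)
print(len(ev), "events from", len(t), "uniform samples")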
Neuromorphic Synergy for Video Binarization
Bimodal objects, such as the checkerboard pattern used in camera calibration,
markers for object tracking, and text on road signs, are
prevalent in our daily lives and serve as a visual form to embed information
that can be easily recognized by vision systems. While binarization from
intensity images is crucial for extracting the embedded information in the
bimodal objects, few previous works consider the task of binarizing images
blurred by relative motion between the vision sensor and the environment. Such
blur reduces binarization quality and thus degrades downstream applications in
which the vision system is in motion. Recently introduced neuromorphic cameras
offer new capabilities for alleviating motion blur, but it is non-trivial to
first deblur and then binarize the images in real time. In this work, we
propose an event-based binary
reconstruction method that leverages the prior knowledge of the bimodal
target's properties to perform inference independently in both event space and
image space and merge the results from both domains to generate a sharp binary
image. We also develop an efficient integration method to propagate this binary
image to high frame rate binary video. Finally, we develop a novel method to
naturally fuse events and images for unsupervised threshold identification. The
proposed method is evaluated on publicly available datasets and our collected
data sequences, and the results show that it outperforms state-of-the-art
methods in generating high frame rate binary video in real time on CPU-only
devices.
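To illustrate the propagation step in spirit, the sketch below flips pixels of an initial sharp binary image once enough signed events accumulate at a location; the flip rule and the flip_count threshold are illustrative assumptions, not the paper's actual integration or unsupervised threshold-identification method.

import numpy as np

def propagate_binary(binary0, events, flip_count=3):
    # binary0: (H, W) initial sharp binary image with values in {0, 1}
    # events: iterable of (x, y, polarity) tuples, polarity in {-1, +1}
    state = binary0.copy()
    acc = np.zeros(binary0.shape, dtype=np.int32)
    for x, y, p in events:
        acc[y, x] += p
        if acc[y, x] >= flip_count:      # sustained brightening: pixel -> white
            state[y, x], acc[y, x] = 1, 0
        elif acc[y, x] <= -flip_count:   # sustained darkening: pixel -> black
            state[y, x], acc[y, x] = 0, 0
    return state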