FlowLens: Seeing Beyond the FoV via Flow-guided Clip-Recurrent Transformer
Limited by hardware cost and system size, a camera's Field-of-View (FoV) is not
always satisfactory. However, from a spatio-temporal perspective, information
beyond the camera's physical FoV is readily available and can in fact be obtained
"for free" from the past. In this paper, we propose a novel task termed
Beyond-FoV Estimation, which aims to exploit past visual cues to bidirectionally
break through the physical FoV of a camera. We put forward a FlowLens
architecture that expands the FoV by propagating features explicitly via
optical flow and implicitly via a novel clip-recurrent transformer, which has
two appealing features: 1) FlowLens comprises a newly proposed Clip-Recurrent
Hub with 3D-Decoupled Cross Attention (DDCA) to progressively process global
information accumulated in the temporal dimension. 2) A multi-branch Mix Fusion
Feed Forward Network (MixF3N) is integrated to enhance the spatially-precise
flow of local features. To foster training and evaluation, we establish
KITTI360-EX, a dataset for outer- and inner-FoV expansion. Extensive
experiments on both video inpainting and beyond-FoV estimation tasks show that
FlowLens achieves state-of-the-art performance. Code will be made publicly
available at https://github.com/MasterHow/FlowLens.
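The clip-recurrent hub described above caches features accumulated from past clips and lets the current clip query them through cross attention. As a rough illustration of that query/memory pattern (this is not the paper's actual DDCA, which further decouples attention over the 3D spatio-temporal volume; the shapes and helper names here are invented for the sketch):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, memory, d_k):
    # query:  (Nq, d_k) tokens of the current clip
    # memory: (Nm, d_k) tokens cached from past clips
    scores = query @ memory.T / np.sqrt(d_k)   # (Nq, Nm) similarity
    return softmax(scores) @ memory            # (Nq, d_k) aggregated context

rng = np.random.default_rng(0)
d = 8
current = rng.normal(size=(4, d))   # features of the clip being expanded
past = rng.normal(size=(16, d))     # features accumulated over time
out = cross_attention(current, past, d)
print(out.shape)  # (4, 8)
```

The point of the pattern is that each current-clip token mixes in information from the temporal memory, which is how content outside the current FoV can be recovered "for free" from earlier frames.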
Facial Video-based Remote Physiological Measurement via Self-supervised Learning
Facial video-based remote physiological measurement aims to estimate remote
photoplethysmography (rPPG) signals from human face videos and then measure
multiple vital signs (e.g. heart rate, respiration frequency) from rPPG
signals. Recent approaches achieve it by training deep neural networks, which
normally require abundant facial videos and synchronously recorded
photoplethysmography (PPG) signals for supervision. However, the collection of
these annotated corpora is not easy in practice. In this paper, we introduce a
novel frequency-inspired self-supervised framework that learns to estimate rPPG
signals from facial videos without the need for ground-truth PPG signals. Given
a video sample, we first augment it into multiple positive/negative samples
which contain similar/dissimilar signal frequencies to the original one.
Specifically, positive samples are generated using spatial augmentation.
Negative samples are generated via a learnable frequency augmentation module,
which performs non-linear signal frequency transformation on the input without
excessively changing its visual appearance. Next, we introduce a local rPPG
expert aggregation module to estimate rPPG signals from augmented samples. It
encodes complementary pulsation information from different face regions and
aggregates it into a single rPPG prediction. Finally, we propose a series of
frequency-inspired losses, i.e. frequency contrastive loss, frequency ratio
consistency loss, and cross-video frequency agreement loss, for the
optimization of estimated rPPG signals from multiple augmented video samples
and across temporally neighboring video samples. We conduct rPPG-based heart
rate, heart rate variability and respiration frequency estimation on four
standard benchmarks. The experimental results demonstrate that our method
improves the state of the art by a large margin.
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
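To make the frequency-inspired supervision concrete: the losses above compare the dominant frequencies of rPPG estimates from augmented and neighboring samples, e.g. enforcing a known frequency ratio between an input and its frequency-augmented negative. A minimal sketch of extracting a dominant frequency and checking such a ratio, using synthetic sinusoids in place of real rPPG estimates (the 30 fps rate, signals, and `dominant_frequency` helper are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def dominant_frequency(signal, fs):
    # Return the strongest non-DC frequency component of a 1-D signal, in Hz.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin

fs = 30.0                      # assumed 30 fps facial video
t = np.arange(0, 10, 1 / fs)   # 10-second window
rppg = np.sin(2 * np.pi * 1.2 * t)   # ~72 bpm pulse signal
neg = np.sin(2 * np.pi * 1.8 * t)    # frequency-shifted negative sample

f_pos = dominant_frequency(rppg, fs)
f_neg = dominant_frequency(neg, fs)
# A ratio-consistency term would penalize deviation from the known ratio.
ratio = f_neg / f_pos
print(round(f_pos, 2), round(f_neg, 2), round(ratio, 2))  # → 1.2 1.8 1.5
```

In the self-supervised setting, such frequency comparisons replace ground-truth PPG labels: the network is trained so that its estimates reproduce the frequency relations that the augmentations are known to induce.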