Dual Contrastive Learning for Spatio-temporal Representation
Contrastive learning has shown promising potential in self-supervised
spatio-temporal representation learning. Most works naively sample different
clips to construct positive and negative pairs. However, we observe that this
formulation biases the model toward the static background scene. The
underlying reasons are twofold. First, the scene difference is usually more
noticeable and easier to discriminate than the motion difference. Second, the
clips sampled from the same video often share similar backgrounds but have
distinct motions. Simply regarding them as positive pairs will draw the model
to the static background rather than the motion pattern. To tackle this
challenge, this paper presents a novel dual contrastive formulation.
Concretely, we decouple the input RGB video sequence into two complementary
modes, static scene and dynamic motion. Then, the original RGB features are
pulled closer to the static features and the aligned dynamic features,
respectively. In this way, the static scene and the dynamic motion are
simultaneously encoded into the compact RGB representation. We further conduct
the feature space decoupling via activation maps to distill static- and
dynamic-related features. We term our method \textbf{D}ual
\textbf{C}ontrastive \textbf{L}earning for spatio-temporal
\textbf{R}epresentation (DCLR). Extensive experiments demonstrate that DCLR
learns effective spatio-temporal representations and obtains state-of-the-art
or comparable performance on the UCF-101, HMDB-51, and Diving-48 datasets.
Comment: ACM MM 2022 camera ready
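The dual contrastive formulation described above can be sketched as two InfoNCE-style terms that pull the RGB feature toward its static-scene view and its dynamic-motion view simultaneously. This is a minimal illustrative sketch, not the authors' implementation: the cosine similarity, temperature value, and function names are assumptions, and the features would in practice come from learned encoders.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for a single anchor: pull toward the positive,
    push away from the negatives. tau is an assumed temperature."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

def dual_contrastive_loss(rgb, static, dynamic, static_negs, dynamic_negs, tau=0.07):
    # Pull the RGB representation toward BOTH decoupled views, so that
    # static scene and dynamic motion are jointly encoded into it.
    return (info_nce(rgb, static, static_negs, tau) +
            info_nce(rgb, dynamic, dynamic_negs, tau))
```

With a well-aligned positive the loss is close to zero; replacing the positive with an unrelated (orthogonal) feature raises it, which is the pressure that keeps the RGB representation tied to both views.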
Light Field Reconstruction via Attention-Guided Deep Fusion of Hybrid Lenses
This paper explores the problem of reconstructing high-resolution light field
(LF) images from hybrid lenses, including a high-resolution camera surrounded
by multiple low-resolution cameras. The performance of existing methods is
still limited, as they produce either blurry results in plain-textured areas or
distortions around depth-discontinuous boundaries. To tackle this challenge, we
propose a novel end-to-end learning-based approach, which can comprehensively
utilize the specific characteristics of the input from two complementary and
parallel perspectives. Specifically, one module regresses a spatially
consistent intermediate estimation by learning a deep multidimensional and
cross-domain feature representation, while the other module warps another
intermediate estimation, which maintains the high-frequency textures, by
propagating the information of the high-resolution view. We finally leverage
the advantages of the two intermediate estimations adaptively via the learned
attention maps, leading to the final high-resolution LF image with satisfactory
results on both plain-textured areas and depth-discontinuous boundaries.
Besides, to promote the effectiveness of our method trained with simulated
hybrid data on real hybrid data captured by a hybrid LF imaging system, we
carefully design the network architecture and the training strategy. Extensive
experiments on both real and simulated hybrid data demonstrate the significant
superiority of our approach over state-of-the-art methods. To the best of our
knowledge, this is the first end-to-end deep learning method for LF
reconstruction from a real hybrid input. We believe our framework could
potentially decrease the cost of high-resolution LF data acquisition and
benefit LF data storage and transmission.
Comment: 14 pages, 8 figures. arXiv admin note: text overlap with
arXiv:1907.0964
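The final fusion step described above, blending the regression-based estimate and the warping-based estimate via learned attention maps, can be sketched as a per-pixel softmax over the two candidates. This is an illustrative sketch only: in the paper the attention logits come from a learned network, whereas here they are an input, and all names are hypothetical.

```python
import numpy as np

def attention_fuse(est_regress, est_warp, logits):
    """Blend two intermediate estimates with per-pixel attention weights.

    est_regress, est_warp : (H, W) intermediate estimations
    logits                : (2, H, W) unnormalized attention scores
    """
    # Numerically stable softmax over the two candidates at every pixel.
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)
    # Convex combination: each output pixel lies between the two inputs,
    # favoring the spatially consistent estimate in plain areas and the
    # texture-preserving warped estimate near edges.
    return w[0] * est_regress + w[1] * est_warp
```

Equal logits reduce the fusion to a simple average; strongly one-sided logits select one estimate almost exclusively, which is how the learned maps can switch behavior between textured regions and depth boundaries.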