164 research outputs found
Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
We address the problem of video representation learning without
human-annotated labels. While previous efforts address the problem by designing
novel self-supervised tasks using video data, the learned features are merely
frame-based and thus not applicable to many video analytics tasks where
spatio-temporal features prevail. In this paper, we propose a
novel self-supervised approach to learn spatio-temporal features for video
representation. Inspired by the success of two-stream approaches in video
classification, we propose to learn visual features by regressing both motion
and appearance statistics along spatial and temporal dimensions, given only the
input video data. Specifically, we extract statistical concepts (fast-motion
region and the corresponding dominant direction, spatio-temporal color
diversity, dominant color, etc.) from simple patterns in both spatial and
temporal domains. Unlike prior puzzles that are hard even for humans to solve,
the proposed approach is consistent with inherent human visual habits and
therefore easy to answer. We conduct extensive experiments with C3D to validate
the effectiveness of our proposed approach. The experiments show that our
approach can significantly improve the performance of C3D when applied to video
classification tasks. Code is available at
https://github.com/laura-wang/video_repres_mas. Comment: CVPR 2019
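As a rough illustration of the pretext task described above, here is a minimal PyTorch-style sketch that regresses simple motion and appearance statistics from a clip. It is not the authors' code: the statistic definitions (temporal-difference magnitude, mean color, color variance) and the tiny 3D backbone standing in for C3D are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code): a pretext loss that regresses simple
# motion/appearance statistics from a clip, in the spirit of the abstract.
# Statistic definitions and the backbone are assumptions.
import torch
import torch.nn as nn

def clip_statistics(clip):
    """clip: (B, C, T, H, W) in [0, 1]. Returns per-clip pretext targets."""
    # Motion proxy: mean absolute temporal difference (coarse "fast motion" cue).
    motion = (clip[:, :, 1:] - clip[:, :, :-1]).abs().mean(dim=(1, 2, 3, 4))
    # Appearance proxies: dominant (mean) color and spatio-temporal color diversity (std).
    dominant_color = clip.mean(dim=(2, 3, 4))               # (B, C)
    color_diversity = clip.std(dim=(2, 3, 4)).mean(dim=1)   # (B,)
    return torch.cat([motion[:, None], dominant_color, color_diversity[:, None]], dim=1)

class StatisticsRegressor(nn.Module):
    """Stand-in for a C3D-style backbone followed by a regression head."""
    def __init__(self, num_targets=5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_targets)

    def forward(self, clip):
        return self.head(self.backbone(clip))

clips = torch.rand(4, 3, 8, 64, 64)      # toy batch of clips
model = StatisticsRegressor()
loss = nn.functional.mse_loss(model(clips), clip_statistics(clips))
loss.backward()
```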
Self-supervised Video Representation Learning by Pace Prediction
This paper addresses the problem of self-supervised video representation
learning from a new perspective -- by video pace prediction. It stems from the
observation that the human visual system is sensitive to video pace, e.g., slow
motion, a widely used technique in filmmaking. Specifically, given a video
played at its natural pace, we randomly sample training clips at different paces
and ask a neural network to identify the pace for each video clip. The
assumption here is that the network can only succeed in such a pace reasoning
task when it understands the underlying video content and learns representative
spatio-temporal features. In addition, we further introduce contrastive
learning to push the model towards discriminating different paces by maximizing
the agreement on similar video content. To validate the effectiveness of the
proposed method, we conduct extensive experiments on action recognition and
video retrieval tasks with several alternative network architectures.
Experimental evaluations show that our approach achieves state-of-the-art
performance for self-supervised video representation learning across different
network architectures and different benchmarks. The code and pre-trained models
are available at https://github.com/laura-wang/video-pace. Comment: Correct some typos; update some concurrent works accepted by ECCV 2020
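A minimal sketch of the pace-prediction pretext task described above, assuming a simple frame-stride sampler and a toy 3D backbone; the pace set, clip length, and architecture are illustrative assumptions rather than the released implementation, and the contrastive term is omitted.

```python
# Minimal sketch (an assumption, not the released code): sample clips at
# different paces from a video tensor and train a classifier to predict the pace.
import torch
import torch.nn as nn
import torch.nn.functional as F

PACES = [1, 2, 4, 8]  # frame-sampling strides; the actual pace set is an assumption

def sample_clip(video, pace, clip_len=16):
    """video: (C, T, H, W). Take every `pace`-th frame starting at a random offset."""
    C, T, H, W = video.shape
    start = torch.randint(0, max(1, T - pace * clip_len), (1,)).item()
    idx = torch.arange(clip_len) * pace + start
    return video[:, idx]

class PaceClassifier(nn.Module):
    def __init__(self, num_paces=len(PACES)):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(32, num_paces)

    def forward(self, clips):
        return self.fc(self.backbone(clips))

video = torch.rand(3, 256, 64, 64)
labels = torch.randint(0, len(PACES), (4,))
clips = torch.stack([sample_clip(video, PACES[int(i)]) for i in labels])
model = PaceClassifier()
loss = F.cross_entropy(model(clips), labels)  # pace-prediction pretext loss
loss.backward()
```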
Frame Interpolation with Consecutive Brownian Bridge Diffusion
Recent work in Video Frame Interpolation (VFI) tries to formulate VFI as a diffusion-based conditional image generation problem, synthesizing the intermediate frame given random noise and the neighboring frames. Due to the relatively high resolution of videos, Latent Diffusion Models (LDMs) are employed as the conditional generation model, where the autoencoder compresses images into latent representations for diffusion and then reconstructs images from these latent representations. Such a formulation poses a crucial challenge: VFI expects the output to be deterministically equal to the ground truth intermediate frame, but LDMs randomly generate a diverse set of images when the model is run multiple times. The reason for the diverse generation is that the cumulative variance (the variance accumulated at each generation step) of the generated latent representations in LDMs is large. This makes the sampling trajectory random, resulting in diverse rather than deterministic generations. To address this problem, we propose our unique solution: Frame Interpolation with Consecutive Brownian Bridge Diffusion. Specifically, we propose consecutive Brownian Bridge diffusion that takes a deterministic initial value as input, resulting in a much smaller cumulative variance of generated latent representations. Our experiments suggest that our method improves together with improvements to the autoencoder and achieves state-of-the-art performance in VFI, leaving strong potential for further enhancement.
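The key property exploited above, that a Brownian bridge pinned at deterministic endpoints accumulates far less variance than a free diffusion, can be checked with a short numerical sketch (illustration only, not the paper's model; the step count and noise scale are arbitrary).

```python
# Illustrative sketch: a discretized Brownian bridge has deterministic endpoints,
# so its variance stays bounded (peaking mid-trajectory), unlike free Brownian
# motion whose variance keeps accumulating.
import numpy as np

rng = np.random.default_rng(0)
steps, paths, sigma = 100, 2000, 1.0
t = np.linspace(0.0, 1.0, steps + 1)

# Free Brownian motion: variance grows linearly with t.
dW = rng.normal(0.0, sigma * np.sqrt(1.0 / steps), size=(paths, steps))
W = np.concatenate([np.zeros((paths, 1)), np.cumsum(dW, axis=1)], axis=1)

# Brownian bridge pinned at B(0)=0 and B(1)=0: B(t) = W(t) - t * W(1).
B = W - t[None, :] * W[:, -1:]

print("free diffusion variance at t=1 :", W[:, -1].var())          # ~ sigma^2
print("bridge variance at t=1         :", B[:, -1].var())          # exactly 0
print("bridge variance at t=0.5       :", B[:, steps // 2].var())  # ~ sigma^2 / 4
```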
Cross-Task Representation Learning for Anatomical Landmark Detection
Recently, there has been an increasing demand for automatically detecting
anatomical landmarks, which provide rich structural information to facilitate
subsequent medical image analysis. Current methods related to this task often
leverage the power of deep neural networks, while a major challenge in
fine-tuning such models for medical applications arises from the insufficient number of
labeled samples. To address this, we propose to regularize the knowledge
transfer across source and target tasks through cross-task representation
learning. The proposed method is demonstrated for extracting facial anatomical
landmarks which facilitate the diagnosis of fetal alcohol syndrome. The source
and target tasks in this work are face recognition and landmark detection,
respectively. The main idea of the proposed method is to retain the feature
representations of the source model on the target task data, and to leverage
them as an additional source of supervisory signals for regularizing the target
model learning, thereby improving its performance with limited training
samples. Concretely, we present two approaches for the proposed representation
learning, constraining either the final or the intermediate features of the
target model. Experimental results on a clinical face image dataset demonstrate
that the proposed approach works well with few labeled data, and outperforms
other compared approaches. Comment: MICCAI-MLMI 2020
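A minimal sketch of the feature-retention idea described above, assuming a frozen source (face-recognition) backbone and an L2 penalty that keeps the target model's features close to it; the architectures, landmark count, and weighting are hypothetical choices, not the authors' implementation.

```python
# Minimal sketch (an assumption about the general recipe, not the authors' code):
# regularize the target (landmark) model by keeping its features close to a frozen
# source model evaluated on the target data.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_backbone(out_dim=128):
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, out_dim))

source = make_backbone()            # stands in for a model pretrained on face recognition (frozen)
for p in source.parameters():
    p.requires_grad_(False)

target_backbone = make_backbone()   # in practice initialized from the source model
landmark_head = nn.Linear(128, 2 * 5)   # e.g. 5 (x, y) landmarks; the count is illustrative

images = torch.rand(8, 3, 64, 64)
gt_landmarks = torch.rand(8, 10)
lam = 0.1                           # regularization weight (hyperparameter)

feats = target_backbone(images)
task_loss = F.mse_loss(landmark_head(feats), gt_landmarks)
retain_loss = F.mse_loss(feats, source(images))   # keep the source representations
(task_loss + lam * retain_loss).backward()
```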
Surface-SOS: Self-Supervised Object Segmentation via Neural Surface Representation
Self-supervised Object Segmentation (SOS) aims to segment objects without any annotations. With multi-camera inputs, the structural, textural and geometric consistency across views can be leveraged to achieve fine-grained object segmentation. To make better use of this information, we propose Surface representation based Self-supervised Object Segmentation (Surface-SOS), a new framework that segments objects in each view via a 3D surface representation learned from multi-view images of a scene. To model high-quality geometric surfaces of complex scenes, we design a novel scene representation scheme that decomposes the scene into two complementary neural representation modules, each with a Signed Distance Function (SDF). Moreover, Surface-SOS is able to refine single-view segmentation with multi-view unlabeled images by introducing coarse segmentation masks as additional input. To the best of our knowledge, Surface-SOS is the first self-supervised approach that leverages a neural surface representation to break the dependence on large amounts of annotated data and strong constraints. These constraints typically involve observing target objects against a static background or relying on temporal supervision in videos. Extensive experiments on standard benchmarks including LLFF, CO3D, BlendedMVS, TUM and several real-world scenes show that Surface-SOS consistently yields finer object masks than its NeRF-based counterparts and remarkably surpasses supervised single-view baselines.
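As a loose illustration of representing scene components with Signed Distance Functions, the sketch below queries two tiny SDF MLPs and compares their values to obtain a per-point object indicator; both networks and the comparison rule are assumptions, and the real Surface-SOS pipeline is far more involved.

```python
# Illustrative sketch only (not the Surface-SOS pipeline): tiny SDF MLPs queried
# at 3D points; comparing foreground and background SDF values gives a per-point
# object indicator that could be rendered into a per-view mask.
import torch
import torch.nn as nn

class SDFMLP(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, x):            # x: (N, 3) points -> (N,) signed distances
        return self.net(x).squeeze(-1)

fg_sdf, bg_sdf = SDFMLP(), SDFMLP()       # two complementary scene components
points = torch.rand(1024, 3) * 2 - 1      # query points in a [-1, 1]^3 volume
object_mask = (fg_sdf(points) < bg_sdf(points)).float()  # 1 where the foreground surface is closer
print(object_mask.mean())                 # fraction of points assigned to the object
```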
Multi-view Self-supervised Disentanglement for General Image Denoising
With its significant performance improvements, the deep learning paradigm has
become a standard tool for modern image denoisers. While promising performance
has been shown on seen noise distributions, existing approaches often fail to
generalise to unseen noise types or to general and real noise. This is
understandable, as the model is designed to learn a paired mapping (e.g. from a
noisy image to its clean version). In this paper, we instead propose to learn
to disentangle the noisy image, under the intuitive assumption that different
corrupted versions of the same clean image share a common latent space. A
self-supervised learning framework is proposed to achieve the goal, without
looking at the latent clean image. By taking two different corrupted versions
of the same image as input, the proposed Multi-view Self-supervised
Disentanglement (MeD) approach learns to disentangle the latent clean features
from the corruptions and consequently recover the clean image. Extensive
experimental analysis on both synthetic and real noise shows the superiority of
the proposed method over prior self-supervised approaches, especially on unseen
novel noise types. On real noise, the proposed method even outperforms its
supervised counterparts by over 3 dB. Comment: International Conference on Computer Vision 2023 (ICCV 2023)
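A minimal sketch of the multi-view disentanglement idea, assuming an encoder that splits each corrupted view into a shared clean latent and a view-specific corruption latent, with cross-reconstruction tying the clean code to the shared content; the architecture and losses are illustrative stand-ins for MeD, not the released code.

```python
# Minimal sketch (an assumption about the core idea, not the released MeD code):
# two corrupted views of the same image are encoded into a shared "clean" latent
# plus view-specific corruption latents; swapping the clean latent across views
# and reconstructing each input ties the clean code to the shared content.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.ReLU())
        self.to_clean = nn.Conv2d(dim, dim, 1)
        self.to_noise = nn.Conv2d(dim, dim, 1)
    def forward(self, x):
        h = self.conv(x)
        return self.to_clean(h), self.to_noise(h)

encoder = Encoder()
decoder = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 3, 3, padding=1))

clean = torch.rand(4, 3, 32, 32)                 # toy images, only used to synthesize corruptions
view1 = clean + 0.1 * torch.randn_like(clean)    # two different corruptions of the same content
view2 = clean + 0.2 * torch.rand_like(clean)

c1, n1 = encoder(view1)
c2, n2 = encoder(view2)
# Cross-reconstruction: the clean latent from one view plus the corruption latent
# of the other should still reproduce that other view (no clean target involved).
rec2 = decoder(torch.cat([c1, n2], dim=1))
rec1 = decoder(torch.cat([c2, n1], dim=1))
loss = F.mse_loss(rec1, view1) + F.mse_loss(rec2, view2)
loss.backward()
```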
Affinity Attention Graph Neural Network for Weakly Supervised Semantic Segmentation
Weakly supervised semantic segmentation is receiving great attention due to
its low human annotation cost. In this paper, we aim to tackle bounding box
supervised semantic segmentation, i.e., training accurate semantic segmentation
models using bounding box annotations as supervision. To this end, we propose
Affinity Attention Graph Neural Network (A2GNN). Following previous
practices, we first generate pseudo semantic-aware seeds, which are then formed
into semantic graphs based on our newly proposed affinity Convolutional Neural
Network (CNN). Then the built graphs are input to our A2GNN, in which an
affinity attention layer is designed to acquire the short- and long-distance
information from soft graph edges to accurately propagate semantic labels from
the confident seeds to the unlabeled pixels. However, to guarantee the
precision of the seeds, we only adopt a limited number of confident pixel seed
labels for A2GNN, which may lead to insufficient supervision for training.
To alleviate this issue, we further introduce a new loss function and a
consistency-checking mechanism to leverage the bounding box constraint, so that
more reliable guidance can be included for the model optimization. Experiments
show that our approach achieves new state-of-the-art performance on the Pascal VOC
2012 dataset (val: 76.5%, test: 75.2%). More importantly, our approach can
be readily applied to the bounding box supervised instance segmentation task or
other weakly supervised semantic segmentation tasks, with state-of-the-art or
comparable performance on almost all weakly supervised tasks on the PASCAL VOC or
COCO dataset. Our source code will be available at
https://github.com/zbf1991/A2GNN. Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI 2021)
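A minimal sketch of an affinity-gated attention layer propagating labels from a few confident seed nodes, assuming a random placeholder affinity matrix; in the paper the affinities come from the proposed affinity CNN, and the layer below is an illustrative stand-in rather than the A2GNN release.

```python
# Minimal sketch (an assumption, not the A2GNN release): one attention layer over
# soft graph edges that propagates information from confident seed nodes to the
# rest of the graph; only the seeds receive supervision.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffinityAttentionLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
    def forward(self, feats, affinity):
        # Attention logits from node features, gated by soft edge affinities.
        logits = self.query(feats) @ self.key(feats).t() / feats.shape[1] ** 0.5
        attn = F.softmax(logits, dim=-1) * affinity
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
        return attn @ feats                       # propagate information along edges

num_nodes, dim, num_classes = 64, 16, 21
feats = torch.randn(num_nodes, dim)
affinity = torch.rand(num_nodes, num_nodes)       # soft edges (placeholder affinities)
seed_mask = torch.zeros(num_nodes, dtype=torch.bool)
seed_mask[:8] = True                              # a few confident seed nodes
seed_labels = torch.randint(0, num_classes, (8,))

layer = AffinityAttentionLayer(dim)
classifier = nn.Linear(dim, num_classes)
logits = classifier(layer(feats, affinity))
loss = F.cross_entropy(logits[seed_mask], seed_labels)  # supervise only on seeds
loss.backward()
```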