Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
We address the problem of video representation learning without
human-annotated labels. While previous efforts address the problem by designing
novel self-supervised tasks using video data, the learned features are merely
frame-level, which makes them unsuitable for many video analytic
tasks where spatio-temporal features prevail. In this paper we propose a
novel self-supervised approach to learn spatio-temporal features for video
representation. Inspired by the success of two-stream approaches in video
classification, we propose to learn visual features by regressing both motion
and appearance statistics along spatial and temporal dimensions, given only the
input video data. Specifically, we extract statistical concepts (fast-motion
region and the corresponding dominant direction, spatio-temporal color
diversity, dominant color, etc.) from simple patterns in both spatial and
temporal domains. Unlike prior puzzles that are even hard for humans to solve,
the proposed approach is consistent with human inherent visual habits and
therefore easy to answer. We conduct extensive experiments with C3D to validate
the effectiveness of our proposed approach. The experiments show that our
approach can significantly improve the performance of C3D when applied to video
classification tasks. Code is available at
https://github.com/laura-wang/video_repres_mas.
Comment: CVPR 201
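As a rough illustration of one appearance statistic named in the abstract above (dominant color over a clip), the sketch below quantizes RGB pixels into coarse bins and returns the most frequent bin across all frames. The function name `dominant_color_bin`, the nested-list clip layout, and the bin count are assumptions made for this example; the abstract does not specify how the authors compute the statistic.

```python
from collections import Counter


def dominant_color_bin(clip, bins_per_channel=2):
    """Return the most frequent quantized (r, g, b) bin over a whole clip.

    clip: list of frames; each frame is a list of (r, g, b) pixels, values 0-255.
    This layout and the uniform quantization are illustrative assumptions.
    """
    step = 256 // bins_per_channel  # width of each color bin per channel
    counts = Counter()
    for frame in clip:
        for r, g, b in frame:
            counts[(r // step, g // step, b // step)] += 1
    # The bin that collects the most pixels across space and time.
    return counts.most_common(1)[0][0]


# Tiny two-frame clip: three reddish pixels and one blue pixel.
clip = [[(255, 0, 0), (250, 10, 5)], [(200, 0, 0), (0, 0, 255)]]
dominant = dominant_color_bin(clip)
```

In a self-supervised setup of this kind, a label such as the returned bin would serve as a regression or classification target derived from the video itself, with no human annotation.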
Self-supervised Video Representation Learning by Pace Prediction
This paper addresses the problem of self-supervised video representation
learning from a new perspective -- by video pace prediction. It stems from the
observation that the human visual system is sensitive to video pace, e.g., slow
motion, a widely used technique in film making. Specifically, given a video
played in natural pace, we randomly sample training clips in different paces
and ask a neural network to identify the pace for each video clip. The
assumption here is that the network can only succeed in such a pace reasoning
task when it understands the underlying video content and learns representative
spatio-temporal features. In addition, we further introduce contrastive
learning to push the model towards discriminating different paces by maximizing
the agreement on similar video content. To validate the effectiveness of the
proposed method, we conduct extensive experiments on action recognition and
video retrieval tasks with several alternative network architectures.
Experimental evaluations show that our approach achieves state-of-the-art
performance for self-supervised video representation learning across different
network architectures and different benchmarks. The code and pre-trained models
are available at https://github.com/laura-wang/video-pace.
Comment: Correct some typos; update some concurrent works accepted by ECCV 202
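The sampling step described in the pace-prediction abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes pace is realized by frame skipping only (pace 1 keeps every frame, pace 2 every second frame, and so on), and the names `sample_clip_indices`, `make_training_example`, and the candidate pace set `(1, 2, 4, 8)` are hypothetical.

```python
import random


def sample_clip_indices(num_frames, clip_len, pace, start=None, rng=random):
    """Return the source-frame indices of a clip sampled at the given pace."""
    span = (clip_len - 1) * pace + 1  # source frames covered by the clip
    if span > num_frames:
        raise ValueError("video too short for this clip length and pace")
    if start is None:
        start = rng.randrange(num_frames - span + 1)
    return [start + i * pace for i in range(clip_len)]


def make_training_example(num_frames, clip_len, paces=(1, 2, 4, 8), rng=random):
    """Pick a pace at random; the pace index is the classification label."""
    label = rng.randrange(len(paces))
    indices = sample_clip_indices(num_frames, clip_len, paces[label], rng=rng)
    return indices, label


fixed = sample_clip_indices(100, 4, 2, start=10)
indices, label = make_training_example(100, 4)
```

A network trained on such pairs receives the frames at `indices` and is asked to predict `label`; the contrastive term mentioned in the abstract is not covered by this sketch.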
Cross-Task Representation Learning for Anatomical Landmark Detection
Recently, there has been an increasing demand for automatically detecting
anatomical landmarks which provide rich structural information to facilitate
subsequent medical image analysis. Current methods related to this task often
leverage the power of deep neural networks, while a major challenge in fine
tuning such models in medical applications arises from an insufficient number of
labeled samples. To address this, we propose to regularize the knowledge
transfer across source and target tasks through cross-task representation
learning. The proposed method is demonstrated for extracting facial anatomical
landmarks which facilitate the diagnosis of fetal alcohol syndrome. The source
and target tasks in this work are face recognition and landmark detection,
respectively. The main idea of the proposed method is to retain the feature
representations of the source model on the target task data, and to leverage
them as an additional source of supervisory signals for regularizing the target
model learning, thereby improving its performance under limited training
samples. Concretely, we present two approaches for the proposed representation
learning by constraining either final or intermediate model features on the
target model. Experimental results on a clinical face image dataset demonstrate
that the proposed approach works well with few labeled data, and outperforms
other compared approaches.
Comment: MICCAI-MLMI 202
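The regularization idea in the abstract above (keeping the target model's features close to those the source model produces on the same target-task data) can be written as a combined objective. The sketch below is one possible reading: the use of mean squared error for both terms, the weighting factor `weight`, and the function names are assumptions, since the abstract does not state the distance measure or how the terms are balanced.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_task_loss(pred_landmarks, true_landmarks,
                    target_feats, source_feats, weight=0.5):
    """Landmark loss plus a feature-matching regularizer (illustrative).

    target_feats: features of the model being fine-tuned for landmark detection.
    source_feats: features of the frozen source (face recognition) model
                  computed on the same input image.
    """
    task_term = mse(pred_landmarks, true_landmarks)
    reg_term = mse(target_feats, source_feats)
    return task_term + weight * reg_term


# Perfect landmark prediction, but features drift from the source model.
loss = cross_task_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [2.0, 2.0])
```

The two variants mentioned in the abstract would differ in which features are passed in: final-layer outputs or intermediate activations.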
Surface-SOS: Self-Supervised Object Segmentation via Neural Surface Representation
Self-supervised Object Segmentation (SOS) aims to segment objects without any annotations. Under conditions of multi-camera inputs, the structural, textural and geometrical consistency among each view can be leveraged to achieve fine-grained object segmentation. To make better use of the above information, we propose Surface representation based Self-supervised Object Segmentation (Surface-SOS), a new framework to segment objects for each view by 3D surface representation from multi-view images of a scene. To model high-quality geometry surfaces for complex scenes, we design a novel scene representation scheme, which decomposes the scene into two complementary neural representation modules respectively with a Signed Distance Function (SDF). Moreover, Surface-SOS is able to refine single-view segmentation with multi-view unlabeled images, by introducing coarse segmentation masks as additional input. To the best of our knowledge, Surface-SOS is the first self-supervised approach that leverages neural surface representation to break the dependence on large amounts of annotated data and strong constraints. These constraints typically involve observing target objects against a static background or relying on temporal supervision in videos. Extensive experiments on standard benchmarks including LLFF, CO3D, BlendedMVS, TUM and several real-world scenes show that Surface-SOS always yields finer object masks than its NeRF-based counterparts and surpasses supervised single-view baselines remarkably.
Multi-view Self-supervised Disentanglement for General Image Denoising
With its significant performance improvements, the deep learning paradigm has
become a standard tool for modern image denoisers. While promising performance
has been shown on seen noise distributions, existing approaches often suffer
from generalisation to unseen noise types or general and real noise. It is
understandable as the model is designed to learn paired mapping (e.g. from a
noisy image to its clean version). In this paper, we instead propose to learn
to disentangle the noisy image, under the intuitive assumption that different
corrupted versions of the same clean image share a common latent space. A
self-supervised learning framework is proposed to achieve the goal, without
looking at the latent clean image. By taking two different corrupted versions
of the same image as input, the proposed Multi-view Self-supervised
Disentanglement (MeD) approach learns to disentangle the latent clean features
from the corruptions and recover the clean image consequently. Extensive
experimental analysis on both synthetic and real noise shows the superiority of
the proposed method over prior self-supervised approaches, especially on unseen
novel noise types. On real noise, the proposed method even outperforms its
supervised counterparts by over 3 dB.
Comment: International Conference on Computer Vision 2023 (ICCV 2023)
Affinity Attention Graph Neural Network for Weakly Supervised Semantic Segmentation
Weakly supervised semantic segmentation is receiving great attention due to
its low human annotation cost. In this paper, we aim to tackle bounding box
supervised semantic segmentation, i.e., training accurate semantic segmentation
models using bounding box annotations as supervision. To this end, we propose
Affinity Attention Graph Neural Network (A2GNN). Following previous
practices, we first generate pseudo semantic-aware seeds, which are then formed
into semantic graphs based on our newly proposed affinity Convolutional Neural
Network (CNN). Then the built graphs are input to our A2GNN, in which an
affinity attention layer is designed to acquire the short- and long-distance
information from soft graph edges to accurately propagate semantic labels from
the confident seeds to the unlabeled pixels. However, to guarantee the
precision of the seeds, we only adopt a limited number of confident pixel seed
labels for A2GNN, which may lead to insufficient supervision for training.
To alleviate this issue, we further introduce a new loss function and a
consistency-checking mechanism to leverage the bounding box constraint, so that
more reliable guidance can be included for the model optimization. Experiments
show that our approach achieves new state-of-the-art performances on Pascal VOC
2012 datasets (val: 76.5%, test: 75.2%). More importantly, our approach can
be readily applied to bounding box supervised instance segmentation task or
other weakly supervised semantic segmentation tasks, with state-of-the-art or
comparable performance among almost all weakly supervised tasks on PASCAL VOC or
COCO dataset. Our source code will be available at
https://github.com/zbf1991/A2GNN.
Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI 2021)
CoinSeg: Contrast Inter- and Intra- Class Representations for Incremental Segmentation
Class incremental semantic segmentation aims to strike a balance between the model’s stability and plasticity by maintaining old knowledge while adapting to new concepts. However, most state-of-the-art methods use the freeze strategy for stability, which compromises the model’s plasticity. In contrast, releasing parameter training for plasticity could lead to the best performance for all categories, but this requires discriminative feature representation. Therefore, we prioritize the model’s plasticity and propose the Contrast inter- and intra-class representations for Incremental Segmentation (CoinSeg), which pursues discriminative representations for flexible parameter tuning. Inspired by the Gaussian mixture model that samples from a mixture of Gaussian distributions, CoinSeg emphasizes intra-class diversity with multiple contrastive representation centroids. Specifically, we use mask proposals to identify regions with strong objectness that are likely to be diverse instances/centroids of a category. These mask proposals are then used for contrastive representations to reinforce intra-class diversity. Meanwhile, to avoid bias from intra-class diversity, we also apply category-level pseudo-labels to enhance category-level consistency and inter-category diversity. Additionally, CoinSeg ensures the model’s stability and alleviates forgetting through a specific flexible tuning strategy. We validate CoinSeg on Pascal VOC 2012 and ADE20K datasets with multiple incremental scenarios and achieve superior results compared to previous state-of-the-art methods, especially in more challenging and realistic long-term scenarios. Code is available at https://github.com/zkzhang98/CoinSeg