Learning Attentive Pairwise Interaction for Fine-Grained Classification
Fine-grained classification is a challenging problem because of the subtle
differences among easily confused categories. Most approaches address this
difficulty by learning a discriminative representation of each individual input image.
On the other hand, humans can effectively identify contrastive clues by
comparing image pairs. Inspired by this fact, this paper proposes a simple but
effective Attentive Pairwise Interaction Network (API-Net), which can
progressively recognize a pair of fine-grained images by interaction.
Specifically, API-Net first learns a mutual feature vector to capture semantic
differences in the input pair. It then compares this mutual vector with
individual vectors to generate gates for each input image. These distinct gate
vectors inherit mutual context on semantic differences, which allow API-Net to
attentively capture contrastive clues by pairwise interaction between two
images. Additionally, we train API-Net in an end-to-end manner with a score
ranking regularization, which can further generalize API-Net by taking feature
priorities into account. We conduct extensive experiments on five popular
benchmarks in fine-grained classification. API-Net outperforms recent SOTA
methods on these benchmarks: CUB-200-2011 (90.0%), Aircraft (93.9%), Stanford
Cars (95.3%), Stanford Dogs (90.3%), and NABirds (88.1%).
Comment: Accepted at AAAI-202
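The gating step described in this abstract (a mutual vector compared against each individual vector to produce per-image gates) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the element-wise mean stands in for the learned mutual-vector mapping, and the sigmoid gate and the residual form `x + x * g` are hypothetical choices, not details confirmed by the abstract.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mutual_vector(x1, x2):
    # ASSUMPTION: a learned mapping would go here; the element-wise mean
    # is only a placeholder so the sketch runs without any training.
    return [(a + b) / 2.0 for a, b in zip(x1, x2)]


def gate(xm, x):
    # Compare the mutual vector with one individual vector, channel by channel.
    return [sigmoid(m * v) for m, v in zip(xm, x)]


def interact(x1, x2):
    """Return each feature vector re-weighted by its own and the other gate."""
    xm = mutual_vector(x1, x2)
    g1, g2 = gate(xm, x1), gate(xm, x2)

    def apply(x, g):
        # ASSUMPTION: residual re-weighting, x + x * g.
        return [v + v * gv for v, gv in zip(x, g)]

    return {
        "x1_self": apply(x1, g1),
        "x1_other": apply(x1, g2),
        "x2_self": apply(x2, g2),
        "x2_other": apply(x2, g1),
    }
```

Calling `interact([1.0, 0.0], [0.0, 1.0])` yields four attended vectors, two per image, which a classifier could then score.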
CP3: Unifying Point Cloud Completion by Pretrain-Prompt-Predict Paradigm
Point cloud completion aims to predict a complete shape from a partial
observation. Current approaches mainly consist of generation and refinement
stages in a coarse-to-fine style. However, the generation stage often lacks
robustness to tackle different incomplete variations, while the refinement
stage blindly recovers point clouds without semantic awareness. To tackle
these challenges, we unify point cloud Completion by a generic
Pretrain-Prompt-Predict paradigm, namely CP3. Inspired by prompting approaches
from NLP, we creatively reinterpret point cloud generation and refinement as
the prompting and predicting stages, respectively. Then, we introduce a concise
self-supervised pretraining stage before prompting. It can effectively increase
robustness of point cloud generation, by an Incompletion-Of-Incompletion (IOI)
pretext task. Moreover, we develop a novel Semantic Conditional Refinement
(SCR) network at the predicting stage. It can discriminatively modulate
multi-scale refinement with the guidance of semantics. Finally, extensive
experiments demonstrate that our CP3 outperforms the state-of-the-art methods
by a large margin.
Context-Transformer: Tackling Object Confusion for Few-Shot Detection
Few-shot object detection is a challenging but realistic scenario, where only
a few annotated training images are available for training detectors. A popular
approach to handle this problem is transfer learning, i.e., fine-tuning a
detector pretrained on a source-domain benchmark. However, such a transferred
detector often fails to recognize new objects in the target domain, due to the low
data diversity of training samples. To tackle this problem, we propose a novel
Context-Transformer within a concise deep transfer framework. Specifically,
Context-Transformer can effectively leverage source-domain object knowledge as
guidance, and automatically exploit contexts from only a few training images in
the target domain. Subsequently, it can adaptively integrate these relational
clues to enhance the discriminative power of the detector, in order to reduce
object confusion in few-shot scenarios. Moreover, Context-Transformer is
flexibly embedded in the popular SSD-style detectors, which makes it a
plug-and-play module for end-to-end few-shot learning. Finally, we evaluate
Context-Transformer on the challenging settings of few-shot detection and
incremental few-shot detection. The experimental results show that our
framework outperforms the recent state-of-the-art approaches.
Comment: Accepted by AAAI-202
PC-HMR: Pose Calibration for 3D Human Mesh Recovery from 2D Images/Videos
The end-to-end Human Mesh Recovery (HMR) approach has been successfully used
for 3D body reconstruction. However, most HMR-based frameworks reconstruct
the human body by directly learning mesh parameters from images or videos, while
lacking explicit guidance from the 3D human pose in the visual data. As a result, the
generated mesh often exhibits an incorrect pose for complex activities. To tackle
this problem, we propose to exploit 3D pose to calibrate human mesh.
Specifically, we develop two novel Pose Calibration frameworks, i.e., Serial
PC-HMR and Parallel PC-HMR. By coupling advanced 3D pose estimators and HMR in
a serial or parallel manner, these two frameworks can effectively correct human
mesh with guidance of a concise pose calibration module. Furthermore, since the
calibration module is designed via non-rigid pose transformation, our PC-HMR
frameworks can flexibly tackle bone length variations to alleviate misplacement
in the calibrated mesh. Finally, our frameworks are based on generic and
complementary integration of data-driven learning and geometrical modeling. Via
plug-and-play modules, they can be efficiently adapted for both
image-based and video-based human mesh recovery. Additionally, they require no
extra 3D pose annotations in the testing phase, which eases inference
in practice. We perform extensive experiments on the popular
benchmarks, i.e., Human3.6M, 3DPW and SURREAL, where our PC-HMR frameworks
achieve the SOTA results.
Comment: 9 pages, 7 figures. AAAI202
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation
Models (IFMs), which face challenges in transferring to the video domain.
Although VideoMAE has trained a robust ViT from limited data, its low-level
reconstruction poses convergence difficulties and conflicts with high-level
cross-modal alignment. This paper proposes a training-efficient method for
temporal-sensitive VFMs that integrates the benefits of existing methods. To
increase data efficiency, we mask out most of the low-semantics video tokens,
but selectively align the unmasked tokens with IFM, which serves as the
UnMasked Teacher (UMT). By providing semantic guidance, our method enables
faster convergence and multimodal friendliness. With a progressive pre-training
framework, our model can handle various tasks including scene-related,
temporal-related, and complex video-language understanding. Using only public
sources for pre-training in 6 days on 32 A100 GPUs, our scratch-built ViT-L/16
achieves state-of-the-art performance on various video tasks. The code and
models will be released at https://github.com/OpenGVLab/unmasked_teacher.
Comment: 16 pages, 5 figures, 28 tables
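The masking-and-alignment idea in this abstract (drop most low-semantics tokens, then align only the unmasked student tokens with the teacher) can be sketched as follows. This is a hedged illustration, not the released implementation: the per-token `scores`, the `keep_ratio` value, and the mean-squared-error alignment loss are all assumptions introduced here.

```python
def select_unmasked(scores, keep_ratio=0.2):
    """Keep the indices of the highest-scoring tokens.

    ASSUMPTION: `scores` is some per-token semantic importance value;
    how it is obtained is outside this sketch.
    """
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def alignment_loss(student, teacher, kept):
    """Mean squared error between student and teacher features,
    computed only on the unmasked (kept) token positions.

    ASSUMPTION: MSE is a stand-in for whatever alignment objective is used.
    """
    total, count = 0.0, 0
    for i in kept:
        for s, t in zip(student[i], teacher[i]):
            total += (s - t) ** 2
            count += 1
    return total / count
```

For example, with five tokens and `keep_ratio=0.4`, only the two highest-scoring positions contribute to the loss, so the student never processes the remaining three.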
MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency
Masked Modeling (MM) has demonstrated widespread success in various vision
challenges by reconstructing masked visual patches. Yet, applying MM to
large-scale 3D scenes remains an open problem due to the data sparsity and
scene complexity. The conventional random masking paradigm used in 2D images
often causes a high risk of ambiguity when recovering the masked region of 3D
scenes. To this end, we propose a novel informative-preserved reconstruction,
which explores local statistics to discover and preserve the representative
structured points, effectively enhancing the pretext masking task for 3D scene
understanding. Integrated with a progressive reconstruction manner, our method
can concentrate on modeling regional geometry and enjoy less ambiguity for
masked reconstruction. Besides, such scenes with progressive masking ratios can
also serve to self-distill their intrinsic spatial consistency, which requires
learning consistent representations from unmasked areas. By elegantly
combining informative-preserved reconstruction on masked areas and consistency
self-distillation from unmasked areas, a unified framework called MM-3DScene is
yielded. We conduct comprehensive experiments on a host of downstream tasks.
The consistent improvement (e.g., +6.1 mAP@0.5 on object detection and +2.2%
mIoU on semantic segmentation) demonstrates the superiority of our approach.
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Scale is the primary factor for building a powerful foundation model that
could well generalize to a variety of downstream tasks. However, it is still
challenging to train video foundation models with billions of parameters. This
paper shows that video masked autoencoder (VideoMAE) is a scalable and general
self-supervised pre-trainer for building video foundation models. We scale the
VideoMAE in both model and data with a core design. Specifically, we present a
dual masking strategy for efficient pre-training, with an encoder operating on
a subset of video tokens and a decoder processing another subset of video
tokens. Although VideoMAE is very efficient due to the high masking ratio in
the encoder, masking the decoder can still further reduce the overall computational
cost. This enables the efficient pre-training of billion-level models in video.
We also use a progressive training paradigm that involves an initial
pre-training on a diverse multi-sourced unlabeled dataset, followed by a
post-pre-training on a mixed labeled dataset. Finally, we successfully train a
video ViT model with a billion parameters, which achieves a new
state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and
89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In
addition, we extensively verify the pre-trained video ViT models on a variety
of downstream tasks, demonstrating their effectiveness as general video
representation learners.
Comment: CVPR 2023 camera-ready version
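The dual masking strategy in this abstract (the encoder sees one subset of video tokens, the decoder processes another) can be sketched as an index-selection routine. This is only an illustrative sketch: uniform random selection, the `encoder_keep` and `decoder_keep` ratios, and drawing the decoder subset from the tokens hidden from the encoder are assumptions made here, not details stated in the abstract.

```python
import random


def dual_mask(num_tokens, encoder_keep=0.1, decoder_keep=0.5, seed=0):
    """Split token indices into an encoder subset and a decoder subset.

    ASSUMPTIONS: uniform random selection and the two ratios are
    hypothetical; the decoder subset is drawn from encoder-hidden tokens.
    """
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)

    # Encoder: a small visible subset (high masking ratio).
    n_enc = max(1, round(num_tokens * encoder_keep))
    encoder_ids = sorted(order[:n_enc])

    # Decoder: only part of the remaining tokens are processed,
    # which is where the additional compute saving comes from.
    hidden = order[n_enc:]
    n_dec = max(1, round(len(hidden) * decoder_keep))
    decoder_ids = sorted(hidden[:n_dec])
    return encoder_ids, decoder_ids
```

With 100 tokens and the default ratios, the encoder handles 10 tokens and the decoder handles 45 of the remaining 90, instead of all 90.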