52 research outputs found
CycleACR: Cycle Modeling of Actor-Context Relations for Video Action Detection
Modeling the relations between actors and scene context advances video action
detection, where the correlation among multiple actors makes their action
recognition challenging. Existing studies model the relation between each actor
and the scene to improve action recognition. However, scene variations and
background interference limit the effectiveness of this relation modeling. In this paper,
we propose to select actor-related scene context, rather than directly leveraging
the raw video scene, to improve relation modeling. We develop a Cycle
Actor-Context Relation network (CycleACR) built on a symmetric graph that
models actor and context relations bidirectionally. Our CycleACR
consists of the Actor-to-Context Reorganization (A2C-R) that collects actor
features for context feature reorganizations, and the Context-to-Actor
Enhancement (C2A-E) that dynamically utilizes reorganized context features for
actor feature enhancement. Compared to existing designs that focus on C2A-E,
our CycleACR introduces A2C-R for a more effective relation modeling. This
modeling advances our CycleACR to achieve state-of-the-art performance on two
popular action detection datasets (i.e., AVA and UCF101-24). We also provide
ablation studies and visualizations to show how our cycle actor-context
relation modeling improves video action detection. Code is available at
https://github.com/MCG-NJU/CycleACR.
Comment: technical report
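As an illustration of the bidirectional idea, the following is a minimal PyTorch sketch in which context tokens first attend to actor features (A2C-R) and the actors then attend to the reorganized context (C2A-E); the module name, dimensions, and use of standard multi-head attention are assumptions made for exposition, not the authors' implementation.

import torch
import torch.nn as nn

class CycleActorContextSketch(nn.Module):
    """Illustrative only: cross-attention in both directions between actors and context."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # A2C-R: context tokens attend to actor features to reorganize themselves
        self.a2c_r = nn.MultiheadAttention(dim, heads, batch_first=True)
        # C2A-E: actors attend to the reorganized context to enhance their features
        self.c2a_e = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, actor_feats, context_feats):
        # actor_feats: (B, num_actors, dim); context_feats: (B, num_context_tokens, dim)
        reorganized_ctx, _ = self.a2c_r(context_feats, actor_feats, actor_feats)
        enhanced_actors, _ = self.c2a_e(actor_feats, reorganized_ctx, reorganized_ctx)
        return enhanced_actors

actors = torch.randn(2, 5, 256)    # e.g. RoI-pooled actor features
context = torch.randn(2, 49, 256)  # e.g. flattened spatio-temporal feature map
print(CycleActorContextSketch()(actors, context).shape)  # torch.Size([2, 5, 256])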
TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation
Dynamic scene graph generation (SGG) focuses on detecting objects in a video
and determining their pairwise relationships. Existing dynamic SGG methods
usually suffer from several issues, including 1) contextual noise, as some
frames may contain occluded and blurred objects, and 2) label bias, primarily due
to the high imbalance between a few positive relationship samples and numerous
negative ones; additionally, the distribution of relationships exhibits a
long-tailed pattern. To address the above problems, in this paper, we introduce
a network named TD^2-Net that aims at denoising and debiasing for dynamic
SGG. Specifically, we first propose a denoising spatio-temporal transformer
module that enhances object representation with robust contextual information.
This is achieved by designing a differentiable Top-K object selector that
utilizes the Gumbel-Softmax sampling strategy to select the relevant
neighborhood for each object. Second, we introduce an asymmetrical reweighting
loss to relieve the issue of label bias. This loss function integrates
asymmetry focusing factors and the volume of samples to adjust the weights
assigned to individual samples. Systematic experimental results demonstrate the
superiority of our proposed TD^2-Net over existing state-of-the-art
approaches on the Action Genome dataset. In more detail, TD^2-Net outperforms
the second-best competitor by 12.7% on mean-Recall@10 for predicate
classification.
Comment: Accepted by AAAI 2024
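To make the differentiable Top-K idea concrete, here is a hedged PyTorch sketch that draws k straight-through Gumbel-Softmax samples over pairwise relevance scores to build a soft neighbourhood mask per object; the scoring, the iterative sampling loop, and all shapes are assumptions rather than the paper's exact module.

import torch
import torch.nn.functional as F

def gumbel_topk_mask(logits, k, tau=1.0):
    """logits: (N, M) relevance of M candidate neighbours for each of N objects.
    Returns an (N, M) mask selecting roughly k neighbours per object while staying
    differentiable via the straight-through Gumbel-Softmax trick."""
    mask = torch.zeros_like(logits)
    work = logits.clone()
    for _ in range(k):
        sel = F.gumbel_softmax(work, tau=tau, hard=True)             # one-hot rows
        mask = mask + sel
        work = work.masked_fill(sel.detach().bool(), float('-inf'))  # avoid re-selection
    return mask.clamp(max=1.0)

obj_feats = torch.randn(8, 256)          # 8 detected objects in a frame
scores = obj_feats @ obj_feats.t()       # toy pairwise relevance scores
mask = gumbel_topk_mask(scores, k=3)
context = (mask @ obj_feats) / 3         # aggregated neighbourhood features per object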
Pose-disentangled Contrastive Learning for Self-supervised Facial Representation
Self-supervised facial representation learning has recently attracted increasing
attention due to its ability to perform face understanding without relying
heavily on large-scale annotated datasets. However, current
contrastive-based self-supervised learning still performs unsatisfactorily for
learning facial representations. More specifically, existing contrastive
learning (CL) tends to learn pose-invariant features that cannot capture the
pose details of faces, compromising the learning performance. To overcome this
limitation of CL, we propose a novel Pose-disentangled Contrastive
Learning (PCL) method for general self-supervised facial representation. Our
PCL first devises a pose-disentangled decoder (PDD) with a carefully designed
orthogonalizing regularization, which disentangles pose-related features from
face-aware features; therefore, pose-related and pose-unrelated
facial information are processed in separate subnetworks and do not
interfere with each other's training. Furthermore, we introduce a pose-related
contrastive learning scheme that learns pose-related information based on data
augmentation of the same image, which delivers more effective face-aware
representation for various downstream tasks. We conducted a comprehensive
linear evaluation on three challenging downstream facial understanding tasks,
i.e., facial expression recognition, face recognition, and AU detection.
Experimental results demonstrate that our method outperforms cutting-edge
contrastive and other self-supervised learning methods by a large margin.
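A minimal sketch of the disentangling idea, assuming two linear heads over a shared backbone feature and a squared-cosine orthogonality penalty; the head design and loss form are illustrative stand-ins, not PCL's actual decoder.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangleHeadsSketch(nn.Module):
    def __init__(self, dim=512, out_dim=128):
        super().__init__()
        self.pose_head = nn.Linear(dim, out_dim)   # pose-related branch
        self.face_head = nn.Linear(dim, out_dim)   # face-aware (pose-unrelated) branch

    def forward(self, feats):
        pose = F.normalize(self.pose_head(feats), dim=-1)
        face = F.normalize(self.face_head(feats), dim=-1)
        # orthogonalizing regularization: penalize overlap between the two branches
        ortho_loss = (pose * face).sum(dim=-1).pow(2).mean()
        return pose, face, ortho_loss

feats = torch.randn(16, 512)                   # backbone features for a batch of faces
pose, face, reg = DisentangleHeadsSketch()(feats)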
SpliceMix: A Cross-scale and Semantic Blending Augmentation Strategy for Multi-label Image Classification
Recently, Mix-style data augmentation methods (e.g., Mixup and CutMix) have
shown promising performance in various visual tasks. However, these methods are
primarily designed for single-label images, ignoring the considerable
discrepancies between single- and multi-label images, i.e., a multi-label image
involves multiple co-occurring categories and varying object scales. On the other
hand, previous multi-label image classification (MLIC) methods tend to rely on
elaborate models, incurring expensive computation. In this paper, we introduce a
simple but effective augmentation strategy for multi-label image
classification, namely SpliceMix. The "splice" in our method is two-fold: 1)
Each mixed image is a splice of several downsampled images arranged in a
grid, where the semantics of the images participating in the mix are blended without
object deficiencies, alleviating co-occurrence bias; 2) We splice the mixed images
and the original mini-batch to form a new SpliceMixed mini-batch, which allows
images at different scales to contribute to training together. Furthermore,
such splice in our SpliceMixed mini-batch enables interactions between mixed
images and original regular images. We also offer a simple and non-parametric
extension based on consistency learning (SpliceMix-CL) to show the flexible
extensibility of our SpliceMix. Extensive experiments on various tasks
demonstrate that only using SpliceMix with a baseline model (e.g., ResNet)
achieves better performance than state-of-the-art methods. Moreover, the
generalizability of our SpliceMix is further validated by the improvements in
current MLIC methods when combined with our SpliceMix. The code is available at
https://github.com/zuiran/SpliceMix.
Comment: 13 pages, 10 figures
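The core augmentation can be sketched in a few lines: downsample a handful of randomly chosen images, tile them into one grid image whose label is the union of theirs, and append the result to the original mini-batch. The grid size, interpolation mode, and the assumption that the image size divides evenly by the grid are illustrative choices, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def splice_mix(images, labels, grid=2):
    """images: (B, C, H, W); labels: (B, num_classes) multi-hot.
    Assumes B >= grid*grid and H, W divisible by grid."""
    B, C, H, W = images.shape
    n = grid * grid
    perm = torch.randperm(B)[:n]
    small = F.interpolate(images[perm], size=(H // grid, W // grid),
                          mode='bilinear', align_corners=False)
    rows = [torch.cat(list(small[r * grid:(r + 1) * grid]), dim=-1) for r in range(grid)]
    mixed = torch.cat(rows, dim=-2).unsqueeze(0)             # (1, C, H, W) grid image
    mixed_label = labels[perm].amax(dim=0, keepdim=True)     # union of the mixed labels
    return torch.cat([images, mixed]), torch.cat([labels, mixed_label])

imgs = torch.randn(8, 3, 224, 224)
labs = torch.randint(0, 2, (8, 20)).float()
new_imgs, new_labs = splice_mix(imgs, labs)                  # batch grows from 8 to 9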
On Exploring Node-feature and Graph-structure Diversities for Node Drop Graph Pooling
A pooling operation is essential for effective graph-level representation
learning, where the node drop pooling has become one mainstream graph pooling
technology. However, current node drop pooling methods usually keep the top-k
nodes according to their significance scores, which ignores graph diversity
in terms of node features and graph structures, thus resulting in
suboptimal graph-level representations. To address the aforementioned issue, we
propose a novel plug-and-play scoring scheme, referred to as MID, which
consists of a Multidimensional score space with two operations,
i.e., flIpscore and Dropscore. Specifically, the
multidimensional score space depicts the significance of nodes through multiple
criteria; the flipscore encourages the maintenance of dissimilar node features;
and the dropscore forces the model to notice diverse graph structures instead
of being stuck in significant local structures. To evaluate the effectiveness
of our proposed MID, we perform extensive experiments by applying it to a wide
variety of recent node drop pooling methods, including TopKPool, SAGPool,
GSAPool, and ASAP. Specifically, the proposed MID can efficiently and
consistently achieve about 2.8% average improvements over the above four
methods on seventeen real-world graph classification datasets, including four
social datasets (IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY, and COLLAB), and
thirteen biochemical datasets (D&D, PROTEINS, NCI1, MUTAG, PTC-MR, NCI109,
ENZYMES, MUTAGENICITY, FRANKENSTEIN, HIV, BBBP, TOXCAST, and TOX21). Code is
available at https://github.com/whuchuang/mid.
Comment: 14 pages, 14 figures
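As a loose illustration of how a multidimensional score with flip- and drop-style operations could plug into top-k node selection, the sketch below scores every feature dimension, takes absolute values so dissimilar (negatively scored) features still contribute, and randomly drops part of the scores before ranking; the concrete formulas are stand-ins for the paper's definitions, not the paper's formulas.

import torch

def mid_style_topk(x, k, drop_ratio=0.2):
    """x: (N, F) node features of one graph. Returns indices of the k kept nodes."""
    scores = torch.tanh(x)              # multidimensional score: one value per feature dim
    scores = scores.abs()               # flip-style step: dissimilar features still count
    keep = (torch.rand_like(scores) > drop_ratio).float()
    scores = scores * keep              # drop-style step: randomly mask scores for diversity
    node_score = scores.sum(dim=-1)     # collapse to a single significance score per node
    return torch.topk(node_score, k).indices

x = torch.randn(50, 64)                 # toy node features
kept_nodes = mid_style_topk(x, k=25)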
Advancing Vision Transformers with Group-Mix Attention
Vision Transformers (ViTs) have been shown to enhance visual recognition
through modeling long-range dependencies with multi-head self-attention (MHSA),
which is typically formulated as Query-Key-Value computation. However, the
attention map generated from the Query and Key captures only token-to-token
correlations at a single granularity. In this paper, we argue that
self-attention should have a more comprehensive mechanism to capture
correlations among tokens and groups (i.e., multiple adjacent tokens) for
higher representational capacity. Therefore, we propose Group-Mix Attention (GMA)
as an advanced replacement for traditional self-attention, which can
simultaneously capture token-to-token, token-to-group, and group-to-group
correlations with various group sizes. To this end, GMA splits the Query, Key,
and Value into segments uniformly and performs different group aggregations to
generate group proxies. The attention map is computed based on the mixtures of
tokens and group proxies and used to re-combine the tokens and groups in Value.
Based on GMA, we introduce a powerful backbone, namely GroupMixFormer, which
achieves state-of-the-art performance in image classification, object
detection, and semantic segmentation with fewer parameters than existing
models. For instance, GroupMixFormer-L (with 70.3M parameters and 384x384 input)
attains 86.2% Top-1 accuracy on ImageNet-1K without external data, while
GroupMixFormer-B (with 45.8M parameters) attains 51.2% mIoU on ADE20K.
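A simplified, single-head sketch of mixing token-level and group-proxy keys/values in one attention map; GMA itself splits the Query, Key, and Value into segments and applies aggregators of several group sizes, which is omitted here, and the mean-pooling aggregator is an assumption.

import torch
import torch.nn.functional as F

def group_mix_attention(q, k, v, group_size=4):
    """q, k, v: (B, N, D); N is assumed divisible by group_size."""
    B, N, D = k.shape
    # group proxies: aggregate adjacent tokens (mean pooling as one possible aggregator)
    k_group = k.reshape(B, N // group_size, group_size, D).mean(dim=2)
    v_group = v.reshape(B, N // group_size, group_size, D).mean(dim=2)
    # attention over the mixture of individual tokens and group proxies
    k_mix = torch.cat([k, k_group], dim=1)
    v_mix = torch.cat([v, v_group], dim=1)
    attn = F.softmax(q @ k_mix.transpose(-2, -1) / D ** 0.5, dim=-1)
    return attn @ v_mix

q = k = v = torch.randn(2, 16, 64)
out = group_mix_attention(q, k, v)      # (2, 16, 64)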
Not All Instances Contribute Equally: Instance-adaptive Class Representation Learning for Few-Shot Visual Recognition
Few-shot visual recognition refers to recognizing novel visual concepts from a
few labeled instances. Many few-shot visual recognition methods adopt the
metric-based meta-learning paradigm, comparing the query representation with
class representations to predict the category of the query instance. However,
current metric-based methods generally treat all instances equally and
consequently often obtain biased class representations, since not all
instances are equally significant when the instance-level
representations are summarized into a class-level representation. For example, some instances
may contain unrepresentative information, such as excessive background or
content from unrelated concepts, which skews the results. To address the above
issues, we propose a novel metric-based meta-learning framework termed
instance-adaptive class representation learning network (ICRL-Net) for few-shot
visual recognition. Specifically, we develop an adaptive instance revaluing
network that addresses the biased representation issue when
generating the class representation by learning and assigning adaptive weights
to different instances according to their relative significance in the support
set of the corresponding class. Additionally, we design an improved bilinear
instance representation and incorporate two novel structural losses, i.e.,
intra-class instance clustering loss and inter-class representation
distinguishing loss, to further regulate the instance revaluation process and
refine the class representation. We conduct extensive experiments on four
commonly adopted few-shot benchmarks: miniImageNet, tieredImageNet, CIFAR-FS,
and FC100. The experimental results, compared with state-of-the-art
approaches, demonstrate the superiority of our ICRL-Net.
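A hedged sketch of the instance-weighting idea: a small scoring network assigns each support instance a weight before the instances are averaged into a class prototype. The scoring MLP and episode shapes below are stand-ins for the paper's adaptive instance revaluing network, not its actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceWeightedPrototypes(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, support):
        # support: (num_classes, shots, dim) embedded support instances
        w = F.softmax(self.scorer(support), dim=1)     # per-class weights over the shots
        return (w * support).sum(dim=1)                # instance-adaptive class prototypes

support = torch.randn(5, 5, 64)                        # a 5-way 5-shot episode
prototypes = InstanceWeightedPrototypes()(support)     # (5, 64)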
- …