Spatio-Temporal AU Relational Graph Representation Learning For Facial Action Units Detection
This paper presents our Facial Action Units (AUs) recognition submission to
the fifth Affective Behavior Analysis in-the-wild Competition (ABAW). Our
approach consists of three main modules: (i) a pre-trained facial
representation encoder which produces a strong facial representation from each
input face image in the input sequence; (ii) an AU-specific feature generator
that specifically learns a set of AU features from each facial representation;
and (iii) a spatio-temporal graph learning module that constructs a
spatio-temporal graph representation. This graph representation describes AUs
contained in all frames and predicts the occurrence of each AU based on both
the modeled spatial information within the corresponding face and the learned
temporal dynamics among frames. The experimental results show that our approach
outperformed the baseline and that spatio-temporal graph representation
learning allows our model to achieve the best results among all ablated
systems. Our model ranked 4th in the AU recognition track of the 5th ABAW
Competition.
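The graph described in module (iii) can be illustrated with a minimal sketch. The connectivity below is an assumption made for illustration only: AUs in the same frame are fully connected (spatial edges), and each AU is linked to itself in the next frame (temporal edges). The abstract does not specify the actual edge rule, and the function name and node indexing are hypothetical.

```python
def build_spatio_temporal_edges(num_frames, num_aus):
    """Return undirected edges over nodes indexed as t * num_aus + i.

    Assumed connectivity (not taken from the paper):
      - spatial edges join every pair of AUs within one frame;
      - temporal edges join the same AU in consecutive frames.
    """
    def node(t, i):
        return t * num_aus + i

    edges = set()
    for t in range(num_frames):
        # Spatial edges: all AU pairs inside frame t.
        for i in range(num_aus):
            for j in range(i + 1, num_aus):
                edges.add((node(t, i), node(t, j)))
        # Temporal edges: AU i in frame t to AU i in frame t + 1.
        if t + 1 < num_frames:
            for i in range(num_aus):
                edges.add((node(t, i), node(t + 1, i)))
    return edges


edges = build_spatio_temporal_edges(num_frames=3, num_aus=4)
```

With 3 frames and 4 AUs this yields 18 spatial and 8 temporal edges; a learned model would attach features to these nodes and edges rather than use a fixed pattern.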
Scene Consistency Representation Learning for Video Scene Segmentation
A long-term video, such as a movie or TV show, is composed of various scenes,
each of which represents a series of shots sharing the same semantic story.
Spotting the correct scene boundary from the long-term video is a challenging
task, since a model must understand the storyline of the video to figure out
where a scene starts and ends. To this end, we propose an effective
Self-Supervised Learning (SSL) framework to learn better shot representations
from unlabeled long-term videos. More specifically, we present an SSL scheme to
achieve scene consistency, while exploring considerable data augmentation and
shuffling methods to boost the model generalizability. Instead of explicitly
learning the scene boundary features as in the previous methods, we introduce a
vanilla temporal model with less inductive bias to verify the quality of the
shot features. Our method achieves state-of-the-art performance on the task
of Video Scene Segmentation. Additionally, we suggest a fairer and more
reasonable benchmark to evaluate the performance of Video Scene Segmentation
methods. The code is made available.
Comment: Accepted to CVPR 202
Boosting Adversarial Transferability across Model Genus by Deformation-Constrained Warping
Adversarial examples generated by a surrogate model typically exhibit limited
transferability to unknown target systems. To address this problem, many
transferability enhancement approaches (e.g., input transformation and model
augmentation) have been proposed. However, they perform poorly when
attacking systems whose model genus differs from that of the surrogate model. In
this paper, we propose a novel and generic attacking strategy, called
Deformation-Constrained Warping Attack (DeCoWA), that can be effectively
applied to cross-model-genus attacks. Specifically, DeCoWA first augments
input examples via an elastic deformation, namely Deformation-Constrained
Warping (DeCoW), to obtain rich local details of the augmented input. To avoid
severe distortion of global semantics caused by random deformation, DeCoW further
constrains the strength and direction of the warping transformation by a novel
adaptive control strategy. Extensive experiments demonstrate that the
transferable examples crafted by our DeCoWA on CNN surrogates can significantly
hinder the performance of Transformers (and vice versa) on various tasks,
including image classification, video action recognition, and audio
recognition. Code is made available at https://github.com/LinQinLiang/DeCoWA.
Comment: AAAI 202
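The idea of a bounded elastic warp can be sketched in a few lines. This is not the DeCoWA implementation: it draws random offsets at a coarse grid of control points, interpolates them to every pixel, and resamples the image. A fixed `max_shift` clamp stands in for the adaptive control strategy mentioned in the abstract, and all function and parameter names are hypothetical.

```python
import random


def bilinear(grid, y, x):
    """Sample a 2D list at fractional (y, x), clamping to the border."""
    h, w = len(grid), len(grid[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def constrained_warp(image, control=3, max_shift=1.5, seed=0):
    """Warp a 2D grayscale image with a smooth, bounded random offset field.

    Assumption: offsets are bounded by a fixed max_shift (in pixels), which
    replaces the adaptive constraint described in the abstract.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    # Random vertical and horizontal offsets at control x control points.
    dy = [[rng.uniform(-max_shift, max_shift) for _ in range(control)]
          for _ in range(control)]
    dx = [[rng.uniform(-max_shift, max_shift) for _ in range(control)]
          for _ in range(control)]
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            # Pixel position expressed in control-grid coordinates.
            gy = r * (control - 1) / max(h - 1, 1)
            gx = c * (control - 1) / max(w - 1, 1)
            # Smoothly interpolated offset, then resample the source image.
            sy = r + bilinear(dy, gy, gx)
            sx = c + bilinear(dx, gy, gx)
            row.append(bilinear(image, sy, sx))
        out.append(row)
    return out


image = [[r * 4 + c for c in range(4)] for r in range(4)]
warped = constrained_warp(image, seed=1)
```

Because the offsets are interpolated from a few control points, neighbouring pixels move together, so local detail changes while the global layout is largely preserved. Setting `max_shift=0.0` returns the input unchanged.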
Joint Prediction of Audio Event and Annoyance Rating in an Urban Soundscape by Hierarchical Graph Representation Learning
Sound events in daily life carry rich information about the objective world.
The composition of these sounds affects the mood of people in a soundscape.
Most previous approaches focus only on classifying and detecting audio events
and scenes, and may ignore the perceptual quality of these sounds, e.g.
annoyance, which can affect listeners' mood in the environment. To this end,
this paper proposes a novel hierarchical graph representation learning (HGRL)
approach which links objective audio events (AE) with subjective annoyance
ratings (AR) of the soundscape perceived by humans. The hierarchical graph
consists of fine-grained event (fAE) embeddings with single-class event
semantics, coarse-grained event (cAE) embeddings with multi-class event
semantics, and AR embeddings. Experiments show the proposed HGRL successfully
integrates AE with AR for the audio event classification (AEC) and annoyance
rating prediction (ARP) tasks, while coordinating the relations between cAE
and fAE and further aligning the two different grains of AE information with
the AR.
GRATIS: Deep Learning Graph Representation with Task-specific Topology and Multi-dimensional Edge Features
Graphs are powerful for representing various types of real-world data. The
topology (edges' presence) and edges' features of a graph decide the message
passing mechanism among vertices within the graph. While most existing
approaches only manually define a single-value edge to describe the
connectivity or strength of association between a pair of vertices,
task-specific and crucial relationship cues may be disregarded by such manually
defined topology and single-value edge features. In this paper, we propose the
first general graph representation learning framework (called GRATIS) which can
generate a strong graph representation with a task-specific topology and
task-specific multi-dimensional edge features from any arbitrary input. To
learn each edge's presence and multi-dimensional feature, our framework takes
both the corresponding vertex pair and its global contextual information
into consideration, enabling the generated graph representation to have a
globally optimal message passing mechanism for different down-stream tasks. The
principled investigation results achieved for various graph analysis tasks on
11 graph and non-graph datasets show that our GRATIS not only largely
enhances pre-defined graphs but also learns a strong graph representation for
non-graph data, with clear performance improvements on all tasks. In
particular, the learned topology and multi-dimensional edge features provide
complementary task-related cues for graph analysis tasks. Our framework is
effective, robust and flexible, and is a plug-and-play module that can be
combined with different backbones and Graph Neural Networks (GNNs) to generate
a task-specific graph representation from various graph and non-graph data. Our
code is made publicly available at
https://github.com/SSYSteve/Learning-Graph-Representation-with-Task-specific-Topology-and-Multi-dimensional-Edge-Features
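The difference between a single-value edge and a multi-dimensional edge feature can be shown with a toy sketch. It is not the GRATIS method: the topology rule (keep the `k` most similar vertices by dot product) and the edge formula (an element-wise product of the two vertices plus a mean-pooled global context, passed through `tanh`) are assumptions chosen for illustration, and the function name is hypothetical. A real model would learn both from data.

```python
import math


def sketch_graph(vertices, k=1):
    """Build directed edges with one feature vector per edge.

    Assumptions (illustrative only, not taken from the paper):
      - topology: each vertex keeps edges to its k most similar vertices;
      - edge feature: tanh(v_i * v_j + context), computed per dimension.
    """
    n, d = len(vertices), len(vertices[0])
    # Global context: mean of all vertex features.
    context = [sum(v[c] for v in vertices) / n for c in range(d)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    edges = {}
    for i in range(n):
        # Rank the other vertices by similarity to vertex i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dot(vertices[i], vertices[j]),
            reverse=True,
        )
        for j in others[:k]:
            # One value per feature dimension instead of a single scalar.
            edges[(i, j)] = [
                math.tanh(vertices[i][c] * vertices[j][c] + context[c])
                for c in range(d)
            ]
    return edges


edges = sketch_graph([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], k=1)
```

Each entry of `edges` maps a vertex pair to a vector with the same dimensionality as the vertex features, which is the kind of structure a downstream GNN could consume in place of a scalar adjacency weight.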