Global Relation Modeling and Refinement for Bottom-Up Human Pose Estimation
In this paper, we focus on the bottom-up paradigm in multi-person pose
estimation (MPPE). Most previous bottom-up methods consider the relations among
instances only to identify different body parts during post-processing, while
neglecting to model the relations among instances, or with the environment, in
the feature-learning process. In addition, most existing works adopt upsampling
and downsampling operations. During sampling, misalignment with the source
features arises, resulting in deviations in the keypoint features learned by
the model.
To overcome the above limitations, we propose a convolutional neural network
for bottom-up human pose estimation. It involves two basic modules: (i) a
Global Relation Modeling (GRM) module that globally learns relations (e.g.,
environment context, instance interactions) among image regions by fusing
features from multiple stages during feature learning; it is combined with a
spatial-channel attention mechanism that adapts along both the spatial and
channel dimensions. (ii) A Multi-branch Feature Align (MFA) module that
aggregates features from multiple branches to align the fused features and
obtain refined local keypoint representations. Our model can attend to regions
of different granularity, from local to global, which significantly boosts the
performance of multi-person pose estimation. Our results on the COCO and
CrowdPose datasets demonstrate that it is an efficient framework for
multi-person pose estimation.
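The spatial-channel attention mechanism described above can be sketched as
follows. This is a minimal NumPy illustration under assumed gating choices
(global-average channel gates, cross-channel-mean spatial gates), not the
paper's exact GRM design.

```python
import numpy as np

def spatial_channel_attention(feat):
    """Apply channel attention, then spatial attention, to a feature map.

    feat: array of shape (C, H, W). The gating functions below are
    illustrative assumptions, not the GRM module's actual parameterization.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Channel attention: gate each channel by its global average response.
    channel_gate = sigmoid(feat.mean(axis=(1, 2)))        # shape (C,)
    feat = feat * channel_gate[:, None, None]
    # Spatial attention: gate each location by its cross-channel mean.
    spatial_gate = sigmoid(feat.mean(axis=0))             # shape (H, W)
    return feat * spatial_gate[None, :, :]
```

In a full model the gates would be produced by small learned layers rather
than raw means; the point here is only the two-stage spatial/channel reweighting.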
A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition
Human Interaction Recognition is the process of identifying interactive
actions between multiple participants in a specific situation. The aim is to
recognise the action interactions between multiple entities and their meaning.
Many single-CNN models have issues, such as an inability to capture global
instance-interaction features or difficulty in training, leading to ambiguity
in action semantics. In addition, the computational complexity of the
Transformer cannot be ignored, and its ability to capture local information
and motion features in the image is poor. In this work, we propose a
Two-stream Hybrid CNN-Transformer Network (THCT-Net), which exploits the local
specificity of CNNs and models global dependencies through the Transformer.
The CNN and Transformer streams simultaneously model the entity, temporal, and
spatial relationships between interactive entities. Specifically, the
Transformer-based stream integrates 3D convolutions with multi-head
self-attention to learn inter-token correlations, and we propose a new
multi-branch CNN framework for the CNN-based stream that automatically learns
joint spatio-temporal features from skeleton sequences. The convolutional
layers independently learn the local features of each joint neighborhood and
aggregate the features of all joints. The raw skeleton coordinates and their
temporal differences are integrated in a dual-branch paradigm to fuse the
motion features of the skeleton. Besides, a residual structure is added to
speed up training convergence. Finally, the recognition results of the two
branches are fused using parallel splicing. Experimental results on diverse
and challenging datasets demonstrate that the proposed method can better
comprehend and infer the meaning and context of various actions, outperforming
state-of-the-art methods.
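Interpreting the final fusion of the two streams' recognition results as
combining their per-class scores (an assumption; the paper's exact "parallel
splicing" operation may differ), a minimal sketch:

```python
import numpy as np

def fuse_two_streams(cnn_logits, transformer_logits):
    """Fuse per-class scores from the CNN and Transformer streams.

    Both inputs are 1-D arrays of raw class logits. Each stream's logits
    are converted to probabilities and averaged; this is one simple
    late-fusion choice, assumed here for illustration.
    """
    def softmax(x):
        e = np.exp(x - x.max())   # subtract max for numerical stability
        return e / e.sum()
    return 0.5 * (softmax(cnn_logits) + softmax(transformer_logits))
```

Late fusion of class scores keeps the two streams fully decoupled during
training, which is consistent with the residual-and-two-branch design the
abstract describes.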
Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios
Audio-visual question answering (AVQA) is a challenging task that requires
multistep spatio-temporal reasoning over multimodal contexts. Recent works rely
on elaborate target-agnostic parsing of audio-visual scenes for spatial
grounding while treating audio and video as separate entities for temporal
grounding. This paper proposes a new target-aware joint spatio-temporal
grounding network for AVQA. It consists of two key components: the target-aware
spatial grounding module (TSG) and the single-stream joint audio-visual
temporal grounding module (JTG). The TSG can focus on audio-visual cues
relevant to the query subject by utilizing explicit semantics from the
question. Unlike previous two-stream temporal grounding modules that require
an additional audio-visual fusion module, JTG incorporates audio-visual fusion
and question-aware temporal grounding into one module with a simpler
single-stream architecture. The temporal synchronization between audio and
video in the JTG is facilitated by our proposed cross-modal synchrony loss
(CSL). Extensive experiments verify the effectiveness of our proposed method
over existing state-of-the-art methods.
Comment: Accepted to EMNLP 2023 Findings
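One way to realize a cross-modal synchrony objective is a symmetric KL
divergence between the audio and visual temporal attention distributions, so
the two modalities are pushed to attend to the same time steps. This sketch is
an assumption about the form of CSL, not its exact definition:

```python
import numpy as np

def cross_modal_synchrony_loss(audio_att, video_att, eps=1e-8):
    """Symmetric KL divergence between audio and visual temporal attention.

    audio_att, video_att: nonnegative attention weights over T time steps.
    Both are normalized to distributions; the loss is zero exactly when
    the two modalities attend identically over time.
    """
    p = audio_att / (audio_att.sum() + eps)
    q = video_att / (video_att.sum() + eps)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * (kl(p, q) + kl(q, p))
```

Symmetrizing the KL avoids privileging one modality as the reference
distribution.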
Relation-Based Associative Joint Location for Human Pose Estimation in Videos
Video-based human pose estimation (HPE) is a vital yet challenging task.
While deep learning methods have made significant progress on HPE, most
approaches to this task detect each joint independently, damaging the pose
structural information. In this paper, unlike prior methods, we propose a
Relation-based Pose Semantics Transfer Network (RPSTN) to locate joints
associatively. Specifically, we design a lightweight joint relation extractor
(JRE) to model the pose structural features and associatively generate heatmaps
for joints by modeling the relation between any two joints heuristically
instead of building each joint heatmap independently. In this way, the JRE
module captures the spatial configuration of human poses through the pairwise
relationships between joints. Moreover, considering the temporal
semantic continuity of videos, the pose semantic information in the current
frame is beneficial for guiding the location of joints in the next frame.
Therefore, we use the idea of knowledge reuse to propagate the pose semantic
information between consecutive frames. In this way, the proposed RPSTN
captures temporal dynamics of poses. On the one hand, the JRE module can infer
invisible joints according to the relationship between the invisible joints and
other visible joints in space. On the other hand, in time, the proposed model
can transfer pose semantic features from non-occluded frames to occluded
frames to locate occluded joints. Therefore, our method is robust to occlusion
and achieves state-of-the-art results on two challenging datasets,
demonstrating its effectiveness for video-based human pose estimation. We will
release the code and models publicly.
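The pairwise-relation idea behind JRE can be illustrated with a toy refinement
step, where (as an assumption made for this sketch) the relation between two
joints is the cosine similarity of their nonnegative heatmaps, and each
joint's heatmap is refined as a relation-weighted mixture over all joints:

```python
import numpy as np

def joint_relation_refine(heatmaps):
    """Refine per-joint heatmaps using pairwise joint relations.

    heatmaps: array of shape (J, H, W), assumed nonnegative. The (J, J)
    relation matrix holds cosine similarities between flattened heatmaps;
    rows are normalized so each refined map is a convex mixture, letting
    a weak (e.g., occluded) joint borrow evidence from related joints.
    """
    J = heatmaps.shape[0]
    flat = heatmaps.reshape(J, -1)
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    rel = unit @ unit.T                          # pairwise joint relations
    rel = rel / rel.sum(axis=1, keepdims=True)   # row-normalize weights
    return (rel @ flat).reshape(heatmaps.shape)
```

In the actual JRE the relations are learned rather than computed from the
heatmaps themselves; this sketch only shows the relation-weighted refinement
structure.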
CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering
While vision-language pretrained models (VLMs) excel in various multimodal
understanding tasks, their potential in fine-grained audio-visual reasoning,
particularly for audio-visual question answering (AVQA), remains largely
unexplored. AVQA presents specific challenges for VLMs due to the requirement
of visual understanding at the region level and seamless integration with audio
modality. Previous VLM-based AVQA methods merely used CLIP as a feature
encoder but underutilized its knowledge, and, like most AVQA methods, treated
audio and video as separate entities in a dual-stream framework. This paper
proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA
that uses the image-text matching knowledge of the pretrained model through
the natural matching characteristic between audio and video. It consists of
two key components: the
target-aware spatial grounding module (TSG+) and the single-stream joint
temporal grounding module (JTG). Specifically, we propose a TSG+ module to
transfer the image-text matching knowledge from CLIP models to our region-text
matching process without corresponding ground-truth labels. Moreover, unlike
previous separate dual-stream networks that still required an additional
audio-visual fusion module, JTG unifies audio-visual fusion and question-aware
temporal grounding in a simplified single-stream architecture. It treats audio
and video as a cohesive entity and further extends the pretrained image-text
knowledge to audio-text matching by preserving their temporal correlation with
our proposed cross-modal synchrony (CMS) loss. Extensive experiments conducted
on the MUSIC-AVQA benchmark verify the effectiveness of our proposed method
over existing state-of-the-art methods.
Comment: Submitted to the Journal on February 6, 202
MLP-AIR: An Efficient MLP-Based Method for Actor Interaction Relation Learning in Group Activity Recognition
The task of Group Activity Recognition (GAR) aims to predict the activity
category of the group by learning the actor spatial-temporal interaction
relation in the group. Therefore, an effective actor relation learning method
is crucial for the GAR task. Previous works mainly learn the interaction
relation with well-designed GCNs or Transformers. For example, to infer the
actor interaction relation, GCNs need a learnable adjacency matrix, and
Transformers
need to calculate the self-attention. Although the above methods can model the
interaction relation effectively, they also increase the complexity of the
model (the number of parameters and computations). In this paper, we design a
novel MLP-based method for Actor Interaction Relation learning (MLP-AIR) in
GAR. Compared with GCNs and Transformers, our method is a competitive but
conceptually and technically simpler alternative, significantly reducing the
complexity. Specifically, MLP-AIR includes three sub-modules: MLP-based Spatial
relation modeling module (MLP-S), MLP-based Temporal relation modeling module
(MLP-T), and MLP-based Relation refining module (MLP-R). MLP-S is used to model
the spatial relation between different actors in each frame. MLP-T is used to
model the temporal relation between different frames for each actor. MLP-R
further refines the relations among different dimensions of the relation
features to improve their expressive ability. To evaluate MLP-AIR,
we conduct extensive experiments on two widely used benchmarks, including the
Volleyball and Collective Activity datasets. Experimental results demonstrate
that MLP-AIR achieves competitive results with low complexity.
Comment: Submitted to Neurocomputing
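Applying an MLP along the actor axis, in the spirit of MLP-Mixer-style token
mixing (an assumed analogy; the paper's exact MLP-S design may differ), gives
a minimal picture of how spatial relations can be modeled without attention or
a learned adjacency matrix:

```python
import numpy as np

def mlp_spatial_relation(actor_feats, w1, w2):
    """MLP-S-style spatial relation mixing across actors in one frame.

    actor_feats: (N, D) per-actor features. w1: (N, H) and w2: (H, N) are
    hypothetical learned weights that mix information along the actor
    axis, so every actor aggregates features from all others. A residual
    connection preserves the original features.
    """
    relu = lambda x: np.maximum(x, 0)
    mixed = w2.T @ relu(w1.T @ actor_feats)   # (N, D): mix across actors
    return actor_feats + mixed                # residual connection
```

Because the mixing weights are plain matrices, the cost is fixed by N and H,
with none of the pairwise self-attention computation a Transformer would need.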
SDVRF: Sparse-to-Dense Voxel Region Fusion for Multi-modal 3D Object Detection
In the perception task of autonomous driving, multi-modal methods have become
a trend due to the complementary characteristics of LiDAR point clouds and
image data. However, the performance of previous methods is usually limited by
the sparsity of the point cloud or the noise problem caused by the misalignment
between LiDAR and the camera. To solve these two problems, we present a new
concept, the Voxel Region (VR), obtained by dynamically projecting the sparse
local point cloud in each voxel. We further propose a novel fusion method,
named Sparse-to-Dense Voxel Region Fusion (SDVRF). Specifically, more pixels of
the image feature map inside the VR are gathered to supplement the voxel
feature extracted from sparse points and achieve denser fusion. Meanwhile,
different from prior methods, which project the size-fixed grids, our strategy
of generating dynamic regions achieves better alignment and avoids introducing
too much background noise. Furthermore, we propose a multi-scale fusion
framework to extract more contextual information and capture the features of
objects of different sizes. Experiments on the KITTI dataset show that our
method improves the performance of different baselines, especially on classes
of small size, including Pedestrian and Cyclist.
Comment: Submitted to IEEE Transactions on Circuits and Systems for Video
Technology
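The dynamic Voxel Region can be illustrated as the bounding box of a voxel's
projected points on the image, whose features are pooled to supplement the
sparse voxel feature. The bounding-box region definition here is an assumption
for illustration; the paper's VR generation may differ in detail:

```python
import numpy as np

def voxel_region_fusion(proj_uv, image_feat, voxel_feat):
    """Supplement a sparse voxel feature with dense image features.

    proj_uv: (P, 2) integer (u, v) pixel coordinates of the voxel's
    points projected onto the image. image_feat: (H, W, D) image feature
    map. voxel_feat: (D,) feature from the sparse points. The region is
    the bounding box of the projected points, so its size adapts to the
    local point spread instead of being a fixed grid.
    """
    u0, v0 = proj_uv.min(axis=0)
    u1, v1 = proj_uv.max(axis=0)
    region = image_feat[v0:v1 + 1, u0:u1 + 1]      # dynamic region crop
    pooled = region.reshape(-1, region.shape[-1]).mean(axis=0)
    return voxel_feat + pooled                     # dense-to-sparse fusion
```

Letting the region size follow the projected point spread is what gives the
method its better alignment and lower background noise relative to fixed-size
grids.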
