Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition
Current methods for few-shot action recognition mostly follow the
metric-learning framework of ProtoNet. However, they either overlook the
importance of representative prototypes or fail to adequately enhance the
prototypes with multimodal information. In this work, we propose a novel
Multimodal Prototype-Enhanced Network (MORN) that uses the semantic
information of label texts as multimodal information to enhance prototypes.
MORN comprises two modality flows. In the visual flow, a CLIP visual encoder
is introduced, and visual
prototypes are computed by the Temporal-Relational CrossTransformer (TRX)
module. A frozen CLIP text encoder is introduced in the text flow, and a
semantic-enhanced module is used to enhance the text features, which are then
inflated to obtain text prototypes. The final multimodal prototypes are
computed by a multimodal prototype-enhanced module. Moreover, no existing
metric assesses the quality of prototypes. To the best of our knowledge, we
are the first to propose a prototype evaluation metric called Prototype
Similarity Difference (PRIDE), which is used to evaluate the performance of
prototypes in discriminating different categories. We conduct extensive
experiments on four popular datasets. MORN achieves state-of-the-art results on
HMDB51, UCF101, Kinetics and SSv2. MORN also performs well on PRIDE, and we
explore the correlation between PRIDE and accuracy.
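
As a rough illustration of the two ideas above, the sketch below fuses per-class visual and text prototypes with a simple convex combination and scores the result with a PRIDE-style gap between own-class and hardest other-class similarity. The fusion rule, the weighting coefficient alpha, and the exact PRIDE formula are assumptions made for illustration; the paper's multimodal prototype-enhanced module and metric definition may differ.

```python
# Hedged sketch, not the paper's implementation: the fusion rule and the
# PRIDE formula below are assumptions based only on the abstract.
import torch
import torch.nn.functional as F

def enhance_prototypes(visual_protos: torch.Tensor,
                       text_protos: torch.Tensor,
                       alpha: float = 0.5) -> torch.Tensor:
    """Fuse per-class visual and text prototypes, both (C, D), into
    multimodal prototypes via an assumed convex combination."""
    return alpha * visual_protos + (1.0 - alpha) * text_protos

def pride(queries: torch.Tensor, protos: torch.Tensor,
          labels: torch.Tensor) -> torch.Tensor:
    """PRIDE-style score: mean gap between a query's similarity to its own
    class prototype and to the most similar other-class prototype."""
    # (N, C) cosine similarities between queries and class prototypes.
    sims = F.cosine_similarity(queries.unsqueeze(1), protos.unsqueeze(0), dim=-1)
    own = sims.gather(1, labels.unsqueeze(1)).squeeze(1)
    other = sims.masked_fill(
        F.one_hot(labels, protos.size(0)).bool(), float("-inf")
    ).max(dim=1).values
    return (own - other).mean()  # larger gap => better class discrimination
```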
Learning to Sit: Synthesizing Human-Chair Interactions via Hierarchical Control
Recent progress on physics-based character animation has shown impressive
breakthroughs on human motion synthesis, through imitating motion capture data
via deep reinforcement learning. However, such methods have mostly been
demonstrated on imitating a single distinct motion pattern and do not generalize to
interactive tasks that require flexible motion patterns due to varying
human-object spatial configurations. To bridge this gap, we focus on one class
of interactive tasks -- sitting onto a chair. We propose a hierarchical
reinforcement learning framework which relies on a collection of subtask
controllers trained to imitate simple, reusable mocap motions, and a meta
controller trained to execute the subtasks properly to complete the main task.
We experimentally demonstrate the strength of our approach over different
non-hierarchical and hierarchical baselines. We also show that our approach can
be applied to motion prediction given an image input. A supplementary video can
be found at https://youtu.be/3CeN0OGz2cA.
Comment: Accepted to AAAI 202
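
A minimal sketch of the hierarchical control loop described above: a meta controller picks a subtask index each macro-step, and the chosen low-level subtask controller outputs the actual action. The network shapes, the greedy switching rule, and the interfaces are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch of a hierarchical controller: the layer sizes, greedy
# subtask selection, and interfaces are illustrative assumptions.
import torch
import torch.nn as nn

class SubtaskController(nn.Module):
    """Low-level policy trained to imitate one simple, reusable mocap motion."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

class MetaController(nn.Module):
    """High-level policy that decides which subtask to execute next."""
    def __init__(self, obs_dim: int, num_subtasks: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_subtasks))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).argmax(dim=-1)  # greedy subtask index

def step(meta: MetaController, subtasks: list, obs: torch.Tensor) -> torch.Tensor:
    """One control step: the meta controller selects, the subtask controller acts."""
    idx = int(meta(obs))
    return subtasks[idx](obs)
```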
Meta-Auxiliary Learning for Adaptive Human Pose Prediction
Predicting high-fidelity future human poses from a historically observed
sequence is crucial for intelligent robots to interact with humans. Deep
end-to-end learning approaches, which typically train a generic model on
external datasets and then apply it directly to all test samples, have
emerged as the dominant solution to this problem. Despite encouraging
progress, they remain non-optimal, as they cannot adapt to the unique
properties (e.g., motion style, rhythm) of a specific sequence. Moreover, at
test time, once unseen motion categories (out-of-distribution) are
encountered, the predicted poses tend to be unreliable. Motivated by this
observation, we
propose a novel test-time adaptation framework that leverages two
self-supervised auxiliary tasks to help the primary forecasting network adapt
to the test sequence. In the testing phase, our model adjusts its parameters
through several gradient updates to improve generation quality. However, due
to catastrophic forgetting, the two auxiliary tasks alone often fail to
provide the desired positive incentives for the final prediction
performance. For this reason, we also propose a
meta-auxiliary learning scheme for better adaptation. Under the general
setup, our approach obtains higher accuracy, and under two new experimental
designs for out-of-distribution data (unseen subjects and categories), it
achieves significant improvements.
Comment: 10 pages, 6 figures, AAAI 2023 accepted
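
A minimal sketch of such a test-time adaptation loop: a copy of the pre-trained forecaster takes a few gradient steps on a self-supervised loss computed from the observed test sequence alone, then produces the final forecast. The masked-joint reconstruction task, the `reconstruct` head, and the hyperparameters are illustrative assumptions, not the paper's two auxiliary tasks or its meta-learned initialization.

```python
# Hedged sketch of test-time adaptation; the auxiliary task (masked-joint
# reconstruction via an assumed `reconstruct` head) and all hyperparameters
# are illustrative assumptions, not the paper's method.
import copy
import torch
import torch.nn.functional as F

def adapt_and_predict(model, observed: torch.Tensor,
                      num_steps: int = 5, lr: float = 1e-4) -> torch.Tensor:
    """observed: (T, J, D) historical poses for a single test sequence."""
    adapted = copy.deepcopy(model)  # leave the generic pre-trained model untouched
    opt = torch.optim.Adam(adapted.parameters(), lr=lr)
    for _ in range(num_steps):
        masked = observed.clone()
        drop = torch.rand(observed.shape[1]) < 0.2  # mask ~20% of the joints
        masked[:, drop] = 0.0
        recon = adapted.reconstruct(masked)         # hypothetical auxiliary head
        loss = F.mse_loss(recon[:, drop], observed[:, drop])
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return adapted(observed)  # forecast future poses with adapted weights
```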