Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition

Ji, Yatai; Liu, Yong; Ni, Xinzhe; Wen, Hao; Yang, Yujiu

Publication date: 9 December 2022

Abstract

Current methods for few-shot action recognition mainly fall into the metric learning framework following ProtoNet. However, they either ignore the effect of representative prototypes or fail to adequately enhance the prototypes with multimodal information. In this work, we propose a novel Multimodal Prototype-Enhanced Network (MORN), which uses the semantic information of label texts as multimodal information to enhance prototypes and comprises two modality flows. In the visual flow, a CLIP visual encoder is introduced and visual prototypes are computed by the Temporal-Relational CrossTransformer (TRX) module. In the text flow, a frozen CLIP text encoder is introduced and a semantic-enhanced module is used to enhance the text features; the enhanced features are then inflated to obtain text prototypes. The final multimodal prototypes are computed by a multimodal prototype-enhanced module. In addition, no existing metric evaluates the quality of prototypes. To the best of our knowledge, we are the first to propose a prototype evaluation metric, Prototype Similarity Difference (PRIDE), which measures how well prototypes discriminate between categories. We conduct extensive experiments on four popular datasets: MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics and SSv2. MORN also performs well on PRIDE, and we explore the correlation between PRIDE and accuracy.
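
To make the prototype idea in the abstract concrete, the following is a minimal sketch of ProtoNet-style classification with text-enhanced prototypes. It is not the MORN implementation: the abstract does not specify how the multimodal prototype-enhanced module combines the two flows, so plain averaging of the visual prototype and the text feature is an assumption made here for illustration. The function names (`mean_vector`, `multimodal_prototypes`, `classify`) and the toy two-dimensional features are hypothetical, and the CLIP encoders and TRX module are replaced by hand-written vectors.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multimodal_prototypes(support, text_features):
    """Build one prototype per class from visual and text features.

    support: dict mapping label -> list of visual feature vectors.
    text_features: dict mapping label -> text feature vector of the label.
    ASSUMPTION: fusion is a plain average of the visual prototype and the
    text feature; the paper's actual fusion module is not described here.
    """
    prototypes = {}
    for label, features in support.items():
        visual_prototype = mean_vector(features)  # ProtoNet: class mean
        prototypes[label] = mean_vector([visual_prototype, text_features[label]])
    return prototypes


def classify(query, prototypes):
    """Assign the query to the class whose prototype is most similar."""
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy 2-way 2-shot episode with hand-written 2-D features (hypothetical data).
support = {
    "run": [[1.0, 0.0], [0.8, 0.2]],
    "jump": [[0.0, 1.0], [0.2, 0.8]],
}
text_features = {"run": [1.0, 0.1], "jump": [0.1, 1.0]}

prototypes = multimodal_prototypes(support, text_features)
prediction = classify([0.9, 0.2], prototypes)
```

In this toy episode the query lies close to the "run" support features, so `prediction` is `"run"`. Swapping the averaging step for a learned module would leave the nearest-prototype classification step unchanged.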

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2212.04873

Last time updated on 08/01/2023