7 research outputs found

Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition

Authors: Ji Yatai, Liu Yong, Ni Xinzhe, Wen Hao, Yang Yujiu
Publication date: 09/12/2022

Current methods for few-shot action recognition mainly fall into the metric learning framework following ProtoNet. However, they either ignore the effect of representative prototypes or fail to adequately enhance the prototypes with multimodal information. In this work, we propose a novel Multimodal Prototype-Enhanced Network (MORN) that uses the semantic information of label texts as multimodal information to enhance prototypes, and that includes two modality flows. A CLIP visual encoder is introduced in the visual flow, and visual prototypes are computed by the Temporal-Relational CrossTransformer (TRX) module. A frozen CLIP text encoder is introduced in the text flow, and a semantic-enhanced module is used to enhance text features. After inflating, text prototypes are obtained. The final multimodal prototypes are then computed by a multimodal prototype-enhanced module. Moreover, there are no existing metrics for evaluating the quality of prototypes. To the best of our knowledge, we are the first to propose a prototype evaluation metric, called Prototype Similarity Difference (PRIDE), which is used to evaluate the performance of prototypes in discriminating different categories. We conduct extensive experiments on four popular datasets. MORN achieves state-of-the-art results on HMDB51, UCF101, Kinetics and SSv2. MORN also performs well on PRIDE, and we explore the correlation between PRIDE and accuracy.

arXiv.org e-Print Archive
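
As a rough illustration of the ProtoNet-style metric learning framework that the abstract above refers to, the sketch below builds one prototype per class as the mean of its support embeddings and assigns a query to the nearest prototype. It is not MORN, TRX, or PRIDE: the function names, the 2-D embeddings, and the class labels are all hypothetical, and Euclidean distance is an assumption made for the example.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_prototypes(support):
    """Map each class label to the mean of its support embeddings."""
    return {label: mean_vector(embeddings) for label, embeddings in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-D embeddings for a 2-way, 2-shot episode (illustration only).
support = {
    "jump": [[1.0, 0.0], [0.8, 0.2]],
    "run": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support)
prediction = classify([0.9, 0.3], prototypes)
```

In a real few-shot pipeline the embeddings would come from a learned encoder rather than being written by hand; only the prototype-and-nearest-neighbour step is shown here.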

Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

Authors: Bian Hengwei, Ji Yatai, Li Xiu, Liu Yong, Luo Zhuoyan, Ma Yue, Xiao Yicheng, Yang Yujiu
Publication date: 27/11/2023

Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them together with a transformer-based architecture. However, we observe that the emphasis of MR and HD differs, with one necessitating the perception of local relationships and the other prioritizing the understanding of global contexts. Consequently, the lack of task-specific design will inevitably lead to limitations in associating the intrinsic specialty of the two tasks. To tackle the issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra- and inter-modality across multi-granularity, UVCOM achieves a comprehensive understanding in processing a video. Moreover, we present multi-aspect contrastive learning to consolidate the local relation modeling and global knowledge accumulation via a well-aligned multi-modal space. Extensive experiments on the QVHighlights, Charades-STA, TACoS, YouTube Highlights and TVSum datasets demonstrate the effectiveness and rationality of UVCOM, which outperforms the state-of-the-art methods by a remarkable margin.

arXiv.org e-Print Archive

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

Authors: Gong Yuan, Ji Yatai, Sakai Tetsuya, Wang Hongfa, Wang Junjie, Yang Yujiu, Zhang Jiaxing, Zhang Lin, Zhu Yanru
Publication date: 26/03/2023

Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets. Such uncertainty, including inter- and intra-modal uncertainty, is problematic for our interpretation. Little work has studied the modeling of this uncertainty, particularly in pre-training on unlabeled datasets and fine-tuning on task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) by utilizing sequence-level interactions. Compared to the existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results. Comment: CVPR 2023 accep

arXiv.org e-Print Archive

Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning

Authors: Cai Chengfei, Ji Yatai, Jiang Jie, Kong Weijie, Liu Wei, Tu Rongcheng, Wang Hongfa, Yang Yujiu, Zhao Wenzhe
Publication date: 24/11/2022

Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to focus on reconstructing the masked tokens based on visible context for learning local-to-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, resulting in the limited cross-modal alignment ability of global representations. Therefore, in this paper, we propose a novel Semantic Completion Learning (SCL) task, complementary to existing masked modeling tasks, to facilitate global-to-local alignment. Specifically, the SCL task complements the missing semantics of masked data by capturing the corresponding information from the other modality, promoting the learning of more representative global features, which have a great impact on the performance of downstream tasks. Moreover, we present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.

arXiv.org e-Print Archive

A User Interface Design for Collaborations between Humans and Intelligent Vehicles

Author: Yatai Ji
Publication venue: Center for Open Science
Publication date: 30/03/2024

OSF Preprints

Campusense: Capturing Crowd Movements on Campus through a Mobile Web Application

Author: Yatai Ji
Publication venue: Center for Open Science
Publication date: 12/04/2024

OSF Preprints

Reliability-Oriented Design of Inverter-Fed Low-Voltage Electrical Machines: Potential Solutions

Authors: Michael Galea, Paolo Giangrande, Vincenzo Madonna, Weiduo Zhao, Yatai Ji
Publication venue: MDPI AG
Publication date: 01/07/2021

Transportation electrification has kept pushing low-voltage inverter-fed electrical machines to reach a higher power density while guaranteeing appropriate reliability levels. Methods commonly adopted to boost power density (i.e., higher current density, faster switching frequency for high speed, and higher DC link voltage) will unavoidably increase the stress on the insulation system, which leads to a decrease in reliability. Thus, a trade-off is required between power density and reliability during the machine design. Currently, it is a challenging task to evaluate reliability during the design stage, and the over-engineering approach is applied. To solve this problem, physics of failure (POF) is introduced and its feasibility for electrical machine (EM) design is discussed by reviewing past work on insulation investigation. Special focus is then given to partial discharge (PD), whose occurrence means the end-of-life of low-voltage EMs. The PD-free design methodology, based on understanding the physics of PD, is presented to substitute the over-engineering approach. Finally, a comprehensive reliability-oriented design (ROD) approach adopting POF and the PD-free design strategy is given as a potential solution for reliable and high-performance inverter-fed low-voltage EM design.

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

Open Access Repository