Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
Transferring knowledge from task-agnostic pre-trained deep models to downstream tasks is an important topic in computer vision research. With the growth of computational capacity, open-source vision-language models pre-trained at large scale, in both model architecture and amount of data, are now available. In this study, we focus on transferring knowledge for video classification tasks. Conventional methods randomly initialize the linear classifier head for visual classification, leaving the use of the text encoder for downstream visual recognition tasks unexplored. In this paper, we revisit the role of the linear classifier and replace it with knowledge from the pre-trained model: we utilize the well-pretrained language model to generate good semantic targets for efficient transfer learning. Our empirical study shows that this method improves both the performance and the training speed of video classification, with a negligible change to the model. This simple yet effective tuning paradigm achieves state-of-the-art performance and efficient training in various video recognition scenarios, i.e., zero-shot, few-shot, and general recognition. In particular, our paradigm achieves a state-of-the-art accuracy of 87.8% on Kinetics-400 and surpasses previous methods by 20-50% absolute top-1 accuracy under zero-shot and few-shot settings on five popular video datasets. Code and models can be found at https://github.com/whwu95/Text4Vis.
Comment: Accepted by AAAI-2023. Camera-ready version.
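A minimal sketch of the classifier-replacement idea described above, assuming the off-the-shelf OpenAI CLIP package; the class names and prompt template are illustrative, not taken from the paper:

```python
# Minimal sketch (not the authors' code): use frozen CLIP text embeddings of the
# class names as the video classifier, instead of a randomly initialized linear head.
import torch
import clip  # OpenAI CLIP package; any VLM with a text encoder works similarly

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

class_names = ["archery", "bowling", "yoga"]  # illustrative labels, not Kinetics-400
prompts = clip.tokenize([f"a video of a person doing {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_weights = model.encode_text(prompts).float()     # (num_classes, dim)
    text_weights = text_weights / text_weights.norm(dim=-1, keepdim=True)

def classify(video_features: torch.Tensor) -> torch.Tensor:
    """video_features: (batch, dim) pooled features from the visual encoder."""
    video_features = video_features / video_features.norm(dim=-1, keepdim=True)
    return 100.0 * video_features @ text_weights.t()       # logits over class names
```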
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Most existing text-video retrieval methods focus on cross-modal matching
between the visual content of videos and textual query sentences. However, in
real-world scenarios, online videos are often accompanied by relevant text
information such as titles, tags, and even subtitles, which can be utilized to
match textual queries. This insight has motivated us to propose a novel
approach to text-video retrieval, where we directly generate associated
captions from videos using zero-shot video captioning with knowledge from
web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated
captions, a natural question arises: what benefits do they bring to text-video
retrieval? To answer this, we introduce Cap4Video, a new framework that
leverages captions in three ways: i) Input data: video-caption pairs can
augment the training data. ii) Intermediate feature interaction: we perform
cross-modal feature interaction between the video and caption to produce
enhanced video representations. iii) Output score: the Query-Caption matching
branch can complement the original Query-Video matching branch for text-video
retrieval. We conduct comprehensive ablation studies to demonstrate the
effectiveness of our approach. Without any post-processing, Cap4Video achieves
state-of-the-art performance on four standard text-video retrieval benchmarks:
MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is
available at https://github.com/whwu95/Cap4Video.
Comment: Accepted by CVPR 2023. Selected as a Highlight (top 2.5% of all submissions).
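As one concrete illustration of item iii above, a hedged sketch of output-level score fusion between the Query-Video and Query-Caption branches; the cosine-similarity formulation and the fusion weight alpha are assumptions for the example, not values from the paper:

```python
# Illustrative sketch of output-level fusion: the final retrieval score is a
# weighted sum of Query-Video and Query-Caption similarities.
import torch
import torch.nn.functional as F

def retrieval_scores(query_emb, video_emb, caption_emb, alpha=0.8):
    """query_emb: (Q, d) text-query embeddings; video_emb: (V, d) video embeddings;
    caption_emb: (V, d) embeddings of auto-generated captions, one per video.
    alpha is an assumed fusion weight, not taken from the paper."""
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    query_video = q @ v.t()      # (Q, V) Query-Video similarity
    query_caption = q @ c.t()    # (Q, V) Query-Caption similarity
    return alpha * query_video + (1 - alpha) * query_caption
```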
Click-aware Structure Transfer with Sample Weight Assignment for Post-Click Conversion Rate Estimation
The Post-click Conversion Rate (CVR) prediction task plays an essential role in industrial applications such as recommendation and advertising. Conventional CVR methods typically suffer from data sparsity because they rely only on samples where the user has clicked. To address this problem, researchers have introduced multi-task learning, which utilizes non-clicked samples and shares feature representations between the Click-Through Rate (CTR) task and the CVR task. However, the CVR and CTR tasks are fundamentally different and may even be contradictory; introducing a large amount of CTR information without distinction may therefore drown out valuable information related to CVR. We call this phenomenon the curse-of-knowledge problem. To tackle this issue, we argue that a trade-off should be struck between introducing large amounts of auxiliary information and protecting valuable information related to CVR. Hence, we propose a Click-aware Structure Transfer model with sample Weight Assignment, abbreviated as CSTWA. Instead of directly sharing feature representations, it pays more attention to latent structure information, which can filter the input information related to CVR. Meanwhile, to capture the representation conflict between CTR and CVR, we calibrate the representation layer and reweight the discriminant layer to excavate click-bias information from the CTR tower. Moreover, CSTWA incorporates a sample weight assignment algorithm biased towards CVR modeling, so that knowledge from CTR does not mislead the CVR task. Extensive experiments on industrial and public datasets demonstrate that CSTWA significantly outperforms widely used and competitive models.
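A hedged sketch of a CVR-biased joint objective in the spirit of the sample weight assignment described above; the weighting rule and the cvr_bias hyper-parameter are illustrative assumptions, not the paper's algorithm:

```python
# Hedged sketch: CTR is trained on all impressions, CVR only on clicked impressions,
# and per-sample weights (a simple illustrative rule) bias learning towards CVR.
import torch
import torch.nn.functional as F

def joint_loss(ctr_logits, cvr_logits, click, conversion, cvr_bias=2.0):
    """click, conversion: 0/1 tensors of shape (batch,); cvr_bias is an assumed
    hyper-parameter, not a value from the paper."""
    click = click.float()
    ctr_loss = F.binary_cross_entropy_with_logits(ctr_logits, click)
    # CVR supervision exists only for clicked samples; weight them up so shared
    # parameters are not dominated by CTR gradients.
    weights = 1.0 + cvr_bias * click
    cvr_raw = F.binary_cross_entropy_with_logits(
        cvr_logits, conversion.float(), reduction="none")
    cvr_loss = (weights * click * cvr_raw).sum() / click.sum().clamp(min=1)
    return ctr_loss + cvr_loss
```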
What Can Simple Arithmetic Operations Do for Temporal Modeling?
Temporal modeling plays a crucial role in understanding video content. To
tackle this problem, previous studies have built complicated temporal relations across the time sequence, enabled by increasingly powerful computing devices. In this work, we explore the potential of four simple arithmetic
operations for temporal modeling. Specifically, we first capture auxiliary
temporal cues by computing addition, subtraction, multiplication, and division
between pairs of extracted frame features. Then, we extract corresponding
features from these cues to benefit the original temporal-irrespective domain. We term this simple pipeline the Arithmetic Temporal Module (ATM), which operates on the stem of a visual backbone in a plug-and-play style. We conduct
comprehensive ablation studies on the instantiation of ATMs and demonstrate
that this module provides powerful temporal modeling capability at a low
computational cost. Moreover, the ATM is compatible with both CNN- and ViT-based architectures. Our results show that ATM achieves superior performance on several popular video benchmarks. Specifically, on Something-Something V1, V2, and Kinetics-400, we reach top-1 accuracies of 65.6%, 74.6%, and 89.4%, respectively. The code is available at https://github.com/whwu95/ATM.
Comment: Accepted by ICCV 2023.
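A minimal sketch of the arithmetic cues described above, computed between adjacent frame features; the function name, pairing scheme, and tensor layout are assumptions for illustration rather than the authors' exact instantiation:

```python
# Minimal sketch: element-wise add, subtract, multiply, and divide between
# pairs of adjacent frame features to form auxiliary temporal cues.
import torch

def arithmetic_temporal_cues(frames: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """frames: (batch, time, dim) features extracted per frame."""
    a, b = frames[:, :-1], frames[:, 1:]           # adjacent frame pairs
    cues = torch.stack([a + b, a - b, a * b, a / (b + eps)], dim=-1)
    return cues                                     # (batch, time-1, dim, 4)
```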
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
This paper does not present a novel method. Instead, it delves into an essential, must-know baseline in light of the latest advancements in Generative Artificial Intelligence (GenAI): the utilization of GPT-4 for visual understanding. Our study centers on evaluating GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. First, we explore the potential of its generated rich textual descriptions across various categories to enhance recognition performance without any training. Second,
we evaluate GPT-4's visual proficiency in directly recognizing diverse visual
content. We conducted extensive experiments to systematically evaluate GPT-4's
performance across images, videos, and point clouds, using 16 benchmark
datasets to measure top-1 and top-5 accuracy. Our findings show that GPT-4,
enhanced with rich linguistic descriptions, significantly improves zero-shot
recognition, offering an average top-1 accuracy increase of 7% across all
datasets. GPT-4 excels in visual recognition, outshining OpenAI-CLIP's ViT-L and rivaling EVA-CLIP's ViT-E, particularly on the video datasets HMDB-51 and UCF-101, where it leads by 22% and 9%, respectively. We hope this research contributes valuable data points and experience for future studies. We release our code at https://github.com/whwu95/GPT4Vis.
Comment: Technical report. GPT-4V retested and results updated.
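A hedged sketch of the first setting above: rich class descriptions are embedded with a CLIP text encoder and averaged per class before zero-shot matching. The categories, descriptions, and model choice are placeholders, not GPT-4 output or the paper's configuration:

```python
# Sketch: ensemble generated descriptions per class into a single text embedding,
# then classify image features by cosine similarity.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)

descriptions = {  # placeholder descriptions, not GPT-4 output
    "push up": ["a person lowers and raises their body with their arms",
                "an exercise performed face-down on the floor"],
    "high jump": ["an athlete runs and leaps over a horizontal bar"],
}

class_weights = []
with torch.no_grad():
    for texts in descriptions.values():
        emb = model.encode_text(clip.tokenize(texts).to(device)).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_weights.append(emb.mean(dim=0))       # average over descriptions
class_weights = torch.stack(class_weights)          # (num_classes, dim)

def zero_shot_logits(image_features: torch.Tensor) -> torch.Tensor:
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    w = class_weights / class_weights.norm(dim=-1, keepdim=True)
    return 100.0 * image_features @ w.t()
```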
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Vision-language models (VLMs) pre-trained on large-scale image-text pairs
have demonstrated impressive transferability on various visual tasks.
Transferring knowledge from such powerful VLMs is a promising direction for
building effective video recognition models. However, current exploration in
this field is still limited. We believe that the greatest value of pre-trained
VLMs lies in building a bridge between visual and textual domains. In this
paper, we propose a novel framework called BIKE, which utilizes the cross-modal
bridge to explore bidirectional knowledge: i) We introduce the Video Attribute
Association mechanism, which leverages the Video-to-Text knowledge to generate
textual auxiliary attributes for complementing video recognition. ii) We also
present a Temporal Concept Spotting mechanism that uses the Text-to-Video
expertise to capture temporal saliency in a parameter-free manner, leading to
enhanced video representation. Extensive studies on six popular video datasets,
including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show
that our method achieves state-of-the-art performance in various recognition
scenarios, such as general, zero-shot, and few-shot video recognition. Our best
model achieves a state-of-the-art accuracy of 88.6% on the challenging
Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE.
Comment: Accepted by CVPR 2023.
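An illustrative sketch of the parameter-free temporal saliency idea behind the Temporal Concept Spotting mechanism (item ii above): each frame is weighted by its similarity to the category text embedding before pooling. The temperature and pooling choices are assumptions, not the paper's exact recipe:

```python
# Sketch: frame-level saliency from text-frame similarity, used to pool a video feature.
import torch
import torch.nn.functional as F

def saliency_pooled_video(frame_feats: torch.Tensor, text_feat: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """frame_feats: (time, dim) per-frame features; text_feat: (dim,) category embedding."""
    f = F.normalize(frame_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    sal = F.softmax((f @ t) / tau, dim=0)               # (time,) temporal saliency weights
    return (sal.unsqueeze(-1) * frame_feats).sum(dim=0)  # saliency-weighted video feature
```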
DetToolChain: a new prompting paradigm to unleash detection ability of MLLM
We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot object detection ability of multimodal large language models (MLLMs), such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new chain-of-thought to implement these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), read coordinates according to measurement standards (e.g., overlaying rulers and compasses), and infer from contextual information (e.g., overlaying scene graphs). Building upon these tools, the new detection chain-of-thought can automatically decompose the task into simple subtasks, diagnose the predictions, and plan progressive box refinements. The effectiveness of our framework is demonstrated across a spectrum of detection tasks, especially hard cases. Compared to existing state-of-the-art methods, GPT-4V with our DetToolChain improves state-of-the-art object detectors by +21.5% AP50 on the MS COCO novel class set for open-vocabulary detection, +24.23% accuracy on the RefCOCO val set for zero-shot referring expression comprehension, and +14.5% AP on the D-cube described object detection FULL setting.
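A hypothetical sketch of assembling one step of such a detection chain-of-thought prompt; build_det_prompt and its wording are illustrative inventions, not part of the released DetToolChain toolkit:

```python
# Hypothetical illustration (not the released toolkit): build a single
# chain-of-thought prompt that asks an MLLM to zoom, read ruler overlays,
# diagnose current boxes, and return refined coordinates.
def build_det_prompt(query: str, prior_boxes: list) -> str:
    steps = [
        f"1. Zoom into regions likely to contain '{query}'.",
        "2. Read coordinates against the overlaid ruler marks.",
        f"3. Diagnose the current predictions: {prior_boxes}.",
        "4. Return refined boxes as [x1, y1, x2, y2] in pixel coordinates.",
    ]
    return "\n".join(steps)

# Example: ask for refinement of an initial guess for "a red umbrella".
print(build_det_prompt("a red umbrella", [[40, 60, 120, 200]]))
```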
The interplay between thermodynamics and kinetics in the solid-state synthesis of layered oxides.
In the synthesis of inorganic materials, reactions often yield non-equilibrium kinetic byproducts instead of the thermodynamic equilibrium phase. Understanding the competition between thermodynamics and kinetics is a fundamental step towards the rational synthesis of target materials. Here, we use in situ synchrotron X-ray diffraction to investigate the multistage crystallization pathways of the important two-layer (P2) sodium oxides Na0.67MO2 (M = Co, Mn). We observe a series of fast non-equilibrium phase transformations through metastable three-layer O3, O3' and P3 phases before formation of the equilibrium two-layer P2 polymorph. We present a theoretical framework to rationalize the observed phase progression, demonstrating that even though P2 is the equilibrium phase, compositionally unconstrained reactions between powder precursors favour the formation of non-equilibrium three-layered intermediates. These insights can guide the choice of precursors and parameters employed in the solid-state synthesis of ceramic materials, and constitute a step forward in unravelling the complex interplay between thermodynamics and kinetics during materials synthesis.
Homeostatic regulation of extracellular signal-regulated kinase 1/2 activity and axonal Kv7.3 expression by prolonged blockade of hippocampal neuronal activity
Homeostatic plasticity encompasses the mechanisms by which neurons stabilize their synaptic strength and excitability in response to prolonged and destabilizing changes in their network activity. Prolonged activity blockade leads to homeostatic scaling of action potential (AP) firing rate in hippocampal neurons, in part through decreased activity of N-methyl-D-aspartate receptors and subsequent transcriptional down-regulation of potassium channel genes, including KCNQ3, which encodes Kv7.3. Neuronal Kv7 channels are mostly heterotetramers of Kv7.2 and Kv7.3 subunits and are highly enriched at the axon initial segment (AIS), where their current potently inhibits repetitive and burst firing of APs. However, whether a decrease in Kv7.3 expression occurs at the AIS during homeostatic scaling of intrinsic excitability, and what signaling pathway reduces the KCNQ3 transcript upon prolonged activity blockade, remain unknown. Here, we report that prolonged activity blockade in cultured hippocampal neurons reduces the activity of extracellular signal-regulated kinase 1/2 (ERK1/2), followed by a decrease in the activation of the brain-derived neurotrophic factor (BDNF) receptor, tropomyosin receptor kinase B (TrkB). Furthermore, both prolonged activity blockade and prolonged pharmacological inhibition of ERK1/2 decrease KCNQ3 and BDNF transcripts as well as the density of Kv7.3 and ankyrin-G at the AIS. Collectively, our findings suggest that a reduction in ERK1/2 activity and subsequent transcriptional down-regulation may serve as a potential signaling pathway linking prolonged activity blockade to homeostatic control of BDNF-TrkB signaling and Kv7.3 density at the AIS during homeostatic scaling of AP firing rate.