Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
Transferring knowledge from task-agnostic pre-trained deep models to downstream tasks is an important topic in computer vision research. With the growth of computational capacity, open-source vision-language models pre-trained at large scale, in both model architecture and amount of data, are now available. In this study, we focus on transferring knowledge for video classification tasks. Conventional methods randomly initialize the linear classifier head for visual classification, leaving the use of the text encoder for downstream visual recognition tasks unexplored. In this paper, we revisit the role of the linear classifier and replace it with knowledge from the pre-trained model: we utilize the well-pretrained language model to generate good semantic targets for efficient transfer learning. Our empirical study shows that this method improves both the performance and the training speed of video classification, with a negligible change to the model. This simple yet effective tuning paradigm achieves state-of-the-art performance and efficient training in various video recognition scenarios, i.e., zero-shot, few-shot, and general recognition. In particular, our paradigm achieves a state-of-the-art accuracy of 87.8% on Kinetics-400 and surpasses previous methods by 20-50% absolute top-1 accuracy under zero-shot and few-shot settings on five popular video datasets. Code and models can be found at https://github.com/whwu95/Text4Vis.
Comment: Accepted by AAAI-2023. Camera-ready version.
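A minimal sketch of the classifier-replacement idea described above, assuming the off-the-shelf OpenAI CLIP package; the class names and prompt template are illustrative, not taken from the paper:

```python
# Minimal sketch (not the authors' code): use frozen CLIP text embeddings of the
# class names as the video classifier, instead of a randomly initialized linear head.
import torch
import clip  # OpenAI CLIP package; any VLM with a text encoder works similarly

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

class_names = ["archery", "bowling", "yoga"]  # illustrative labels, not Kinetics-400
prompts = clip.tokenize([f"a video of a person doing {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_weights = model.encode_text(prompts).float()     # (num_classes, dim)
    text_weights = text_weights / text_weights.norm(dim=-1, keepdim=True)

def classify(video_features: torch.Tensor) -> torch.Tensor:
    """video_features: (batch, dim) pooled features from the visual encoder."""
    video_features = video_features / video_features.norm(dim=-1, keepdim=True)
    return 100.0 * video_features @ text_weights.t()       # logits over class names
```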
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Most existing text-video retrieval methods focus on cross-modal matching
between the visual content of videos and textual query sentences. However, in
real-world scenarios, online videos are often accompanied by relevant text
information such as titles, tags, and even subtitles, which can be utilized to
match textual queries. This insight has motivated us to propose a novel
approach to text-video retrieval, where we directly generate associated
captions from videos using zero-shot video captioning with knowledge from
web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated
captions, a natural question arises: what benefits do they bring to text-video
retrieval? To answer this, we introduce Cap4Video, a new framework that
leverages captions in three ways: i) Input data: video-caption pairs can
augment the training data. ii) Intermediate feature interaction: we perform
cross-modal feature interaction between the video and caption to produce
enhanced video representations. iii) Output score: the Query-Caption matching
branch can complement the original Query-Video matching branch for text-video
retrieval. We conduct comprehensive ablation studies to demonstrate the
effectiveness of our approach. Without any post-processing, Cap4Video achieves
state-of-the-art performance on four standard text-video retrieval benchmarks:
MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is
available at https://github.com/whwu95/Cap4Video.
Comment: Accepted by CVPR 2023. Selected as a Highlight (top 2.5% of all submissions).
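As one concrete illustration of item iii above, a hedged sketch of output-level score fusion between the Query-Video and Query-Caption branches; the cosine-similarity formulation and the fusion weight alpha are assumptions for the example, not values from the paper:

```python
# Illustrative sketch of output-level fusion: the final retrieval score is a
# weighted sum of Query-Video and Query-Caption similarities.
import torch
import torch.nn.functional as F

def retrieval_scores(query_emb, video_emb, caption_emb, alpha=0.8):
    """query_emb: (Q, d) text-query embeddings; video_emb: (V, d) video embeddings;
    caption_emb: (V, d) embeddings of auto-generated captions, one per video.
    alpha is an assumed fusion weight, not taken from the paper."""
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    query_video = q @ v.t()      # (Q, V) Query-Video similarity
    query_caption = q @ c.t()    # (Q, V) Query-Caption similarity
    return alpha * query_video + (1 - alpha) * query_caption
```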
Click-aware Structure Transfer with Sample Weight Assignment for Post-Click Conversion Rate Estimation
The Post-click Conversion Rate (CVR) prediction task plays an essential role in industrial applications such as recommendation and advertising. Conventional CVR methods typically suffer from data sparsity because they rely only on samples where the user has clicked. To address this problem, researchers have introduced multi-task learning, which utilizes non-clicked samples and shares feature representations between the Click-Through Rate (CTR) task and the CVR task. However, the CVR and CTR tasks are fundamentally different and may even be contradictory; introducing a large amount of CTR information without distinction may therefore drown out valuable information related to CVR. We call this phenomenon the curse-of-knowledge problem. To tackle this issue, we argue that a trade-off should be struck between introducing large amounts of auxiliary information and protecting valuable information related to CVR. Hence, we propose a Click-aware Structure Transfer model with sample Weight Assignment, abbreviated as CSTWA. Instead of directly sharing feature representations, it pays more attention to latent structure information, which can filter the input information related to CVR. Meanwhile, to capture the representation conflict between CTR and CVR, we calibrate the representation layer and reweight the discriminant layer to excavate click-bias information from the CTR tower. Moreover, CSTWA incorporates a sample weight assignment algorithm biased towards CVR modeling, so that knowledge from CTR does not mislead the CVR task. Extensive experiments on industrial and public datasets demonstrate that CSTWA significantly outperforms widely used and competitive models.
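A hedged sketch of a CVR-biased joint objective in the spirit of the sample weight assignment described above; the weighting rule and the cvr_bias hyper-parameter are illustrative assumptions, not the paper's algorithm:

```python
# Hedged sketch: CTR is trained on all impressions, CVR only on clicked impressions,
# and per-sample weights (a simple illustrative rule) bias learning towards CVR.
import torch
import torch.nn.functional as F

def joint_loss(ctr_logits, cvr_logits, click, conversion, cvr_bias=2.0):
    """click, conversion: 0/1 tensors of shape (batch,); cvr_bias is an assumed
    hyper-parameter, not a value from the paper."""
    click = click.float()
    ctr_loss = F.binary_cross_entropy_with_logits(ctr_logits, click)
    # CVR supervision exists only for clicked samples; weight them up so shared
    # parameters are not dominated by CTR gradients.
    weights = 1.0 + cvr_bias * click
    cvr_raw = F.binary_cross_entropy_with_logits(
        cvr_logits, conversion.float(), reduction="none")
    cvr_loss = (weights * click * cvr_raw).sum() / click.sum().clamp(min=1)
    return ctr_loss + cvr_loss
```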
What Can Simple Arithmetic Operations Do for Temporal Modeling?
Temporal modeling plays a crucial role in understanding video content. To
tackle this problem, previous studies have built complicated temporal relations across the time sequence, enabled by increasingly powerful computing devices. In this work, we explore the potential of four simple arithmetic
operations for temporal modeling. Specifically, we first capture auxiliary
temporal cues by computing addition, subtraction, multiplication, and division
between pairs of extracted frame features. Then, we extract corresponding
features from these cues to benefit the original temporal-irrespective domain. We term this simple pipeline the Arithmetic Temporal Module (ATM), which operates on the stem of a visual backbone in a plug-and-play style. We conduct
comprehensive ablation studies on the instantiation of ATMs and demonstrate
that this module provides powerful temporal modeling capability at a low
computational cost. Moreover, the ATM is compatible with both CNN- and ViT-based architectures. Our results show that ATM achieves superior performance on several popular video benchmarks. Specifically, on Something-Something V1, V2, and Kinetics-400, we reach top-1 accuracies of 65.6%, 74.6%, and 89.4%, respectively. The code is available at https://github.com/whwu95/ATM.
Comment: Accepted by ICCV 2023.
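A minimal sketch of the arithmetic cues described above, computed between adjacent frame features; the function name, pairing scheme, and tensor layout are assumptions for illustration rather than the authors' exact instantiation:

```python
# Minimal sketch: element-wise add, subtract, multiply, and divide between
# pairs of adjacent frame features to form auxiliary temporal cues.
import torch

def arithmetic_temporal_cues(frames: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """frames: (batch, time, dim) features extracted per frame."""
    a, b = frames[:, :-1], frames[:, 1:]           # adjacent frame pairs
    cues = torch.stack([a + b, a - b, a * b, a / (b + eps)], dim=-1)
    return cues                                     # (batch, time-1, dim, 4)
```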
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
This paper does not present a novel method. Instead, it delves into an essential, must-know baseline in light of the latest advancements in Generative Artificial Intelligence (GenAI): the utilization of GPT-4 for visual understanding. Our study centers on evaluating GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. First, we explore the potential of its generated rich textual descriptions across various categories to enhance recognition performance without any training. Second,
we evaluate GPT-4's visual proficiency in directly recognizing diverse visual
content. We conducted extensive experiments to systematically evaluate GPT-4's
performance across images, videos, and point clouds, using 16 benchmark
datasets to measure top-1 and top-5 accuracy. Our findings show that GPT-4,
enhanced with rich linguistic descriptions, significantly improves zero-shot
recognition, offering an average top-1 accuracy increase of 7% across all
datasets. GPT-4 excels in visual recognition, outshining OpenAI-CLIP's ViT-L and rivaling EVA-CLIP's ViT-E, particularly on the video datasets HMDB-51 and UCF-101, where it leads by 22% and 9%, respectively. We hope this research contributes valuable data points and experience for future studies. We release our code at https://github.com/whwu95/GPT4Vis.
Comment: Technical report. GPT-4V retested and results updated.
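A hedged sketch of the first setting above: rich class descriptions are embedded with a CLIP text encoder and averaged per class before zero-shot matching. The categories, descriptions, and model choice are placeholders, not GPT-4 output or the paper's configuration:

```python
# Sketch: ensemble generated descriptions per class into a single text embedding,
# then classify image features by cosine similarity.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)

descriptions = {  # placeholder descriptions, not GPT-4 output
    "push up": ["a person lowers and raises their body with their arms",
                "an exercise performed face-down on the floor"],
    "high jump": ["an athlete runs and leaps over a horizontal bar"],
}

class_weights = []
with torch.no_grad():
    for texts in descriptions.values():
        emb = model.encode_text(clip.tokenize(texts).to(device)).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_weights.append(emb.mean(dim=0))       # average over descriptions
class_weights = torch.stack(class_weights)          # (num_classes, dim)

def zero_shot_logits(image_features: torch.Tensor) -> torch.Tensor:
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    w = class_weights / class_weights.norm(dim=-1, keepdim=True)
    return 100.0 * image_features @ w.t()
```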
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Vision-language models (VLMs) pre-trained on large-scale image-text pairs
have demonstrated impressive transferability on various visual tasks.
Transferring knowledge from such powerful VLMs is a promising direction for
building effective video recognition models. However, current exploration in
this field is still limited. We believe that the greatest value of pre-trained
VLMs lies in building a bridge between visual and textual domains. In this
paper, we propose a novel framework called BIKE, which utilizes the cross-modal
bridge to explore bidirectional knowledge: i) We introduce the Video Attribute
Association mechanism, which leverages the Video-to-Text knowledge to generate
textual auxiliary attributes for complementing video recognition. ii) We also
present a Temporal Concept Spotting mechanism that uses the Text-to-Video
expertise to capture temporal saliency in a parameter-free manner, leading to
enhanced video representation. Extensive studies on six popular video datasets,
including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show
that our method achieves state-of-the-art performance in various recognition
scenarios, such as general, zero-shot, and few-shot video recognition. Our best
model achieves a state-of-the-art accuracy of 88.6% on the challenging
Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE.
Comment: Accepted by CVPR 2023.
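An illustrative sketch of the parameter-free temporal saliency idea behind the Temporal Concept Spotting mechanism (item ii above): each frame is weighted by its similarity to the category text embedding before pooling. The temperature and pooling choices are assumptions, not the paper's exact recipe:

```python
# Sketch: frame-level saliency from text-frame similarity, used to pool a video feature.
import torch
import torch.nn.functional as F

def saliency_pooled_video(frame_feats: torch.Tensor, text_feat: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """frame_feats: (time, dim) per-frame features; text_feat: (dim,) category embedding."""
    f = F.normalize(frame_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    sal = F.softmax((f @ t) / tau, dim=0)               # (time,) temporal saliency weights
    return (sal.unsqueeze(-1) * frame_feats).sum(dim=0)  # saliency-weighted video feature
```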
DetToolChain: a new prompting paradigm to unleash detection ability of MLLM
We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot object detection ability of multimodal large language models (MLLMs), such as GPT-4V and Gemini. Our approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new chain-of-thought to implement these prompts. Specifically, the prompts in the toolkit are designed to guide the MLLM to focus on regional information (e.g., zooming in), read coordinates according to measurement standards (e.g., overlaying rulers and compasses), and infer from contextual information (e.g., overlaying scene graphs). Building upon these tools, the new detection chain-of-thought can automatically decompose the task into simple subtasks, diagnose the predictions, and plan progressive box refinements. The effectiveness of our framework is demonstrated across a spectrum of detection tasks, especially hard cases. Compared to existing state-of-the-art methods, GPT-4V with our DetToolChain improves state-of-the-art object detectors by +21.5% AP50 on the MS COCO novel class set for open-vocabulary detection, +24.23% accuracy on the RefCOCO val set for zero-shot referring expression comprehension, and +14.5% AP on the D-cube described object detection FULL setting.
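A hypothetical sketch of assembling one step of such a detection chain-of-thought prompt; build_det_prompt and its wording are illustrative inventions, not part of the released DetToolChain toolkit:

```python
# Hypothetical illustration (not the released toolkit): build a single
# chain-of-thought prompt that asks an MLLM to zoom, read ruler overlays,
# diagnose current boxes, and return refined coordinates.
def build_det_prompt(query: str, prior_boxes: list) -> str:
    steps = [
        f"1. Zoom into regions likely to contain '{query}'.",
        "2. Read coordinates against the overlaid ruler marks.",
        f"3. Diagnose the current predictions: {prior_boxes}.",
        "4. Return refined boxes as [x1, y1, x2, y2] in pixel coordinates.",
    ]
    return "\n".join(steps)

# Example: ask for refinement of an initial guess for "a red umbrella".
print(build_det_prompt("a red umbrella", [[40, 60, 120, 200]]))
```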
The interplay between thermodynamics and kinetics in the solid-state synthesis of layered oxides.
In the synthesis of inorganic materials, reactions often yield non-equilibrium kinetic byproducts instead of the thermodynamic equilibrium phase. Understanding the competition between thermodynamics and kinetics is a fundamental step towards the rational synthesis of target materials. Here, we use in situ synchrotron X-ray diffraction to investigate the multistage crystallization pathways of the important two-layer (P2) sodium oxides Na0.67MO2 (M = Co, Mn). We observe a series of fast non-equilibrium phase transformations through metastable three-layer O3, O3' and P3 phases before formation of the equilibrium two-layer P2 polymorph. We present a theoretical framework to rationalize the observed phase progression, demonstrating that even though P2 is the equilibrium phase, compositionally unconstrained reactions between powder precursors favour the formation of non-equilibrium three-layered intermediates. These insights can guide the choice of precursors and parameters employed in the solid-state synthesis of ceramic materials, and constitute a step forward in unravelling the complex interplay between thermodynamics and kinetics during materials synthesis.
Homeostatic regulation of extracellular signal-regulated kinase 1/2 activity and axonal Kv7.3 expression by prolonged blockade of hippocampal neuronal activity
Homeostatic plasticity encompasses the mechanisms by which neurons stabilize their synaptic strength and excitability in response to prolonged and destabilizing changes in their network activity. Prolonged activity blockade leads to homeostatic scaling of action potential (AP) firing rate in hippocampal neurons, in part through decreased activity of N-methyl-D-aspartate receptors and subsequent transcriptional down-regulation of potassium channel genes, including KCNQ3, which encodes Kv7.3. Neuronal Kv7 channels are mostly heterotetramers of Kv7.2 and Kv7.3 subunits and are highly enriched at the axon initial segment (AIS), where their current potently inhibits repetitive and burst firing of APs. However, whether a decrease in Kv7.3 expression occurs at the AIS during homeostatic scaling of intrinsic excitability, and what signaling pathway reduces the KCNQ3 transcript upon prolonged activity blockade, remain unknown. Here, we report that prolonged activity blockade in cultured hippocampal neurons reduces the activity of extracellular signal-regulated kinase 1/2 (ERK1/2), followed by a decrease in the activation of the brain-derived neurotrophic factor (BDNF) receptor, tropomyosin receptor kinase B (TrkB). Furthermore, both prolonged activity blockade and prolonged pharmacological inhibition of ERK1/2 decrease KCNQ3 and BDNF transcripts as well as the density of Kv7.3 and ankyrin-G at the AIS. Collectively, our findings suggest that a reduction in ERK1/2 activity and subsequent transcriptional down-regulation may serve as a potential signaling pathway linking prolonged activity blockade to homeostatic control of BDNF-TrkB signaling and Kv7.3 density at the AIS during homeostatic scaling of AP firing rate.