Learning Combinatorial Prompts for Universal Controllable Image Captioning
Controllable Image Captioning (CIC) -- generating natural language
descriptions about images under the guidance of given control signals -- is one
of the most promising directions towards next-generation captioning systems.
To date, various kinds of control signals for CIC have been proposed, ranging
from content-related control to structure-related control. However, due to the
format and target gaps of different control signals, all existing CIC works (or
architectures) focus on only one type of control signal and overlook the
human-like combinatorial ability. By "combinatorial", we mean that we humans
can easily meet multiple needs (or constraints) simultaneously when generating
descriptions. To this end, we propose a novel prompt-based framework for CIC
that learns Combinatorial Prompts, dubbed ComPro. Specifically, we directly
use a pretrained language model, GPT-2, as our language model, which helps
bridge the gap between different signal-specific CIC architectures. Then, we
reformulate CIC as a prompt-guided sentence generation problem and propose
a new lightweight prompt generation network to generate the combinatorial
prompts for different kinds of control signals. We further design a new mask
attention mechanism to realize prompt-based CIC across these signals. Due to
its simplicity, our ComPro can easily be extended to more complex combined
control signals by concatenating their prompts. Extensive experiments on two
prevalent CIC benchmarks have verified the effectiveness and efficiency of
our ComPro on both single and combined control signals.
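The combination-by-concatenation idea can be illustrated with a minimal sketch: each control signal is mapped to a short prompt sequence, and a combined control is simply the concatenation of the per-signal prompts prepended to the GPT-2 token embeddings. All module names, shapes, and the two example signals below are illustrative assumptions, not the paper's actual prompt network or mask attention.

```python
# Hedged sketch of combinatorial prompts (hypothetical shapes and names).
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    """Maps one control-signal embedding to a short prompt sequence."""
    def __init__(self, signal_dim: int, prompt_len: int, lm_dim: int):
        super().__init__()
        self.prompt_len = prompt_len
        self.proj = nn.Sequential(
            nn.Linear(signal_dim, lm_dim * prompt_len),
            nn.Tanh(),
        )

    def forward(self, signal: torch.Tensor) -> torch.Tensor:
        # signal: (batch, signal_dim) -> (batch, prompt_len, lm_dim)
        return self.proj(signal).view(signal.size(0), self.prompt_len, -1)

lm_dim = 768  # GPT-2 hidden size
content_gen = PromptGenerator(signal_dim=512, prompt_len=8, lm_dim=lm_dim)
structure_gen = PromptGenerator(signal_dim=16, prompt_len=4, lm_dim=lm_dim)

content_signal = torch.randn(2, 512)   # e.g. features of regions to describe
structure_signal = torch.randn(2, 16)  # e.g. a sentence-structure code

# Combined control = concatenation of the per-signal prompts; the result
# would be prepended to the GPT-2 token embeddings for generation.
combined = torch.cat([content_gen(content_signal),
                      structure_gen(structure_signal)], dim=1)
print(combined.shape)  # torch.Size([2, 12, 768])
```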
A Comprehensive Survey of Automated Audio Captioning
Automated audio captioning, a task that mimics human perception as well as
innovatively links audio processing and natural language processing, has
seen much progress over the last few years. Audio captioning requires
recognizing the acoustic scene, primary audio events and sometimes the spatial
and temporal relationship between events in an audio clip. It also requires
describing these elements in fluent and vivid sentences. Deep learning-based
approaches are widely adopted to tackle this problem. This paper
situates itself as a comprehensive review covering the benchmark datasets,
existing deep learning techniques and the evaluation metrics in automated audio
captioning.
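As background for the deep learning approaches the survey covers, a typical encoder-decoder audio captioning baseline can be sketched as follows. The layer choices, sizes, and names are assumptions for illustration and are not tied to any particular system in the survey.

```python
# Minimal encoder-decoder skeleton for audio captioning (illustrative only).
import torch
import torch.nn as nn

class AudioCaptioner(nn.Module):
    def __init__(self, n_mels=64, hidden=256, vocab_size=5000):
        super().__init__()
        # Encoder: summarise the log-mel spectrogram into a hidden state.
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        # Decoder: generate the caption one token at a time.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, mel, tokens):
        # mel: (batch, frames, n_mels); tokens: (batch, seq_len)
        _, h = self.encoder(mel)              # h: (1, batch, hidden)
        dec_out, _ = self.decoder(self.embed(tokens), h)  # audio-conditioned
        return self.out(dec_out)              # (batch, seq_len, vocab_size)

model = AudioCaptioner()
logits = model(torch.randn(2, 300, 64), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 5000])
```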
Towards Interaction-level Video Action Understanding
A huge number of videos are created, shared, and viewed daily, and human actions and activities account for a large part of them. We want machines to understand human actions in videos, as this is essential to various applications, including but not limited to autonomous driving, security systems, human-robot interaction, and healthcare. Toward truly intelligent systems that can interact with humans, video understanding must go beyond simply answering "what is the action in the video?" and become more aware of what those actions mean to humans and more in line with human thinking, which we call interaction-level action understanding. This thesis identifies three main challenges on the way to interaction-level video action understanding: 1) understanding actions given human consensus; 2) understanding actions based on specific human rules; 3) directly understanding actions in videos via human natural language. For the first challenge, we select video summarization as a representative task, which aims to select informative frames that retain high-level information based on human annotators' experience. Through a self-attention architecture and meta-learning, which jointly process dual representations of visual and sequential information for video summarization, the proposed model is capable of understanding video from human consensus (e.g., which parts of an action sequence humans consider essential). For the second challenge, our work on action quality assessment uses transformer decoders to parse the input action into several sub-actions and assess the fine-grained quality of the given action, yielding the capability of action understanding given specific human rules (e.g., how well a diving action is performed, how well a robot performs surgery). The third key idea explored in this thesis is to use graph neural networks in an adversarial fashion to understand actions through natural language. We demonstrate the utility of this technique on the video captioning task, which takes an action video as input and outputs natural language, and it yields state-of-the-art performance. In conclusion, the research directions and methods introduced in this thesis provide fundamental components toward interaction-level action understanding.
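The frame-scoring formulation behind the first challenge can be made concrete with a generic self-attention scorer. This is a simplified stand-in: the thesis model additionally processes dual visual/sequential representations and uses meta-learning, and all sizes and names here are assumptions.

```python
# Generic self-attention frame scorer for video summarization (a stand-in,
# not the thesis model; feature dimension and head count are assumptions).
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    def __init__(self, feat_dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frames):
        # frames: (batch, n_frames, feat_dim) visual features
        ctx, _ = self.attn(frames, frames, frames)  # each frame sees all others
        return torch.sigmoid(self.score(ctx)).squeeze(-1)  # importance in [0, 1]

scores = FrameScorer()(torch.randn(1, 120, 1024))
keyframes = scores.topk(k=15, dim=1).indices  # keep the top-scoring frames
print(keyframes.shape)  # torch.Size([1, 15])
```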
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
In text-video retrieval, recent works have benefited from the powerful
learning capabilities of pre-trained text-image foundation models (e.g., CLIP)
by adapting them to the video domain. A critical problem for them is how to
effectively capture the rich semantics inside the video using the image encoder
of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal
modeling techniques to fuse the text information into video frame
representations, which, however, incurs severe efficiency issues in large-scale
retrieval systems as the video representations must be recomputed online for
every text query. In this paper, we discard this problematic cross-modal fusion
process and aim to learn semantically enhanced representations purely from the
video, so that the video representations can be computed offline and reused for
different texts. Concretely, we first introduce a spatial-temporal "Prompt
Cube" into the CLIP image encoder and iteratively switch it within the encoder
layers to efficiently incorporate the global video semantics into frame
representations. We then propose to apply an auxiliary video captioning
objective to train the frame representations, which facilitates the learning of
detailed video semantics by providing fine-grained guidance in the semantic
space. With a naive temporal fusion strategy (i.e., mean-pooling) on the
enhanced frame representations, we obtain state-of-the-art performance on
three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
Comment: to appear in ICCV 2023
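A rough sketch of the switching idea: prompt tokens attached to each frame are transposed across the frame and prompt axes between encoder layers, so per-frame encoding can exchange global video information before a text-independent, mean-pooled video representation is computed offline. The shapes, the plain transformer layer standing in for CLIP's image encoder, and the transpose details below are assumptions, not the paper's exact Prompt Cube design.

```python
# Simplified "prompt switch" illustration (hypothetical shapes and layers).
import torch
import torch.nn as nn

T, P, D = 8, 8, 512                    # frames, prompts per frame, feature dim
layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)

frame_tokens = torch.randn(T, 50, D)   # 50 patch tokens per frame
prompt_cube = torch.randn(T, P, D)     # learnable in practice

x = torch.cat([prompt_cube, frame_tokens], dim=1)   # (T, P+50, D)
for _ in range(2):                     # one shared layer applied twice as a
    x = layer(x)                       # stand-in for a stack of layers
    prompts, patches = x[:, :P], x[:, P:]
    # The switch: prompt j of frame i becomes prompt i of frame j,
    # mixing information across frames between layers.
    prompts = prompts.transpose(0, 1)
    x = torch.cat([prompts, patches], dim=1)

# Naive temporal fusion: mean-pool the enhanced frame representations.
frame_repr = x[:, P:].mean(dim=1)      # (T, D), one vector per frame
video_repr = frame_repr.mean(dim=0)    # (D,), reusable for any text query
print(video_repr.shape)  # torch.Size([512])
```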
Dual Attention on Pyramid Feature Maps for Image Captioning
Generating natural sentences from images is a fundamental learning task for
visual-semantic understanding in multimedia. In this paper, we propose to apply
dual attention on pyramid image feature maps to fully explore the
visual-semantic correlations and improve the quality of generated sentences.
Specifically, with the full consideration of the contextual information
provided by the hidden state of the RNN controller, the pyramid attention can
better localize the visually indicative and semantically consistent regions in
images. On the other hand, the contextual information can help re-calibrate the
importance of feature components by learning the channel-wise dependencies, to
improve the discriminative power of visual features for better content
description. We conducted comprehensive experiments on three well-known
datasets, Flickr8K, Flickr30K, and MS COCO, achieving impressive results in
generating descriptive and fluent natural sentences from images. Using either
convolutional visual features or more informative bottom-up attention features,
our composite captioning model achieves very promising performance in a
single-model mode. The proposed pyramid attention and dual attention methods
are highly modular and can be inserted into various image captioning models
to further improve performance.
Comment: in IEEE Transactions on Multimedia, 2021
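The interplay of spatial attention and channel re-calibration described above can be sketched as follows; the module names, dimensions, and exact gating form are assumptions rather than the paper's architecture.

```python
# Context-conditioned dual attention sketch (illustrative names and sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttention(nn.Module):
    def __init__(self, feat_dim=2048, hid_dim=512):
        super().__init__()
        self.spatial = nn.Linear(feat_dim + hid_dim, 1)  # where to look
        self.channel = nn.Linear(hid_dim, feat_dim)      # what to amplify

    def forward(self, feats, h):
        # feats: (batch, regions, feat_dim) flattened pyramid feature maps
        # h: (batch, hid_dim) hidden state of the RNN controller (context)
        h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        alpha = F.softmax(self.spatial(torch.cat([feats, h_exp], -1)), dim=1)
        attended = (alpha * feats).sum(dim=1)            # (batch, feat_dim)
        gate = torch.sigmoid(self.channel(h))            # channel-wise weights
        return attended * gate                           # re-calibrated feature

ctx = DualAttention()(torch.randn(2, 196, 2048), torch.randn(2, 512))
print(ctx.shape)  # torch.Size([2, 2048])
```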
Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation
Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
the past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research.
Evaluating the Performance of Transformer architecture over Attention architecture on Image Captioning
Over the last few decades, computer vision and natural language processing have shown tremendous improvement in tasks such as image captioning, video captioning, and machine translation using deep learning models. However, there has been little research on transformer-based image captioning and on how it compares with other models implemented for the task. In this study, we design a simple encoder-decoder model, an attention model, and a transformer model for image captioning on the Flickr8K dataset, and discuss the hyperparameters of each model, the type of pre-trained model used, and how long each model was trained. Furthermore, we compare the captions generated by the attention model and the transformer model using BLEU score metrics, and further analyse them through human evaluation conducted with an intrinsic approach. After statistical tests on the BLEU scores and analysis of the human evaluation, we found that the transformer model with multi-head attention outperformed the attention model in image captioning.
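The BLEU comparison described in this study can be reproduced in outline with NLTK's sentence-level BLEU; the reference and candidate captions below are invented for illustration.

```python
# Comparing two generated captions against a reference with BLEU-4.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "runs", "across", "the", "grass"]]
candidates = {
    "attention":   ["a", "dog", "is", "running", "on", "grass"],
    "transformer": ["a", "dog", "runs", "across", "the", "field"],
}

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
for name, hyp in candidates.items():
    score = sentence_bleu(references, hyp,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=smooth)
    print(f"{name}: BLEU-4 = {score:.3f}")
```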
Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Most video-and-language representation learning approaches employ contrastive
learning, e.g., CLIP, to project the video and text features into a common
latent space according to the semantic similarities of text-video pairs.
However, such learned shared latent spaces are often not optimal, and the
modality gap between visual and textual representations cannot be fully
eliminated. In this paper, we propose Expectation-Maximization Contrastive
Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact
set of bases for the latent space, where the features could be concisely
represented as the linear combinations of these bases. Such feature
decomposition of video-and-language representations reduces the rank of the
latent space, resulting in increased representational power for the semantics.
Extensive experiments on three benchmark text-video retrieval datasets prove
that our EMCL can learn more discriminative video-and-language representations
than previous methods, and significantly outperform previous state-of-the-art
methods across all metrics. More encouragingly, the proposed method can be
applied to boost the performance of existing approaches either as a jointly
trained layer or as an out-of-the-box inference module with no extra training,
making it easy to incorporate into existing methods.
Comment: Accepted to NeurIPS 2022
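The core idea, re-expressing features as linear combinations of a compact basis set estimated by Expectation-Maximization, can be sketched generically as follows. The soft-assignment E-step, the normalised M-step, and all sizes are assumptions for illustration, not the authors' implementation.

```python
# Generic EM loop for a compact basis set (illustrative, not EMCL itself).
import torch
import torch.nn.functional as F

def em_bases(feats, K=32, iters=5, tau=0.05):
    # feats: (N, D) L2-normalised video/text features in the shared space
    N, D = feats.shape
    bases = feats[torch.randperm(N)[:K]].clone()  # init bases from samples
    for _ in range(iters):
        # E-step: soft assignment of each feature to each basis.
        resp = F.softmax(feats @ bases.t() / tau, dim=1)   # (N, K)
        # M-step: responsibility-weighted feature sums, then L2-normalised.
        bases = F.normalize(resp.t() @ feats, dim=1)       # (K, D)
    # Each feature re-expressed as a linear combination of the K bases,
    # so the reconstructed set has rank at most K.
    return resp @ bases                                     # (N, D)

feats = F.normalize(torch.randn(256, 512), dim=1)
compact = em_bases(feats)
print(compact.shape)  # torch.Size([256, 512])
```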