
    Learning Combinatorial Prompts for Universal Controllable Image Captioning

    Controllable Image Captioning (CIC) -- generating natural language descriptions of images under the guidance of given control signals -- is one of the most promising directions towards next-generation captioning systems. To date, various kinds of control signals for CIC have been proposed, ranging from content-related to structure-related control. However, due to the format and target gaps between different control signals, all existing CIC works (or architectures) focus on one particular control signal and overlook the human-like combinatorial ability. By "combinatorial", we mean that humans can easily meet multiple needs (or constraints) simultaneously when generating descriptions. To this end, we propose a novel prompt-based framework for CIC that learns Combinatorial Prompts, dubbed ComPro. Specifically, we directly utilize a pretrained language model, GPT-2, as our language model, which helps bridge the gap between different signal-specific CIC architectures. We then reformulate CIC as a prompt-guided sentence generation problem and propose a new lightweight prompt generation network to produce combinatorial prompts for different kinds of control signals. For different control signals, we further design a new mask attention mechanism to realize prompt-based CIC. Thanks to its simplicity, ComPro can easily be extended to more complex combined control signals by concatenating their prompts. Extensive experiments on two prevalent CIC benchmarks verify the effectiveness and efficiency of ComPro on both single and combined control signals.
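    A minimal sketch of the general idea (not the authors' code): map each control signal to a short learned prompt, concatenate the prompts for combined control, and prepend them to GPT-2's input embeddings. The prompt generator, signal dimensions, and prompt lengths below are illustrative assumptions.

        # Sketch: combinatorial prompting of GPT-2; all module shapes are placeholders.
        import torch
        import torch.nn as nn
        from transformers import GPT2LMHeadModel, GPT2Tokenizer

        class PromptGenerator(nn.Module):
            """Maps one control-signal feature to a short sequence of prompt vectors."""
            def __init__(self, signal_dim, prompt_len, embed_dim):
                super().__init__()
                self.prompt_len = prompt_len
                self.proj = nn.Sequential(nn.Linear(signal_dim, embed_dim * prompt_len), nn.Tanh())

            def forward(self, signal):               # signal: (B, signal_dim)
                out = self.proj(signal)              # (B, prompt_len * embed_dim)
                return out.view(signal.size(0), self.prompt_len, -1)

        gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        tok = GPT2Tokenizer.from_pretrained("gpt2")
        embed_dim = gpt2.config.n_embd

        # Hypothetical control signals: a content-related one and a structure-related one.
        content_gen = PromptGenerator(signal_dim=512, prompt_len=4, embed_dim=embed_dim)
        structure_gen = PromptGenerator(signal_dim=16, prompt_len=2, embed_dim=embed_dim)
        content_feat = torch.randn(1, 512)           # stand-in for content features
        structure_feat = torch.randn(1, 16)          # stand-in for a structure code

        # Combined control = simple concatenation of the per-signal prompts.
        prompts = torch.cat([content_gen(content_feat), structure_gen(structure_feat)], dim=1)

        ids = tok("a photo of", return_tensors="pt").input_ids
        tok_embeds = gpt2.transformer.wte(ids)        # token embeddings of the caption prefix
        out = gpt2(inputs_embeds=torch.cat([prompts, tok_embeds], dim=1))
        print(out.logits.shape)                       # logits for prompt-guided generation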

    A Comprehensive Survey of Automated Audio Captioning

    Automated audio captioning, a task that mimics human perception and innovatively links audio processing with natural language processing, has seen much progress over the last few years. Audio captioning requires recognizing the acoustic scene, the primary audio events and sometimes the spatial and temporal relationships between events in an audio clip, and describing these elements in a fluent and vivid sentence. Deep learning-based approaches are widely adopted to tackle this problem. This paper serves as a comprehensive review covering the benchmark datasets, existing deep learning techniques and the evaluation metrics in automated audio captioning.

    Towards Interaction-level Video Action Understanding

    A huge number of videos are created, spread, and viewed every day, and human actions and activities account for a large part of them. We want machines to understand human actions in videos, as this is essential to various applications including, but not limited to, autonomous driving, security systems, human-robot interaction and healthcare. Towards truly intelligent systems that can interact with humans, video understanding must go beyond simply answering "what is the action in the video" and become more aware of what those actions mean to humans and more in line with human thinking, which we call interaction-level action understanding. This thesis identifies three main challenges on the way to interaction-level video action understanding: 1) understanding actions given human consensus; 2) understanding actions based on specific human rules; 3) directly understanding actions in videos via natural language. For the first challenge, we select video summarization as a representative task, which aims to select informative frames that retain high-level information based on human annotators' experience. Through a self-attention architecture and meta-learning, which jointly process dual representations of visual and sequential information for video summarization, the proposed model is capable of understanding video from human consensus (e.g., how humans decide which parts of an action sequence are essential). For the second challenge, our work on action quality assessment uses transformer decoders to parse the input action into several sub-actions and assess the fine-grained quality of the given action, yielding the capability of action understanding under specific human rules (e.g., how well a diving action is performed, or how well a robot performs surgery). The third key idea explored in this thesis is to use graph neural networks in an adversarial fashion to understand actions through natural language. We demonstrate the utility of this technique on the video captioning task, which takes an action video as input, outputs natural language, and achieves state-of-the-art performance. In conclusion, the research directions and methods introduced in this thesis provide fundamental components towards interaction-level action understanding.
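    As a rough illustration of the first challenge only (not the thesis' dual-representation or meta-learning model), the sketch below scores frame importance for video summarization with a small self-attention encoder over pre-extracted frame features; all dimensions and layer counts are placeholders.

        # Sketch: self-attention frame-importance scoring for video summarization.
        import torch
        import torch.nn as nn

        class FrameScorer(nn.Module):
            def __init__(self, feat_dim=1024, n_heads=8, n_layers=2):
                super().__init__()
                layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                                   batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
                self.head = nn.Linear(feat_dim, 1)

            def forward(self, frames):                 # frames: (B, T, feat_dim)
                ctx = self.encoder(frames)             # contextualized frame features
                return torch.sigmoid(self.head(ctx)).squeeze(-1)   # (B, T) importance scores

        scores = FrameScorer()(torch.randn(1, 120, 1024))
        keep = scores.topk(k=18, dim=1).indices        # keep the highest-scoring frames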

    Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

    In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain. A critical problem for these methods is how to effectively capture the rich semantics inside the video using the image encoder of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal modeling techniques to fuse the text information into video frame representations, which, however, incurs severe efficiency issues in large-scale retrieval systems, as the video representations must be recomputed online for every text query. In this paper, we discard this problematic cross-modal fusion process and aim to learn semantically enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts. Concretely, we first introduce a spatial-temporal "Prompt Cube" into the CLIP image encoder and iteratively switch it within the encoder layers to efficiently incorporate the global video semantics into frame representations. We then propose an auxiliary video captioning objective to train the frame representations, which facilitates the learning of detailed video semantics by providing fine-grained guidance in the semantic space. With a naive temporal fusion strategy (i.e., mean-pooling) over the enhanced frame representations, we obtain state-of-the-art performance on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC. (Comment: to appear in ICCV 2023)
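    The sketch below illustrates only the efficiency argument (offline, text-independent video embeddings with naive mean-pooling), not the paper's Prompt Cube or captioning objective. It assumes OpenAI's clip package is installed; the frame tensors and query string are placeholders.

        # Sketch: cache a mean-pooled video embedding offline, reuse it for every query.
        import torch, clip

        device = "cpu"
        model, preprocess = clip.load("ViT-B/32", device=device)

        # Offline: encode and pool the frames of one video once.
        frames = torch.randn(8, 3, 224, 224)              # stand-in for preprocessed frames
        with torch.no_grad():
            frame_feats = model.encode_image(frames)      # (T, D) per-frame embeddings
            video_feat = frame_feats.mean(dim=0)          # naive temporal fusion (mean-pool)
            video_feat = video_feat / video_feat.norm()

        # Online: score any text query against the cached video embedding.
        with torch.no_grad():
            text_feat = model.encode_text(clip.tokenize(["a dog catching a frisbee"]))
            text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        similarity = (text_feat @ video_feat.unsqueeze(-1)).item()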

    Dual Attention on Pyramid Feature Maps for Image Captioning

    Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. In this paper, we propose to apply dual attention on pyramid image feature maps to fully explore the visual-semantic correlations and improve the quality of generated sentences. Specifically, with full consideration of the contextual information provided by the hidden state of the RNN controller, the pyramid attention can better localize visually indicative and semantically consistent regions in images. On the other hand, the contextual information helps re-calibrate the importance of feature components by learning the channel-wise dependencies, improving the discriminative power of visual features for better content description. We conducted comprehensive experiments on three well-known datasets, Flickr8K, Flickr30K and MS COCO, and achieved impressive results in generating descriptive and smooth natural sentences from images. Using either convolutional visual features or more informative bottom-up attention features, our composite captioning model achieves very promising performance in a single-model mode. The proposed pyramid attention and dual attention methods are highly modular and can be inserted into various image captioning architectures to further improve performance. (Comment: published in IEEE Transactions on Multimedia)
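    A minimal illustrative sketch (not the authors' exact module) of the channel-wise branch of such dual attention: the RNN controller's hidden state re-calibrates the channels of a visual feature map. The reduction ratio and all dimensions are assumptions.

        # Sketch: hidden-state-conditioned channel re-calibration of a feature map.
        import torch
        import torch.nn as nn

        class ChannelRecalibration(nn.Module):
            def __init__(self, n_channels=2048, hidden_dim=512):
                super().__init__()
                self.gate = nn.Sequential(
                    nn.Linear(n_channels + hidden_dim, n_channels // 16),
                    nn.ReLU(),
                    nn.Linear(n_channels // 16, n_channels),
                    nn.Sigmoid(),
                )

            def forward(self, feat_map, hidden):       # feat_map: (B, C, H, W), hidden: (B, H_d)
                pooled = feat_map.mean(dim=(2, 3))     # global context per channel: (B, C)
                weights = self.gate(torch.cat([pooled, hidden], dim=1))   # (B, C)
                return feat_map * weights.unsqueeze(-1).unsqueeze(-1)     # re-weighted channels

        out = ChannelRecalibration()(torch.randn(2, 2048, 7, 7), torch.randn(2, 512))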

    Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation

    Multi-modal 3D scene understanding has gained considerable attention due to its wide applications in many areas, such as autonomous driving and human-computer interaction. Compared to conventional single-modal 3D understanding, introducing an additional modality not only elevates the richness and precision of scene interpretation but also ensures a more robust and resilient understanding. This becomes especially crucial in varied and challenging environments where relying solely on 3D data might be inadequate. While there has been a surge in the development of multi-modal 3D methods over the past three years, especially those integrating multi-camera images (3D+2D) and textual descriptions (3D+language), a comprehensive and in-depth review is notably absent. In this article, we present a systematic survey of recent progress to bridge this gap. We begin with a brief background that formally defines various 3D multi-modal tasks and summarizes their inherent challenges. After that, we present a novel taxonomy that delivers a thorough categorization of existing methods according to modalities and tasks, exploring their respective strengths and limitations. Furthermore, comparative results of recent approaches on several benchmark datasets, together with insightful analysis, are offered. Finally, we discuss the unresolved issues and provide several potential avenues for future research.

    Evaluating the Performance of Transformer architecture over Attention architecture on Image Captioning

    Over the last few decades, computer vision and natural language processing have shown tremendous improvement on tasks such as image captioning, video captioning and machine translation using deep learning models. However, there has been little research on transformer-based image captioning and on how it compares with other models implemented for the task. In this study we design a simple encoder-decoder model, an attention model and a transformer model for image captioning on the Flickr8K dataset, and discuss the hyperparameters of each model, the type of pre-trained model used and how long each model was trained. Furthermore, we compare the captions generated by the attention model and the transformer model using the BLEU score metric, analysed further through a human evaluation conducted with an intrinsic approach. After analysing the results of statistical tests on the BLEU scores and the human evaluation, we found that the transformer model with multi-head attention outperformed the attention model in image captioning.
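    For illustration only, the snippet below shows how two models' captions can be compared with corpus-level BLEU using NLTK; the toy references and hypotheses are made up, and the study's statistical tests and human evaluation are not reproduced.

        # Sketch: BLEU-4 comparison of captions from two hypothetical models.
        from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

        references = [[["a", "dog", "runs", "across", "the", "grass"]]]   # per-image references
        attention_caps = [["a", "dog", "running", "on", "grass"]]
        transformer_caps = [["a", "dog", "runs", "across", "a", "grassy", "field"]]

        smooth = SmoothingFunction().method1
        for name, hyps in [("attention", attention_caps), ("transformer", transformer_caps)]:
            score = corpus_bleu(references, hyps, smoothing_function=smooth)
            print(f"{name} BLEU-4: {score:.3f}")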

    Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

    Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, where the features can be concisely represented as linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representational power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets show that EMCL learns more discriminative video-and-language representations than previous methods and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing approaches either as a jointly trained layer or as an out-of-the-box inference module with no extra training, making it easy to incorporate into existing methods. (Comment: Accepted to NeurIPS 2022)
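    A hedged sketch of the general EM-over-bases idea, not the paper's exact EMCL module: alternately soft-assign features to K bases and re-estimate the bases, then represent each feature as a combination of the bases (a low-rank reconstruction). The number of bases, iterations, and temperature are assumptions.

        # Sketch: EM-style basis estimation for compact feature representations.
        import torch
        import torch.nn.functional as F

        def em_bases(features, k=8, iters=3, tau=0.05):
            """features: (N, D) pooled video/text embeddings; returns (N, D) reconstructions."""
            n, d = features.shape
            bases = features[torch.randperm(n)[:k]].clone()        # initialize bases from data
            for _ in range(iters):
                # E-step: responsibility of each basis for each feature
                resp = F.softmax(features @ bases.t() / tau, dim=1)             # (N, K)
                # M-step: bases as responsibility-weighted means of the features
                bases = (resp.t() @ features) / (resp.sum(dim=0, keepdim=True).t() + 1e-6)
                bases = F.normalize(bases, dim=1)
            # each feature re-expressed as a linear combination of the K bases
            return resp @ bases

        feats = F.normalize(torch.randn(32, 512), dim=1)
        compact = em_bases(feats)            # (32, 512), lies in the span of the K bases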