30 research outputs found

    Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning

    Full text link
    Video captioning is widely regarded as a fundamental but challenging task in both computer vision and artificial intelligence. The prevalent approach maps an input video to a variable-length output sentence in a sequence-to-sequence manner via a Recurrent Neural Network (RNN). Nevertheless, RNN training still suffers to some degree from the vanishing/exploding gradient problem, making optimization difficult. Moreover, the inherently recurrent dependency in RNNs prevents parallelization within a sequence during training and therefore limits computation. In this paper, we present a novel design, Temporal Deformable Convolutional Encoder-Decoder Networks (dubbed TDConvED), that fully employs convolutions in both the encoder and decoder networks for video captioning. Technically, we exploit convolutional block structures that compute intermediate states over a fixed number of inputs and stack several blocks to capture long-term relationships. The structure in the encoder is further equipped with temporal deformable convolution to enable free-form deformation of temporal sampling. Our model also capitalizes on a temporal attention mechanism for sentence generation. Extensive experiments are conducted on both the MSVD and MSR-VTT video captioning datasets, and superior results are reported compared to conventional RNN-based encoder-decoder techniques. Most remarkably, TDConvED increases CIDEr-D performance on MSVD from 58.8% to 67.2%. Comment: AAAI 2019
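
    The abstract does not spell out how temporal deformable convolution is computed; below is a minimal PyTorch sketch of the general idea (assumed shapes and module names, not the authors' implementation): an offset branch predicts per-tap, per-step shifts along the time axis, features are re-sampled at the deformed positions by linear interpolation, and a pointwise convolution fuses the re-sampled taps.

```python
import torch
import torch.nn as nn

class TemporalDeformableConv(nn.Module):
    """Sketch of a temporal deformable 1D convolution (assumed design, not TDConvED's code)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        # Offset branch: one free-form temporal shift per kernel tap and time step.
        self.offset_conv = nn.Conv1d(channels, kernel_size, kernel_size,
                                     padding=kernel_size // 2)
        # Pointwise convolution that fuses the k re-sampled taps back to `channels`.
        self.fuse = nn.Conv1d(channels * kernel_size, channels, 1)

    def _sample(self, x, idx):
        # Gather features at integer time indices: x (b, c, t), idx (b, k, t) -> (b, c, k, t).
        b, c, t = x.shape
        flat = idx.unsqueeze(1).expand(b, c, self.k, t).reshape(b, c, self.k * t)
        return x.gather(2, flat).view(b, c, self.k, t)

    def forward(self, x):                                        # x: (batch, channels, time)
        b, c, t = x.shape
        offsets = self.offset_conv(x)                            # (b, k, t) temporal shifts
        base = torch.arange(t, device=x.device, dtype=x.dtype)
        taps = torch.arange(self.k, device=x.device, dtype=x.dtype) - self.k // 2
        # Deformed sampling positions for every tap and time step, clamped to the clip length.
        pos = (base.view(1, 1, t) + taps.view(1, self.k, 1) + offsets).clamp(0, t - 1)
        frac = (pos - pos.floor()).unsqueeze(1)                  # linear-interpolation weights
        sampled = (self._sample(x, pos.floor().long()) * (1 - frac)
                   + self._sample(x, pos.ceil().long()) * frac)
        return self.fuse(sampled.reshape(b, c * self.k, t))      # (b, channels, time)

# e.g. TemporalDeformableConv(512)(torch.randn(2, 512, 26)) has shape (2, 512, 26)
```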

    Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

    Full text link
    We address the problem of video representation learning without human-annotated labels. While previous efforts tackle this problem by designing novel self-supervised tasks on video data, the learned features are merely frame-level, which limits their use in the many video analysis tasks where spatio-temporal features prevail. In this paper, we propose a novel self-supervised approach to learning spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along the spatial and temporal dimensions, given only the input video data. Specifically, we extract statistical concepts (the fast-motion region and its dominant direction, spatio-temporal color diversity, dominant color, etc.) from simple patterns in both the spatial and temporal domains. Unlike prior puzzle-style tasks that are hard even for humans to solve, the proposed tasks are consistent with innate human visual habits and therefore easy to answer. We conduct extensive experiments with C3D to validate the effectiveness of the proposed approach. The experiments show that our approach can significantly improve the performance of C3D on video classification tasks. Code is available at https://github.com/laura-wang/video_repres_mas. Comment: CVPR 2019
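
    The pretext task reduces to "compute cheap statistics from the raw clip, then train the backbone to regress them." The sketch below illustrates that loop with deliberately simplified stand-in targets (block-wise motion magnitude and mean color) and a tiny 3D-conv encoder in place of C3D; it is an assumption-laden illustration, not the released code at https://github.com/laura-wang/video_repres_mas.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def statistics_targets(clip, grid=4):
    """clip: (batch, 3, frames, H, W) in [0, 1] -> (batch, grid*grid + 3) regression targets."""
    # Motion statistic: mean absolute frame difference per spatial block
    # (a crude proxy for the paper's fast-motion region / dominant-direction statistics).
    motion = (clip[:, :, 1:] - clip[:, :, :-1]).abs().mean(dim=(1, 2))   # (b, H, W)
    motion_blocks = F.adaptive_avg_pool2d(motion, grid).flatten(1)       # (b, grid*grid)
    # Appearance statistic: dominant (here: mean) color over the whole clip.
    dominant_color = clip.mean(dim=(2, 3, 4))                            # (b, 3)
    return torch.cat([motion_blocks, dominant_color], dim=1)

class StatsRegressor(nn.Module):
    """Tiny 3D-conv encoder (a stand-in for C3D) with a statistics-regression head."""
    def __init__(self, out_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.head = nn.Linear(64, out_dim)

    def forward(self, clip):
        return self.head(self.encoder(clip))

clip = torch.rand(2, 3, 16, 112, 112)           # a toy batch of clips
targets = statistics_targets(clip)              # self-supervised labels, no annotation needed
model = StatsRegressor(out_dim=targets.shape[1])
loss = F.mse_loss(model(clip), targets)         # regress motion + appearance statistics
loss.backward()
```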

    TAPER-WE: Transformer-Based Model Attention with Relative Position Encoding and Word Embedding for Video Captioning and Summarization in Dense Environment

    Get PDF
    In the era of burgeoning digital content, automated video captioning and summarization in dense environments have become increasingly critical. This paper introduces TAPER-WE, a methodology that improves both tasks by combining a Transformer-based model with Relative Position Encoding and Word Embedding. By harnessing the contextual understanding of Transformers, TAPER-WE generates descriptive, contextually coherent captions for video frames and provides an effective summarization mechanism that condenses lengthy videos into concise, informative summaries. A key component is the use of Relative Position Encoding, which lets the model capture temporal relationships within video sequences and fosters accurate alignment between video frames and generated captions, resulting in higher captioning quality. Word Embedding, in turn, strengthens the model's grasp of semantics, so that the captions and summaries it produces are both coherent and linguistically rich. To validate the approach, we conducted extensive experiments on benchmark datasets, observing significant improvements in captioning accuracy and summarization quality over existing methods. TAPER-WE achieves state-of-the-art performance while generalizing across a wide range of video content, addressing the growing need for efficient content understanding in the digital age.
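
    The two ingredients the abstract highlights can be illustrated compactly: learned word embeddings for caption tokens, and attention scores augmented with a learned bias indexed by the relative distance between positions (a common form of relative position encoding). The PyTorch sketch below uses assumed shapes, names, and a single attention head, and omits causal masking; it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head self-attention with a learned relative-position bias added to the scores."""
    def __init__(self, dim, max_rel_dist=64):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.max_rel = max_rel_dist
        self.rel_bias = nn.Embedding(2 * max_rel_dist + 1, 1)  # one bias per clipped distance

    def forward(self, x):                                       # x: (batch, seq, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / d ** 0.5             # (b, n, n)
        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel, self.max_rel) + self.max_rel
        scores = scores + self.rel_bias(rel).squeeze(-1)        # add relative-position bias
        return F.softmax(scores, dim=-1) @ v

class TinyCaptioner(nn.Module):
    """Word embedding + relative-position attention over concatenated frame and token features."""
    def __init__(self, vocab_size, dim=256, frame_dim=2048):    # frame_dim is an assumption
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)              # word embedding for caption tokens
        self.frame_proj = nn.Linear(frame_dim, dim)
        self.attn = RelativeSelfAttention(dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, frame_feats, tokens):                     # (b, T, frame_dim), (b, L)
        x = torch.cat([self.frame_proj(frame_feats), self.embed(tokens)], dim=1)
        # Return vocabulary logits for the caption positions only.
        return self.out(self.attn(x))[:, frame_feats.size(1):]

# e.g. TinyCaptioner(10000)(torch.randn(2, 8, 2048), torch.randint(0, 10000, (2, 12)))
```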