12,097 research outputs found
Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
Recent progress has been made in using attention-based encoder-decoder
frameworks for video captioning. However, most existing decoders apply the
attention mechanism to every generated word, including both visual words (e.g.,
"gun" and "shooting") and non-visual words (e.g., "the", "a"). Yet these
non-visual words can be predicted reliably by a language model alone, without
considering visual signals or attention. Imposing the attention mechanism on
non-visual words can mislead the decoder and degrade the overall performance of video
captioning. To address this issue, we propose a hierarchical LSTM with adjusted
temporal attention (hLSTMat) approach for video captioning. Specifically, the
proposed framework utilizes the temporal attention for selecting specific
frames to predict the related words, while the adjusted temporal attention is
for deciding whether to depend on the visual information or the language
context information. In addition, hierarchical LSTMs are designed to simultaneously
consider both low-level visual information and high-level language context
information to support the video caption generation. To demonstrate the
effectiveness of our proposed framework, we test our method on two prevalent
datasets: MSVD and MSR-VTT, and experimental results show that our approach
outperforms the state-of-the-art methods on both datasets.
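As an illustration of the adjusted temporal attention described in this abstract, the sketch below shows one plausible way such a mechanism could be written in PyTorch: temporal attention selects frames, and a learned gate decides how much the word prediction relies on the attended visual context versus the language context. This is not the authors' code; the module name, layer sizes, and gating formulation are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjustedTemporalAttention(nn.Module):
    """Hypothetical sketch of temporal attention with a visual/language gate."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        self.gate = nn.Linear(hidden_dim, 1)          # beta: rely on vision vs. language
        self.visual_out = nn.Linear(feat_dim, hidden_dim)

    def forward(self, frame_feats, h_lang):
        # frame_feats: (B, T, feat_dim) per-frame CNN features
        # h_lang:      (B, hidden_dim) hidden state of the language LSTM
        e = self.score(torch.tanh(
            self.feat_proj(frame_feats) + self.hidden_proj(h_lang).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)                    # temporal attention over frames
        visual_ctx = (alpha * frame_feats).sum(dim=1)  # attended visual context
        beta = torch.sigmoid(self.gate(h_lang))        # ~1 -> visual word, ~0 -> non-visual word
        ctx = beta * self.visual_out(visual_ctx) + (1 - beta) * h_lang
        return ctx, alpha, beta                        # context used to predict the next word
```

For a non-visual word such as "the", the gate beta can stay near zero so the prediction is driven by the language context alone, which is the behaviour the abstract argues for.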
Areas of Attention for Image Captioning
We propose "Areas of Attention", a novel attention-based model for automatic
image captioning. Our approach models the dependencies between image regions,
caption words, and the state of an RNN language model, using three pairwise
interactions. In contrast to previous attention-based approaches that associate
image regions only to the RNN state, our method allows a direct association
between caption words and image regions. During training these associations are
inferred from image-level captions, akin to weakly-supervised object detector
training. These associations help to improve captioning by localizing the
corresponding regions during testing. We also propose and compare different
ways of generating attention areas: CNN activation grids, object proposals, and
spatial transformer networks applied in a convolutional fashion. Spatial
transformers give the best results. They allow for image-specific attention
areas, and can be trained jointly with the rest of the network. Our attention
mechanism and spatial transformer attention areas together yield
state-of-the-art results on the MSCOCO dataset.
Comment: Accepted in ICCV 2017
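The following sketch illustrates the three pairwise interactions mentioned in the abstract (region-state, word-state, and region-word) as bilinear score terms over a joint (region, word) score matrix. The parameterization and the way the scores are marginalized here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PairwiseAttentionScores(nn.Module):
    """Hypothetical joint scoring of caption words and image regions."""
    def __init__(self, region_dim, word_dim, state_dim):
        super().__init__()
        self.W_rs = nn.Parameter(torch.randn(region_dim, state_dim) * 0.01)  # region-state
        self.W_ws = nn.Parameter(torch.randn(word_dim, state_dim) * 0.01)    # word-state
        self.W_rw = nn.Parameter(torch.randn(region_dim, word_dim) * 0.01)   # region-word

    def forward(self, regions, words, state):
        # regions: (R, region_dim), words: (V, word_dim), state: (state_dim,)
        rs = regions @ self.W_rs @ state                 # (R,)   region-state interaction
        ws = words @ self.W_ws @ state                   # (V,)   word-state interaction
        rw = regions @ self.W_rw @ words.t()             # (R, V) direct region-word interaction
        scores = rw + rs.unsqueeze(1) + ws.unsqueeze(0)  # joint (region, word) scores
        word_logits = torch.logsumexp(scores, dim=0)     # marginalize regions -> word scores
        region_attn = torch.softmax(scores.logsumexp(dim=1), dim=0)  # region attention
        return word_logits, region_attn
```

The direct region-word term is what distinguishes this family of models from attention mechanisms that only couple image regions to the RNN state.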
Architectures of Leading Image Captioning Systems
Image captioning is the task of generating a natural language description of an image. The task requires techniques from two research areas, computer vision and natural language generation.
This thesis investigates the architectures of leading image captioning systems. The research question is: What components and architectures are used in state-of-the-art image captioning systems and how could image captioning systems be further improved by utilizing improved components and architectures? Five openly reported leading image captioning systems are investigated in detail: Attention on Attention, the Meshed-Memory Transformer, the X-Linear Attention Network, the Show, Edit and Tell method, and Prophet Attention.
The investigated leading image captioners all rely on the same object detector, the Faster R-CNN-based Bottom-Up object detection network. Four out of five also rely on the same backbone convolutional neural network, ResNet-101. Both the backbone and the object detector could be improved by using newer approaches. The best choice among CNN-based object detectors is EfficientDet with an EfficientNet backbone. A completely transformer-based approach with a Vision Transformer backbone and a Detection Transformer object detector is a fast-developing alternative. The main area of variation between the leading image captioners is in the types of attention blocks used in the high-level image encoder, the type of natural language decoder, and the connections between these components. The best architectures and attention approaches for implementing these components are currently the Meshed-Memory Transformer and the bilinear pooling approach of the X-Linear Attention Network. Implementing the Prophet Attention approach of using the future words available in the supervised training phase to guide the decoder attention further improves performance.
Pretraining the backbone on large image datasets is essential for reaching semantically correct object detections and object features. The feature richness and dense annotation of the data are equally important in training the object detector.
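To make the attention-block discussion above concrete, the sketch below shows a simplified bilinear-pooling attention block in the spirit of the X-Linear Attention Network: queries and keys are fused by an elementwise product before spatial and channel attention are applied. Layer names, sizes, and the exact pooling are assumptions for illustration; the original paper's formulation differs in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttention(nn.Module):
    """Simplified, hypothetical X-Linear-style attention block."""
    def __init__(self, dim, mid_dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, mid_dim)
        self.k_proj = nn.Linear(dim, mid_dim)
        self.v_proj = nn.Linear(dim, mid_dim)
        self.spatial = nn.Linear(mid_dim, 1)        # spatial attention logits
        self.channel = nn.Linear(mid_dim, mid_dim)  # channel-wise gating
        self.out = nn.Linear(mid_dim, dim)

    def forward(self, query, feats):
        # query: (B, dim) decoder state; feats: (B, N, dim) detected-region features
        q = F.elu(self.q_proj(query)).unsqueeze(1)     # (B, 1, mid)
        k = F.elu(self.k_proj(feats))                  # (B, N, mid)
        v = F.elu(self.v_proj(feats))                  # (B, N, mid)
        joint = q * k                                  # bilinear (elementwise) fusion
        alpha = F.softmax(self.spatial(joint), dim=1)  # spatial attention over regions
        pooled = (alpha * joint).mean(dim=1)           # squeeze to a channel descriptor
        gate = torch.sigmoid(self.channel(pooled))     # channel attention
        attended = (alpha * v).sum(dim=1) * gate       # apply both attentions
        return self.out(attended)                      # (B, dim)
```

In the captioning pipelines surveyed here, blocks like this sit in the high-level image encoder, on top of the Bottom-Up object detector's region features.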
Evaluating the Performance of Transformer architecture over Attention architecture on Image Captioning
Over the last few decades, computer vision and natural language processing have shown tremendous improvement on tasks such as image captioning, video captioning, and machine translation using deep learning models. However, there has been little research on transformer-based image captioning and how it compares with other models implemented for the task. In this study we design a simple encoder-decoder model, an attention model, and a transformer model for image captioning on the Flickr8K dataset, discussing the hyperparameters of each model, the type of pre-trained model used, and how long each model was trained. Furthermore, we compare the captions generated by the attention model and the transformer model using BLEU score metrics, which are further analysed through a human evaluation conducted with an intrinsic approach. After analysing the results of statistical tests conducted on the BLEU scores and the human evaluation, we found that the transformer model with multi-head attention outperformed the attention model in image captioning.
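As a hedged illustration of the BLEU-based comparison described above, the snippet below scores two sets of candidate captions against shared references with NLTK's corpus BLEU; the tokenized captions are made-up placeholders, not data from the study.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One entry per image: a list of tokenized reference captions.
references = [
    [["a", "dog", "runs", "across", "the", "grass"],
     ["a", "brown", "dog", "running", "on", "grass"]],
]
attention_hyps   = [["a", "dog", "is", "running", "on", "the", "grass"]]
transformer_hyps = [["a", "brown", "dog", "runs", "across", "the", "grass"]]

smooth = SmoothingFunction().method1
for name, hyps in [("attention", attention_hyps), ("transformer", transformer_hyps)]:
    for n in range(1, 5):
        # BLEU-n uses uniform weights over the 1..n-gram precisions.
        weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
        score = corpus_bleu(references, hyps, weights=weights,
                            smoothing_function=smooth)
        print(f"{name} BLEU-{n}: {score:.3f}")
```

A paired statistical test over per-image scores, as the study describes, would then decide whether the difference between the two models is significant.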
Image Captioning based on Feature Refinement and Reflective Decoding
Image captioning is the process of automatically generating a description of
an image in natural language. Image captioning is one of the significant
challenges in image understanding since it requires not only recognizing
salient objects in the image but also their attributes and the way they
interact. The system must then generate a syntactically and semantically
correct caption that describes the image content in natural language. With the
significant progress in deep learning models and their ability to effectively
encode large sets of images and generate correct sentences, several
neural-based captioning approaches have been proposed recently, each trying to
achieve better accuracy and caption quality. This paper introduces an
encoder-decoder-based image captioning system in which the encoder extracts
spatial features from the image using ResNet-101. This stage is followed by a
refining model, which uses an attention-on-attention mechanism to extract the
visual features of the target image objects, then determine their interactions.
The decoder consists of an attention-based recurrent module and a reflective
attention module, which collaboratively apply attention to the visual and
textual features to enhance the decoder's ability to model long-term sequential
dependencies. Extensive experiments performed on Flickr30K show the
effectiveness of the proposed approach and the high quality of the generated
captions.
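The attention-on-attention idea used in the refining model above can be sketched as a standard attention step whose output is gated by a second, learned signal of how relevant that output actually is to the query. The module below is a minimal, hypothetical PyTorch version; layer names and sizes are assumptions, and the paper's exact design may differ.

```python
import torch
import torch.nn as nn

class AttentionOnAttention(nn.Module):
    """Minimal sketch of an attention-on-attention (AoA) block."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.info = nn.Linear(2 * dim, dim)  # candidate information vector
        self.gate = nn.Linear(2 * dim, dim)  # second attention: gate over the candidate

    def forward(self, query, feats):
        # query: (B, Q, dim) decoder/query vectors; feats: (B, N, dim) visual features
        v, _ = self.attn(query, feats, feats)   # conventional attended result
        qv = torch.cat([query, v], dim=-1)      # (B, Q, 2*dim)
        i = self.info(qv)                       # information vector
        g = torch.sigmoid(self.gate(qv))        # "attention on attention" gate
        return g * i                            # filtered attended information
```

Gating the attended vector this way lets the model suppress attention results that are poorly related to the query, which is the usual motivation for attention-on-attention refinement.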
The effects of captioning videos used for foreign language listening activities
This study investigated the effects of captioning during video-based listening activities. Second- and fourth-year learners of Arabic, Chinese, Spanish, and Russian watched three short videos with and without captioning in randomized order. Spanish learners had two additional groups: one watched the videos twice with no captioning, and another watched them twice with captioning. After the second showing of the video, learners took comprehension and vocabulary tests based on the video. Twenty-six learners participated in interviews following the experiment. They were asked about their general reactions to the videos (captioned and noncaptioned). Results from t-tests and two-way ANOVAs indicated that captioning was more effective than no captioning. Captioning during the first showing of the videos was more effective for performance on aural vocabulary tests. For Spanish and Russian, captioning first was generally more effective than captioning second, while for Arabic and Chinese, there was a trend toward captioning second being more effective. The interview data revealed that learners used captions to increase their attention, improve processing, reinforce previous knowledge, and analyze language. Learners also reported using captions as a crutch.