Evaluating the Performance of Transformer architecture over Attention architecture on Image Captioning
Over the last few decades, computer vision and natural language processing have shown tremendous improvement on tasks such as image captioning, video captioning, and machine translation using deep learning models. However, there has been little research on transformer-based image captioning and on whether it outperforms the other models implemented for the task. In this study we design a simple encoder-decoder model, an attention model, and a transformer model for image captioning on the Flickr8K dataset, discussing each model's hyperparameters, the type of pre-trained model used, and how long each model was trained. Furthermore, we compare the captions generated by the attention model and the transformer model using BLEU score metrics, and analyse them further through a human evaluation conducted with an intrinsic approach. After analysing the results of statistical tests on the BLEU scores and the human evaluation, we found that the transformer model with multi-head attention outperformed the attention model on image captioning.
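As a hedged illustration of the BLEU-based comparison this abstract describes, the short sketch below scores two hypothetical caption sets against shared references with NLTK's corpus-level BLEU; the captions and variable names are invented for illustration and are not the study's data.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of tokenized reference captions per image (Flickr8K provides five).
references = [
    [["a", "dog", "runs", "through", "the", "grass"],
     ["a", "brown", "dog", "is", "running", "outside"]],
]

# Hypothetical outputs from the two models being compared.
attention_hyps = [["a", "dog", "is", "running"]]
transformer_hyps = [["a", "brown", "dog", "runs", "through", "the", "grass"]]

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
for name, hyps in [("attention", attention_hyps),
                   ("transformer", transformer_hyps)]:
    score = corpus_bleu(references, hyps, smoothing_function=smooth)
    print(f"{name} BLEU-4: {score:.3f}")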
Image Captioning through Image Transformer
Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by successes in text analysis and translation, previous work has proposed the transformer architecture for image captioning. However, the structure of the semantic units in images (usually the regions detected by an object detection model) differs from that of sentences (single words), and limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks.
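The abstract does not spell out the widened layer, but one ingredient it gestures at, self-attention over detected-region features with an additive bias computed from the regions' relative geometry, could be sketched as below; the module, its shapes, and the geometry encoding are assumptions, not the published architecture.

import torch
import torch.nn as nn

class SpatiallyBiasedRegionAttention(nn.Module):
    """Illustrative self-attention over region features, biased by box geometry."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Assumed: map each pair's relative (cx, cy, w, h) offsets to a logit bias.
        self.geo = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) detected-region features; boxes: (B, N, 4).
        rel = boxes.unsqueeze(2) - boxes.unsqueeze(1)          # (B, N, N, 4)
        bias = self.geo(rel).squeeze(-1)                       # (B, N, N)
        bias = bias.repeat_interleave(self.attn.num_heads, 0)  # one copy per head
        out, _ = self.attn(feats, feats, feats, attn_mask=bias)
        return out

regions = torch.randn(2, 36, 512)   # e.g. 36 detected regions per image
boxes = torch.rand(2, 36, 4)
out = SpatiallyBiasedRegionAttention()(regions, boxes)  # (2, 36, 512)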
Image Captioning with Context-Aware Auxiliary Guidance
Image captioning is a challenging computer vision task which aims to generate a natural language description of an image. Most recent approaches follow the encoder-decoder framework, which depends heavily on the previously generated words for the current prediction. Such methods cannot effectively take advantage of future predicted information to learn complete semantics. In this paper, we propose a Context-Aware Auxiliary Guidance (CAAG) mechanism that can guide the captioning model to perceive global contexts. On top of the captioning model, CAAG performs semantic attention that selectively concentrates on useful information in the global predictions to reproduce the current generation. To validate the adaptability of the method, we apply CAAG to three popular captioners, and our proposal achieves competitive performance on the challenging Microsoft COCO image captioning benchmark, e.g. a 132.2 CIDEr-D score on the Karpathy split and a 130.7 CIDEr-D (c40) score on the official online evaluation server.
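A hedged reading of the semantic attention described above, in which the current decoding state attends over embeddings of the globally predicted caption, might look as follows; the function name and shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def semantic_attention(h_t: torch.Tensor, global_emb: torch.Tensor):
    # h_t: (B, D) decoder state at the current step (assumed shapes).
    # global_emb: (B, T, D) embeddings of the globally predicted caption.
    scores = torch.bmm(global_emb, h_t.unsqueeze(-1)).squeeze(-1)     # (B, T)
    weights = F.softmax(scores, dim=-1)
    context = torch.bmm(weights.unsqueeze(1), global_emb).squeeze(1)  # (B, D)
    return context, weights  # context guides reproducing the current word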
Synchronized audio-visual frames with fractional positional encoding for transformers in video-to-text translation
Video-to-Text (VTT) is the task of automatically generating descriptions for short audio-visual video clips, which can, for instance, help visually impaired people understand the scenes of a YouTube video. Transformer architectures have shown great performance in both machine translation and image captioning, but a straightforward and reproducible application to VTT has been lacking. Moreover, there is no comprehensive study of different strategies for video description generation, including exploiting the accompanying audio with fully self-attentive networks. We therefore explore promising approaches from image captioning and video processing and apply them to VTT by developing a straightforward Transformer architecture. Additionally, we present a novel way of synchronizing audio and video features in Transformers, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset to determine a configuration that transfers to unseen datasets, helps describe short video clips in natural language, improves the CIDEr and BLEU-4 scores by 37.13 and 12.83 points over a vanilla Transformer network, and achieves state-of-the-art results on the MSR-VTT and MSVD datasets. FPE alone increases the CIDEr score by a relative 8.6%.
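The abstract names FPE without defining it; assuming it evaluates the standard sinusoidal encoding at real-valued positions, so that audio frames sampled at a different rate can be placed on the video's time axis, a minimal sketch could be:

import torch

def fractional_positional_encoding(positions: torch.Tensor, dim: int) -> torch.Tensor:
    # positions: (T,) float positions, e.g. audio frame times rescaled to
    # video-frame units (an assumption); dim: model dimension (even).
    i = torch.arange(0, dim, 2, dtype=torch.float32)
    div = torch.exp(-torch.log(torch.tensor(10000.0)) * i / dim)   # (dim/2,)
    angles = positions.unsqueeze(-1) * div                         # (T, dim/2)
    pe = torch.zeros(positions.size(0), dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# 100 audio frames spread over 30 video-frame positions:
audio_pos = torch.linspace(0, 29, steps=100)
pe = fractional_positional_encoding(audio_pos, dim=512)  # (100, 512)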
Vision Transformer Based Model for Describing a Set of Images as a Story
Visual storytelling is the process of forming a multi-sentence story from a set of images. Appropriately including the visual variation and contextual information captured in the input images is one of the most challenging aspects of visual storytelling. Consequently, stories developed from a set of images often lack cohesiveness, relevance, and semantic relationships. In this paper, we propose a novel Vision Transformer based model for describing a set of images as a story. The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT). First, input images are divided into 16×16 patches and bundled into a linear projection of flattened patches. The transformation from a single image to multiple image patches captures the visual variety of the input visual patterns. These features are used as input to a bidirectional LSTM, which is part of the sequence encoder and captures the past and future context of all image patches. Then, an attention mechanism is implemented and used to increase the discriminatory capacity of the data fed into the language model, i.e. a Mogrifier-LSTM. The performance of our proposed model is evaluated using the Visual Story-Telling dataset (VIST), and the results show that our model outperforms the current state-of-the-art models.
Comment: This paper has been accepted at the 35th Australasian Joint Conference on Artificial Intelligence 2022 (camera-ready version attached).
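A minimal sketch of the ViT front end this abstract describes, splitting an image into 16×16 patches and linearly projecting each flattened patch, is given below; the stride-16 convolution is the standard equivalent of that flatten-and-project step, and the dimensions are assumed.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and project each to a token."""

    def __init__(self, patch: int = 16, in_ch: int = 3, dim: int = 768):
        super().__init__()
        # A stride-`patch` convolution equals flattening non-overlapping
        # patches and applying one shared linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) with H and W divisible by the patch size.
        x = self.proj(x)                     # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))  # (1, 196, 768)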
Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning
While impressive performance has been achieved in image captioning, the limited diversity of the generated captions and the large parameter scale remain major barriers to the real-world application of these systems. In this work, we propose a lightweight image captioning network combined with continuous diffusion, called Prefix-diffusion. To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model. To reduce the number of trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network. Prefix-diffusion is able to generate diverse captions with relatively few parameters, while maintaining the fluency and relevance of the captions, benefiting from the generative capabilities of the diffusion model. Our work paves the way for scaling up diffusion models for image captioning, and achieves promising performance compared with recent approaches.
Comment: 11 pages, 4 figures, 6 tables.
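As a hedged sketch of the prefix-injection idea above, a mapping network could turn a frozen image feature into a short prefix that is prepended to the noised caption latents at every denoising step; all names, dimensions, and the prefix length here are assumptions for illustration.

import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Illustrative mapping network: image feature -> short prefix sequence."""

    def __init__(self, img_dim: int = 512, dim: int = 768, prefix_len: int = 10):
        super().__init__()
        self.prefix_len, self.dim = prefix_len, dim
        self.map = nn.Sequential(nn.Linear(img_dim, dim * prefix_len), nn.Tanh())

    def forward(self, img_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, img_dim) from a frozen pre-trained image encoder.
        return self.map(img_feat).view(-1, self.prefix_len, self.dim)

def denoiser_input(noised_caption_emb: torch.Tensor,
                   img_feat: torch.Tensor,
                   mapper: PrefixMapper) -> torch.Tensor:
    # noised_caption_emb: (B, T, dim) continuous caption latents at step t.
    prefix = mapper(img_feat)                        # (B, prefix_len, dim)
    return torch.cat([prefix, noised_caption_emb], dim=1)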
Efficient Modeling of Future Context for Image Captioning
Existing approaches to image captioning usually generate the sentence word-by-word from left to right, conditioned only on local context, namely the given image and the previously generated words. Many studies have aimed to exploit global information during decoding, e.g., via iterative refinement, but it remains under-explored how to effectively and efficiently incorporate future context. Inspired by the fact that Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations through a modified mask operation, we aim to graft this advance onto a conventional Autoregressive Image Captioning (AIC) model while maintaining inference efficiency with no extra time cost. Specifically, the AIC and NAIC models are first trained jointly with a shared visual encoder, forcing the visual encoder to contain sufficient and valid future context; the AIC model is then encouraged to capture the causal dynamics of cross-layer interchanging from the NAIC model on its unconfident words, following a teacher-student paradigm optimized with a distribution calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluations on the MS COCO benchmark. The source code is available at:
https://github.com/feizc/Future-Caption.
Comment: ACM Multimedia 2022.
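A hedged sketch of the teacher-student step described above, in which the autoregressive student matches the non-autoregressive teacher's distribution only on tokens where the student is unconfident, might look as follows; the 0.5 confidence threshold and function name are assumptions.

import torch
import torch.nn.functional as F

def calibrated_distill_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            threshold: float = 0.5) -> torch.Tensor:
    # logits: (B, T, V); distill only where the student's max prob is low.
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_p = F.softmax(teacher_logits, dim=-1).detach()
    unconfident = student_logp.exp().max(dim=-1).values < threshold   # (B, T)
    kl = F.kl_div(student_logp, teacher_p, reduction="none").sum(-1)  # (B, T)
    mask = unconfident.float()
    return (kl * mask).sum() / mask.sum().clamp(min=1)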
Rethinking the Reference-based Distinctive Image Captioning
Distinctive Image Captioning (DIC) -- generating distinctive captions that describe the unique details of a target image -- has received considerable attention over the last few years. A recent DIC approach proposes to generate distinctive captions by comparing the target image with a set of semantically similar reference images, i.e., reference-based DIC (Ref-DIC), so that the generated captions can tell the target and reference images apart. Unfortunately, the reference images used by existing Ref-DIC works are easy to distinguish: they resemble the target image only at the scene level and share few common objects, so a Ref-DIC model can trivially generate distinctive captions even without considering the reference images. To ensure that Ref-DIC models really perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism that strictly controls the similarity between the target and reference images at the object/attribute level (vs. the scene level). Secondly, to generate distinctive captions, we develop a strong Transformer-based Ref-DIC baseline, dubbed TransDIC. It not only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Finally, for more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of the generated captions. Experimental results demonstrate that TransDIC can generate distinctive captions and outperforms several state-of-the-art
models on the two new benchmarks across different metrics.
Comment: ACM MM 2022.
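The difference-encoding step attributed to TransDIC could, as a rough sketch under heavy assumptions, pair each target object feature with its offset from the pooled reference features; this is illustrative only, not the published model.

import torch

def difference_features(target: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    # target: (N, D) object features of the target image (assumed shapes).
    # reference: (M, D) object features pooled over the reference images.
    ref_mean = reference.mean(dim=0, keepdim=True)          # (1, D)
    return torch.cat([target, target - ref_mean], dim=-1)   # (N, 2D)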