11 research outputs found

    Image Captioning through Image Transformer

    Get PDF
    Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous works have proposed the transformer architecture for image captioning. However, the structure of the semantic units in images (usually the regions detected by an object detection model) differs from that of sentences (individual words), and limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks. The code is available at https://github.com/wtliao/ImageTransformer
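
    As an illustration of the kind of spatial-relationship-aware attention this abstract describes, the sketch below biases region-to-region attention logits with pairwise relative box geometry. It is a minimal PyTorch sketch under assumed tensor shapes and an assumed bias form, not the authors' implementation (see the linked repository for that).

```python
# Illustrative sketch: self-attention over detected region features in which
# the attention logits are biased by pairwise relative box geometry.
# Tensor names and the exact bias form are assumptions, not the paper's code.
import torch
import torch.nn as nn

class RegionSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Small MLP mapping 4-D relative geometry (dx, dy, dw, dh) to a scalar bias.
        self.geo_bias = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, regions: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # regions: (B, N, dim) region features; boxes: (B, N, 4) as (cx, cy, w, h)
        cx, cy, w, h = boxes.unbind(-1)
        rel = torch.stack([
            cx.unsqueeze(2) - cx.unsqueeze(1),           # dx between every region pair
            cy.unsqueeze(2) - cy.unsqueeze(1),           # dy
            torch.log(w.unsqueeze(2) / w.unsqueeze(1)),  # relative width
            torch.log(h.unsqueeze(2) / h.unsqueeze(1)),  # relative height
        ], dim=-1)                                       # (B, N, N, 4)
        bias = self.geo_bias(rel).squeeze(-1)            # (B, N, N) additive attention bias
        bias = bias.repeat_interleave(self.attn.num_heads, dim=0)  # (B*heads, N, N)
        out, _ = self.attn(regions, regions, regions, attn_mask=bias)
        return out
```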

    Channel and spatial attention mechanism for fashion image captioning

    Get PDF
    Image captioning aims to automatically generate one or more description sentences for a given input image. Most existing captioning methods use an encoder-decoder model that mainly focuses on recognizing and capturing the relationships between objects appearing in the input image. However, when generating captions for fashion images, it is important not only to describe the items and their relationships, but also to mention attribute features of the clothes (shape, texture, style, fabric, and more). In this study, a novel model is proposed for the fashion image captioning task that captures not only the items and their relationships, but also their attribute features. Two different attention mechanisms (spatial attention and channel-wise attention) are incorporated into the traditional encoder-decoder model, which dynamically interprets the caption sentence in the multi-layer feature map as well as along the depth dimension of the feature map. We evaluate our proposed architecture on Fashion-Gen using three different metrics (CIDEr, ROUGE-L, and BLEU-1), achieving scores of 89.7, 50.6 and 45.6, respectively. Based on the experiments, our proposed method shows significant performance improvement on the task of fashion image captioning and outperforms other state-of-the-art image captioning methods.
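
    A minimal PyTorch sketch of the two attention types combined in this work: channel-wise attention re-weights feature-map channels (which tend to encode attributes such as texture or fabric), and spatial attention re-weights locations. Layer sizes and gating functions are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of channel-wise + spatial attention over a CNN feature map.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int, hidden: int = 512):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(), nn.Linear(hidden, channels))
        self.spatial_conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) CNN feature map of the fashion image
        b, c, h, w = feats.shape
        # Channel attention: squeeze spatial dims, score each channel.
        chan = torch.sigmoid(self.channel_fc(feats.mean(dim=(2, 3))))  # (B, C)
        feats = feats * chan.view(b, c, 1, 1)
        # Spatial attention: score each location over the re-weighted map.
        spat = torch.softmax(self.spatial_conv(feats).view(b, -1), dim=-1)
        return feats * spat.view(b, 1, h, w)
```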

    AMENet: Attentive Maps Encoder Network for Trajectory Prediction

    Get PDF
    Trajectory prediction is critical for planning safe future movements and remains challenging even for the next few seconds in urban mixed traffic. How an agent moves is affected by the various behaviors of its neighboring agents in different environments. To predict movements, we propose an end-to-end generative model named Attentive Maps Encoder Network (AMENet) that encodes the agent's motion and interaction information for accurate and realistic multi-path trajectory prediction. A conditional variational auto-encoder module is trained to learn the latent space of possible future paths based on attentive dynamic maps for interaction modeling, and is then used to predict multiple plausible future trajectories conditioned on the observed past trajectories. The efficacy of AMENet is validated on two public trajectory prediction benchmarks, Trajnet and InD. (Accepted by the ISPRS Journal of Photogrammetry and Remote Sensing.)
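
    The following is a rough PyTorch sketch of the conditional-VAE idea behind such multi-path prediction: encode the observed past, draw several latent samples, and decode one plausible future per sample. It omits the attentive dynamic maps and the training-time posterior encoder, and all names and sizes are illustrative assumptions rather than AMENet's architecture.

```python
# Rough sketch: sample multiple plausible futures from a latent space
# conditioned on the observed past trajectory.
import torch
import torch.nn as nn

class TrajectoryCVAE(nn.Module):
    def __init__(self, obs_len=8, pred_len=12, hidden=64, z_dim=16):
        super().__init__()
        self.pred_len = pred_len
        self.past_enc = nn.GRU(2, hidden, batch_first=True)   # (x, y) per time step
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        self.decoder = nn.Sequential(
            nn.Linear(hidden + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, pred_len * 2))

    def forward(self, past: torch.Tensor, n_samples: int = 20):
        # past: (B, obs_len, 2) observed positions
        _, h = self.past_enc(past)                   # h: (1, B, hidden)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        futures = []
        for _ in range(n_samples):
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
            out = self.decoder(torch.cat([h, z], dim=-1))
            futures.append(out.view(-1, self.pred_len, 2))
        return torch.stack(futures, dim=1)           # (B, n_samples, pred_len, 2)
```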

    Attention in Computer Vision

    Get PDF
    Thanks to deep learning, computer vision has advanced by a large margin. The attention mechanism, inspired by the human visual system, acts as a versatile module that is widely applied in current deep computer vision models and strengthens their power. However, most attention models are trained end-to-end. Why and how do those attention models work? How similar is the trained attention to the human attention that inspired it? These questions remain open, which hinders us from designing better attention models, architectures, or algorithms that could further advance the computer vision field. In this thesis, we aim to unravel these mysterious attention models by studying attention mechanisms in computer vision during the deep learning era.

    In the first part of this thesis, we study bottom-up attention. Under the umbrella of saliency prediction, bottom-up attention has progressed a lot with the help of deep learning. However, deep saliency models are still a black box to us and their performance has reached a ceiling. Therefore, the first part of this thesis aims to understand what happens inside a deep model when it is trained for saliency prediction. Concretely, we dissect each individual unit inside a deep model trained for saliency prediction. Our analysis discloses the secrets of deep models for saliency prediction as well as their limitations, and gives new insights for future saliency modelling.

    In the second part, we study top-down attention in computer vision. Top-down attention, a mechanism usually built on top of bottom-up attention, has achieved great success in many computer vision tasks. However, this success raises an interesting question: is the learned top-down attention similar to human attention under the same task? To answer this question, we collected a dataset that records human attention during the image captioning task. Using this dataset, we analyse how the attention exploited by a deep model for image captioning differs from human attention under the same task. Our research shows that the currently widely used soft attention mechanism differs from human attention under the same task. Meanwhile, we use human attention as prior knowledge to help the machine perform better on the image captioning task.

    In the third part, we study contextual attention. It is complementary to both bottom-up and top-down attention, and it contextualizes each informative region with attention. Prior contextual attention methods either adopt contextual modules from natural language processing that are only suitable for 1-D sequential inputs, or rely on complex two-stream graph neural networks. Motivated by the difference in semantic units between sentences and images, we design a transformer-based architecture for image captioning. Our design widens the original transformer layer by using the 2-D spatial relationship and achieves competitive performance for image captioning.

    β-Variational autoencoders and transformers for reduced-order modelling of fluid flows

    Full text link
    Variational autoencoder (VAE) architectures have the potential to develop reduced-order models (ROMs) for chaotic fluid flows. We propose a method for learning compact and near-orthogonal ROMs using a combination of a β-VAE and a transformer, tested on numerical data from a two-dimensional viscous flow in both periodic and chaotic regimes. The β-VAE is trained to learn a compact latent representation of the flow velocity, and the transformer is trained to predict the temporal dynamics in latent space. Using the β-VAE to learn disentangled representations in latent space, we obtain a more interpretable flow model with features that resemble those observed in the proper orthogonal decomposition, but with a more efficient representation. Using Poincaré maps, the results show that our method can capture the underlying dynamics of the flow, outperforming other prediction models. The proposed method has potential applications in other fields such as weather forecasting, structural dynamics or biomedical engineering.
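
    For reference, the β-VAE objective mentioned above weights the KL term of a standard VAE by a factor β, which encourages compact, disentangled latent modes. A minimal sketch follows; variable names are assumptions, not the paper's code.

```python
# Minimal sketch of the beta-VAE training objective: reconstruction loss plus a
# KL term scaled by beta, which pushes the latent representation to disentangle.
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta: float = 4.0):
    # x, x_recon: flow-velocity snapshots; mu, logvar: encoder outputs of shape (B, z_dim)
    recon = F.mse_loss(x_recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```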

    Evaluating the Performance of Transformer architecture over Attention architecture on Image Captioning

    Get PDF
    Over the last few decades, computer vision and natural language processing have shown tremendous improvement on tasks such as image captioning, video captioning and machine translation using deep learning models. However, there has been little research on transformer-based image captioning and on how it compares with other models implemented for image captioning. In this study, we design a simple encoder-decoder model, an attention model and a transformer model for image captioning on the Flickr8K dataset, and we discuss the hyperparameters of the models, the type of pre-trained model used and how long the models were trained. Furthermore, we compare the captions generated by the attention model and the transformer model using the BLEU score metric, and analyse them further through a human evaluation conducted using an intrinsic approach. Analysis of the results, based on statistical tests on the BLEU scores and on the human evaluation, shows that the transformer model with multi-head attention outperformed the attention model in image captioning.
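
    The BLEU-based comparison described above can be reproduced in outline with NLTK's corpus_bleu; the captions and references below are placeholders rather than Flickr8K data.

```python
# Sketch of comparing two caption sets against references with BLEU-1.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "dog", "runs", "on", "the", "beach"]]]         # per-image reference sets
attention_caps = [["a", "dog", "is", "running", "on", "sand"]]      # attention-model outputs
transformer_caps = [["a", "dog", "runs", "along", "the", "beach"]]  # transformer outputs

smooth = SmoothingFunction().method1
for name, hyps in [("attention", attention_caps), ("transformer", transformer_caps)]:
    bleu1 = corpus_bleu(references, hyps, weights=(1.0, 0, 0, 0), smoothing_function=smooth)
    print(f"{name} BLEU-1: {bleu1:.3f}")
```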

    A transformer-based Urdu image caption generation

    Get PDF
    Image caption generation has emerged as a remarkable development that bridges the gap between Natural Language Processing (NLP) and Computer Vision (CV). It lies at the intersection of these fields and presents unique challenges, particularly when dealing with low-resource languages such as Urdu. Limited research on basic Urdu language understanding necessitates further exploration in this domain. In this study, we propose three Seq2Seq-based architectures specifically tailored for Urdu image caption generation. Our approach involves leveraging transformer models to generate captions in Urdu, a significantly more challenging task than for English. To facilitate the training and evaluation of our models, we created an Urdu-translated subset of the Flickr8k dataset, which contains images featuring dogs in action accompanied by corresponding Urdu captions. Our models follow a deep learning-based approach with three different architectures: a Convolutional Neural Network (CNN) + Long Short-Term Memory (LSTM) model with soft attention employing word2vec embeddings, a CNN+Transformer model, and a ViT+RoBERTa model. Experimental results demonstrate that our proposed model outperforms existing state-of-the-art approaches, achieving 86 BLEU-1 and 90 BERT-F1 scores. The generated Urdu image captions exhibit syntactic, contextual and semantic correctness. Our study highlights the inherent challenges associated with retraining models on low-resource languages, and our findings highlight the potential of pre-trained models for facilitating the development of NLP and CV applications in low-resource language settings.
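
    As a hedged sketch of the third architecture family mentioned above, a ViT encoder can be paired with a RoBERTa-style decoder for captioning using Hugging Face's VisionEncoderDecoderModel; the checkpoint names below are generic placeholders, not the Urdu models trained in the paper.

```python
# Sketch: pair a vision encoder with a text decoder for caption generation.
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # vision encoder (placeholder checkpoint)
    "roberta-base",                       # text decoder (an Urdu LM would be used instead)
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Required so the decoder knows how to start and pad generated captions.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```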

    Graph neural networks in vision-language image understanding: a survey

    Get PDF
    2D image understanding is a complex problem within computer vision, but it holds the key to providing human-level scene comprehension. It goes further than identifying the objects in an image and instead attempts to understand the scene. Solutions to this problem form the underpinning of a range of tasks, including image captioning, visual question answering (VQA), and image retrieval. Graphs provide a natural way to represent the relational arrangement between objects in an image, and thus, in recent years graph neural networks (GNNs) have become a standard and core architectural component of many 2D image understanding pipelines, especially in the VQA group of tasks. In this survey, we review this rapidly evolving field and provide a taxonomy of graph types used in 2D image understanding approaches, a comprehensive list of the GNN models used in this domain, and a roadmap of future potential developments. To the best of our knowledge, this is the first comprehensive survey that covers image captioning, visual question answering, and image retrieval techniques that focus on using GNNs as the main part of their architecture.
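
    As a concrete illustration of the surveyed idea, the toy PyTorch layer below performs one message-passing step over object nodes connected by scene-graph edges; the mean aggregation rule and shapes are generic assumptions, not any specific surveyed model.

```python
# Toy message-passing step: object nodes exchange features along scene-graph edges.
import torch
import torch.nn as nn

class GraphLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)   # message from each (sender, receiver) pair
        self.update = nn.GRUCell(dim, dim)   # node update from aggregated messages

    def forward(self, nodes: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) object features; edges: (E, 2) long tensor of [source, target]
        src, dst = edges[:, 0], edges[:, 1]
        messages = self.msg(torch.cat([nodes[src], nodes[dst]], dim=-1))  # (E, dim)
        agg = torch.zeros_like(nodes).index_add_(0, dst, messages)        # sum per target node
        deg = torch.zeros(nodes.size(0), 1).index_add_(0, dst, torch.ones(len(dst), 1))
        agg = agg / deg.clamp(min=1)                                      # mean aggregation
        return self.update(agg, nodes)
```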