112 research outputs found
Unpaired Image Captioning via Scene Graph Alignments
Most current image captioning models rely heavily on paired image-caption
datasets. However, collecting large-scale image-caption pairs is
labor-intensive and time-consuming. In this paper, we present a scene
graph-based approach for unpaired image captioning. Our framework comprises an
image scene graph generator, a sentence scene graph generator, a scene graph
encoder, and a sentence decoder. Specifically, we first train the scene graph
encoder and the sentence decoder on the text modality. To align the scene
graphs between images and sentences, we propose an unsupervised feature
alignment method that maps scene graph features from the image modality to the
sentence modality. Experimental results show that our proposed model generates
promising captions without using any image-caption training pairs,
outperforming existing methods by a wide margin.
Comment: Accepted in ICCV 201
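The load-bearing component here is the unsupervised feature alignment that lets the text-trained decoder consume image features. As a rough illustration, the sketch below implements one plausible form of it, adversarial alignment of image scene-graph features to the sentence feature space, in PyTorch; the module names, dimensions, and loss are assumptions for illustration, not the authors' released code.

import torch
import torch.nn as nn

class FeatureMapper(nn.Module):
    # Maps image scene-graph features into the sentence feature space.
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    # Scores how "sentence-like" a feature vector looks.
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(0.2), nn.Linear(dim, 1))

    def forward(self, x):
        return self.net(x)

mapper, critic = FeatureMapper(), Discriminator()
opt_m = torch.optim.Adam(mapper.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def alignment_step(img_feats, sent_feats):
    # img_feats and sent_feats are *unpaired* batches of scene-graph encodings.
    # 1) Critic learns to separate sentence features from mapped image features.
    fake = mapper(img_feats).detach()
    c_loss = bce(critic(sent_feats), torch.ones(sent_feats.size(0), 1)) \
           + bce(critic(fake), torch.zeros(fake.size(0), 1))
    opt_c.zero_grad(); c_loss.backward(); opt_c.step()
    # 2) Mapper learns to make image features indistinguishable from sentence ones.
    m_loss = bce(critic(mapper(img_feats)), torch.ones(img_feats.size(0), 1))
    opt_m.zero_grad(); m_loss.backward(); opt_m.step()
    return c_loss.item(), m_loss.item()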
Unsupervised Cross-lingual Image Captioning
Most recent image captioning work is conducted in English, as the majority of
image-caption datasets are in English. However, there is a large number of
non-native English speakers worldwide, so generating image captions in
different languages is worth exploring. In this paper, we present a novel
unsupervised method to generate image captions without using any caption
corpus. Our method relies on 1) cross-lingual auto-encoding, which learns the
scene graph mapping function along with the scene graph encoders and sentence
decoders on machine translation parallel corpora, and 2) unsupervised feature
mapping, which seeks to map the encoded scene graph features from the image
modality to the sentence modality. By leveraging cross-lingual auto-encoding,
cross-modal feature mapping, and adversarial learning, our method can learn an
image captioner that generates captions in different languages. We verify the
effectiveness of the proposed method on Chinese image caption generation;
comparisons against several baseline methods demonstrate the effectiveness of
our approach.
Comment: 8 page
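Read together, inference composes these learned pieces in sequence. The snippet below is a schematic of that composition under the stated assumptions; every function and module name is a placeholder standing in for a trained component, not the authors' API.

def caption_image_in_chinese(image, sg_parser, sg_encoder,
                             cross_modal_map, cross_lingual_map, zh_decoder):
    # Assumed composition of the described components.
    graph = sg_parser(image)              # image -> scene graph
    feats = sg_encoder(graph)             # scene graph -> feature encoding
    feats = cross_modal_map(feats)        # image modality -> sentence modality
    feats = cross_lingual_map(feats)      # pivot-language space -> Chinese space
    return zh_decoder(feats)              # decode a Chinese caption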
Smart Random Erasing: A Data Augmentation Technique for Image Captioning
Master's thesis, Seoul National University, Department of Computer Science and Engineering, February 2021. Advisor: Sang-goo Lee.
Image captioning is a task in machine learning that aims to automatically generate a natural language description of a given image. It is considered a crucial task because of its broad applications and because it bridges computer vision and natural language processing.
However, image-caption paired datasets are restricted in both quantity and diversity, both of which are essential when training a supervised model. Various approaches, including semi-supervised and unsupervised learning, have been explored, but their results still fall far short of supervised methods. While data augmentation could remedy the data deficiency in this field, existing augmentation techniques are mostly designed for image classification and are not well suited to image captioning.
Thus, in this paper, we introduce a new data augmentation technique designed for image captioning. The proposed Smart Random Erasing (SRE) is inspired by the Random Erasing augmentation technique and remedies its drawbacks to achieve the best performance boost when applied to image captioning. We also borrow an idea from AutoAugment to automatically search for optimal hyperparameters via reinforcement learning. This study shows better results than traditional augmentation techniques and the state-of-the-art augmentation technique RandAugment when applied to image captioning tasks.

Image captioning is a machine learning task that generates a natural language description of a given input image. It matters not only for its broad applications, such as generating assistive descriptions for the visually impaired and improving search engine performance through generated captions, but also as a task that connects natural language processing and computer vision.

However, the image-caption paired datasets needed to train image captioning models are very limited; existing datasets also lack diversity in their sentences and cover only a narrow range of image domains. Unsupervised models have recently been studied to address this, but their performance still falls far short of supervised models.

Data augmentation is another way to mitigate the data shortage. Image data augmentation techniques such as AutoAugment and RandAugment are being actively researched, but most of this work targets image classification and is difficult to apply directly to image captioning.

Therefore, this study first confirms experimentally that the performance of existing data augmentation techniques varies greatly with the task, model, and dataset. We then extend an existing augmentation technique into a new method suited to image captioning and experimentally validate its performance.
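The abstract does not spell out the mechanics of SRE, but its two stated ingredients, object recognition and controlled occlusion, suggest a shape like the sketch below: erase a patch inside a detected object's box while capping the occluded fraction so the object stays recognizable. Everything here (parameter values, fill strategy, names) is a guess for illustration, not the thesis' actual procedure.

import random
import numpy as np

def smart_random_erase(img, boxes, max_occlusion=0.3, p=0.5):
    # img: HxWxC uint8 array; boxes: integer (x1, y1, x2, y2) detections
    # from an off-the-shelf object detector.
    if not boxes or random.random() > p:
        return img
    x1, y1, x2, y2 = random.choice(boxes)        # pick one detected object
    bw, bh = x2 - x1, y2 - y1
    # Cap the erased area at a fraction of the object's area, so the
    # object is partially occluded but still recognizable.
    area = max_occlusion * bw * bh * random.uniform(0.3, 1.0)
    ew = int(min(bw, np.sqrt(area * random.uniform(0.5, 2.0))))  # random aspect
    eh = int(min(bh, area / max(ew, 1)))
    if ew < 1 or eh < 1:
        return img
    ex = random.randint(x1, max(x1, x2 - ew))    # keep patch inside the box
    ey = random.randint(y1, max(y1, y2 - eh))
    out = img.copy()
    out[ey:ey + eh, ex:ex + ew] = np.random.randint(
        0, 256, (eh, ew, img.shape[2]), dtype=np.uint8)  # random-noise fill
    return out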
Contents
Abstract
Contents
Table Contents
Figure Contents
Chapter 1. Introduction
Chapter 2. Related Work
  2.1 Image Captioning Models
  2.2 Image Data Augmentation Techniques
Chapter 3. Smart Random Erasing
  3.1 Object Recognition
  3.2 Object Occlusion
  3.3 Automatic Hyperparameter Search
Chapter 4. Experiments and Results
  4.1 Experimental Settings
  4.2 Evaluation Metrics
  4.3 Experiment Results and Analysis
    4.3.1 Comparison with other DA techniques
    4.3.2 Comparison with original Random Erasing
Chapter 5. Conclusion and Future Work
References
Abstract (in Korean)
Object-Centric Unsupervised Image Captioning
Image captioning is a longstanding problem at the intersection of computer
vision and natural language processing. To date, researchers have achieved
impressive state-of-the-art performance in the age of deep learning. Most of
these state-of-the-art models, however, require large volumes of annotated
image-caption pairs for training. Given an image dataset of interest, a
practitioner must annotate a caption for each image in the training set, and
this process must be repeated for every newly collected dataset. In this
paper, we explore unsupervised image captioning, which trains the model on
unpaired images and texts so that the texts can come from sources other than
the images. One line of research on this topic that has been shown to be
effective constructs pairs from the images and texts in the training set
according to the overlap of their objects. Unlike in the supervised setting,
however, these constructed pairings are not guaranteed to have fully
overlapping object sets. Our work overcomes this by harvesting the objects
corresponding to a given sentence from across the training set, even if they
do not belong to the same image. When used as input to a transformer, such a
mixture of objects enables larger, if not full, object coverage, and when
supervised by the corresponding sentence, it produces results that outperform
current state-of-the-art unsupervised methods by a significant margin.
Building on this finding, we further show that (1) additional information on
the relationships between objects and the attributes of objects also helps
boost performance; and (2) our method extends well to non-English image
captioning, which usually suffers from scarcer annotations. Our findings are
supported by strong empirical results. Our code is available at
https://github.com/zihangm/obj-centric-unsup-caption.
Comment: ECCV 202
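The central trick, harvesting a sentence's objects from the whole image pool rather than a single image, can be sketched in a few lines. The data structures below are assumptions for illustration; the released code at the GitHub link above is the authoritative version.

from collections import defaultdict
import random

def build_object_index(detections):
    # detections: iterable of (image_id, label, feature) triples produced by
    # an off-the-shelf detector over the unpaired image pool.
    # Returns: label -> list of region features gathered across all images.
    index = defaultdict(list)
    for _, label, feat in detections:
        index[label].append(feat)
    return index

def harvest_objects(sentence_objects, index, per_label=1):
    # Collect region features covering the sentence's object set, drawing
    # each label from whichever images happen to contain it.
    feats = []
    for label in sentence_objects:
        pool = index.get(label, [])
        feats.extend(random.sample(pool, min(per_label, len(pool))))
    return feats  # fed to the transformer, supervised by the sentence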
- …