112 research outputs found
Unpaired Image Captioning via Scene Graph Alignments
Most current image captioning models rely heavily on paired image-caption
datasets. However, collecting large-scale image-caption pairs is
labor-intensive and time-consuming. In this paper, we present a scene
graph-based approach for unpaired image captioning. Our framework comprises an
image scene graph generator, a sentence scene graph generator, a scene graph
encoder, and a sentence decoder. Specifically, we first train the scene graph
encoder and the sentence decoder on the text modality. To align the scene
graphs between images and sentences, we propose an unsupervised feature
alignment method that maps scene graph features from the image modality to the
sentence modality. Experimental results show that our proposed model generates
promising captions without using any image-caption training pairs,
outperforming existing methods by a wide margin.
Comment: Accepted in ICCV 201
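The load-bearing component here is the unsupervised feature alignment that lets the text-trained decoder consume image features. As a rough illustration, the sketch below implements one plausible form of it, adversarial alignment of image scene-graph features to the sentence feature space, in PyTorch; the module names, dimensions, and loss are assumptions for illustration, not the authors' released code.

import torch
import torch.nn as nn

class FeatureMapper(nn.Module):
    # Maps image scene-graph features into the sentence feature space.
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    # Scores how "sentence-like" a feature vector looks.
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(0.2), nn.Linear(dim, 1))

    def forward(self, x):
        return self.net(x)

mapper, critic = FeatureMapper(), Discriminator()
opt_m = torch.optim.Adam(mapper.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def alignment_step(img_feats, sent_feats):
    # img_feats and sent_feats are *unpaired* batches of scene-graph encodings.
    # 1) Critic learns to separate sentence features from mapped image features.
    fake = mapper(img_feats).detach()
    c_loss = bce(critic(sent_feats), torch.ones(sent_feats.size(0), 1)) \
           + bce(critic(fake), torch.zeros(fake.size(0), 1))
    opt_c.zero_grad(); c_loss.backward(); opt_c.step()
    # 2) Mapper learns to make image features indistinguishable from sentence ones.
    m_loss = bce(critic(mapper(img_feats)), torch.ones(img_feats.size(0), 1))
    opt_m.zero_grad(); m_loss.backward(); opt_m.step()
    return c_loss.item(), m_loss.item()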
Unsupervised Cross-lingual Image Captioning
Most recent image captioning work is conducted in English, as the majority of
image-caption datasets are in English. However, there is a large number of
non-native English speakers worldwide, so generating image captions in
different languages is worth exploring. In this paper, we present a novel
unsupervised method to generate image captions without using any caption
corpus. Our method relies on 1) cross-lingual auto-encoding, which learns the
scene graph mapping function along with the scene graph encoders and sentence
decoders on machine translation parallel corpora, and 2) unsupervised feature
mapping, which seeks to map the encoded scene graph features from the image
modality to the sentence modality. By leveraging cross-lingual auto-encoding,
cross-modal feature mapping, and adversarial learning, our method can learn an
image captioner that generates captions in different languages. We verify the
effectiveness of the proposed method on Chinese image caption generation;
comparisons against several baseline methods demonstrate the effectiveness of
our approach.
Comment: 8 page
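Read together, inference composes these learned pieces in sequence. The snippet below is a schematic of that composition under the stated assumptions; every function and module name is a placeholder standing in for a trained component, not the authors' API.

def caption_image_in_chinese(image, sg_parser, sg_encoder,
                             cross_modal_map, cross_lingual_map, zh_decoder):
    # Assumed composition of the described components.
    graph = sg_parser(image)              # image -> scene graph
    feats = sg_encoder(graph)             # scene graph -> feature encoding
    feats = cross_modal_map(feats)        # image modality -> sentence modality
    feats = cross_lingual_map(feats)      # pivot-language space -> Chinese space
    return zh_decoder(feats)              # decode a Chinese caption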
Smart Random Erasing: A Data Augmentation Technique for Image Captioning
Master's thesis, Seoul National University, Department of Computer Science and Engineering, February 2021. Advisor: Sang-goo Lee.
Image captioning is a task in machine learning that aims to automatically generate a natural language description of a given image. It is considered a crucial task because of its broad applications and because it bridges computer vision and natural language processing.
However, image-caption paired datasets are restricted in both quantity and diversity, both of which are essential when training a supervised model. Various approaches, including semi-supervised and unsupervised learning, have been explored, but their results still fall far short of supervised methods. While data augmentation could remedy the data deficiency in this field, existing augmentation techniques are mostly designed for image classification and are not well suited to image captioning.
Thus, in this paper, we introduce a new data augmentation technique designed for image captioning. The proposed Smart Random Erasing (SRE) is inspired by the Random Erasing augmentation technique and remedies its drawbacks to achieve the best performance boost when applied to image captioning. We also borrow an idea from AutoAugment to automatically search for optimal hyperparameters via reinforcement learning. This study shows better results than traditional augmentation techniques and the state-of-the-art augmentation technique RandAugment when applied to image captioning tasks.

Image captioning is a machine learning task that generates a natural language description of a given input image. It matters not only for its broad applications, such as generating assistive descriptions for the visually impaired and improving search engine performance through generated captions, but also as a task that connects natural language processing and computer vision.

However, the image-caption paired datasets needed to train image captioning models are very limited; existing datasets also lack diversity in their sentences and cover only a narrow range of image domains. Unsupervised models have recently been studied to address this, but their performance still falls far short of supervised models.

Data augmentation is another way to mitigate the data shortage. Image data augmentation techniques such as AutoAugment and RandAugment are being actively researched, but most of this work targets image classification and is difficult to apply directly to image captioning.

Therefore, this study first confirms experimentally that the performance of existing data augmentation techniques varies greatly with the task, model, and dataset. We then extend an existing augmentation technique into a new method suited to image captioning and experimentally validate its performance.
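The abstract does not spell out the mechanics of SRE, but its two stated ingredients, object recognition and controlled occlusion, suggest a shape like the sketch below: erase a patch inside a detected object's box while capping the occluded fraction so the object stays recognizable. Everything here (parameter values, fill strategy, names) is a guess for illustration, not the thesis' actual procedure.

import random
import numpy as np

def smart_random_erase(img, boxes, max_occlusion=0.3, p=0.5):
    # img: HxWxC uint8 array; boxes: integer (x1, y1, x2, y2) detections
    # from an off-the-shelf object detector.
    if not boxes or random.random() > p:
        return img
    x1, y1, x2, y2 = random.choice(boxes)        # pick one detected object
    bw, bh = x2 - x1, y2 - y1
    # Cap the erased area at a fraction of the object's area, so the
    # object is partially occluded but still recognizable.
    area = max_occlusion * bw * bh * random.uniform(0.3, 1.0)
    ew = int(min(bw, np.sqrt(area * random.uniform(0.5, 2.0))))  # random aspect
    eh = int(min(bh, area / max(ew, 1)))
    if ew < 1 or eh < 1:
        return img
    ex = random.randint(x1, max(x1, x2 - ew))    # keep patch inside the box
    ey = random.randint(y1, max(y1, y2 - eh))
    out = img.copy()
    out[ey:ey + eh, ex:ex + ew] = np.random.randint(
        0, 256, (eh, ew, img.shape[2]), dtype=np.uint8)  # random-noise fill
    return out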
Contents
Abstract
Contents
Table Contents
Figure Contents
Chapter 1. Introduction
Chapter 2. Related Work
  2.1 Image Captioning Models
  2.2 Image Data Augmentation Techniques
Chapter 3. Smart Random Erasing
  3.1 Object Recognition
  3.2 Object Occlusion
  3.3 Automatic Hyperparameter Search
Chapter 4. Experiments and Results
  4.1 Experimental Settings
  4.2 Evaluation Metrics
  4.3 Experiment Results and Analysis
    4.3.1 Comparison with other DA techniques
    4.3.2 Comparison with original Random Erasing
Chapter 5. Conclusion and Future Work
References
Abstract (in Korean)
Object-Centric Unsupervised Image Captioning
Image captioning is a longstanding problem at the intersection of computer
vision and natural language processing. To date, researchers have achieved
impressive state-of-the-art performance in the age of deep learning. Most of
these state-of-the-art models, however, require large volumes of annotated
image-caption pairs for training. Given an image dataset of interest, a
practitioner must annotate a caption for each image in the training set, and
this process must be repeated for every newly collected dataset. In this
paper, we explore unsupervised image captioning, which trains the model on
unpaired images and texts so that the texts can come from sources other than
the images. One line of research on this topic that has been shown to be
effective constructs pairs from the images and texts in the training set
according to the overlap of their objects. Unlike in the supervised setting,
however, these constructed pairings are not guaranteed to have fully
overlapping object sets. Our work overcomes this by harvesting the objects
corresponding to a given sentence from across the training set, even if they
do not belong to the same image. When used as input to a transformer, such a
mixture of objects enables larger, if not full, object coverage, and when
supervised by the corresponding sentence, it produces results that outperform
current state-of-the-art unsupervised methods by a significant margin.
Building on this finding, we further show that (1) additional information on
the relationships between objects and the attributes of objects also helps
boost performance; and (2) our method extends well to non-English image
captioning, which usually suffers from scarcer annotations. Our findings are
supported by strong empirical results. Our code is available at
https://github.com/zihangm/obj-centric-unsup-caption.
Comment: ECCV 202
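The central trick, harvesting a sentence's objects from the whole image pool rather than a single image, can be sketched in a few lines. The data structures below are assumptions for illustration; the released code at the GitHub link above is the authoritative version.

from collections import defaultdict
import random

def build_object_index(detections):
    # detections: iterable of (image_id, label, feature) triples produced by
    # an off-the-shelf detector over the unpaired image pool.
    # Returns: label -> list of region features gathered across all images.
    index = defaultdict(list)
    for _, label, feat in detections:
        index[label].append(feat)
    return index

def harvest_objects(sentence_objects, index, per_label=1):
    # Collect region features covering the sentence's object set, drawing
    # each label from whichever images happen to contain it.
    feats = []
    for label in sentence_objects:
        pool = index.get(label, [])
        feats.extend(random.sample(pool, min(per_label, len(pool))))
    return feats  # fed to the transformer, supervised by the sentence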
- …