
    Unpaired Image Captioning via Scene Graph Alignments

    Most current image captioning models rely heavily on paired image-caption datasets. However, collecting large-scale image-caption paired data is labor-intensive and time-consuming. In this paper, we present a scene graph-based approach for unpaired image captioning. Our framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. Specifically, we first train the scene graph encoder and the sentence decoder on the text modality. To align the scene graphs between images and sentences, we propose an unsupervised feature alignment method that maps the scene graph features from the image modality to the sentence modality. Experimental results show that our proposed model can generate quite promising results without using any image-caption training pairs, outperforming existing methods by a wide margin.
    Comment: Accepted in ICCV 201
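
    To make the alignment idea above concrete, the following is a minimal sketch, not the authors' code, of adversarially mapping image-side scene-graph features into the sentence feature space: a small mapper network plays the generator and a discriminator tries to distinguish mapped image features from real sentence features. Module names, dimensions, and the single training step are illustrative assumptions.

```python
# Minimal sketch (assumptions throughout): adversarial alignment of image
# scene-graph features with sentence scene-graph features.
import torch
import torch.nn as nn

FEAT_DIM = 512  # assumed scene-graph feature size

class FeatureMapper(nn.Module):
    """Maps image scene-graph features into the sentence feature space."""
    def __init__(self, dim=FEAT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Scores whether a feature vector comes from the sentence modality."""
    def __init__(self, dim=FEAT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.LeakyReLU(0.2), nn.Linear(dim, 1))
    def forward(self, x):
        return self.net(x)

def alignment_step(mapper, disc, img_feats, sent_feats, opt_g, opt_d):
    bce = nn.BCEWithLogitsLoss()
    # 1) Discriminator: real = sentence features, fake = mapped image features.
    fake = mapper(img_feats).detach()
    d_loss = bce(disc(sent_feats), torch.ones(sent_feats.size(0), 1)) + \
             bce(disc(fake), torch.zeros(fake.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Mapper: fool the discriminator so mapped features look "sentence-like".
    g_loss = bce(disc(mapper(img_feats)), torch.ones(img_feats.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

    In the setting described in the abstract, the sentence decoder pre-trained on the text modality would then decode the mapped features into captions.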

    Unsupervised Cross-lingual Image Captioning

    Most recent image captioning works are conducted in English, as the majority of image-caption datasets are in English. However, there is a large number of non-native English speakers worldwide, so generating image captions in different languages is worth exploring. In this paper, we present a novel unsupervised method to generate image captions without using any caption corpus. Our method relies on 1) cross-lingual auto-encoding, which learns the scene graph mapping function along with the scene graph encoders and sentence decoders on machine translation parallel corpora, and 2) unsupervised feature mapping, which seeks to map the encoded scene graph features from the image modality to the sentence modality. By leveraging cross-lingual auto-encoding, cross-modal feature mapping, and adversarial learning, our method can learn an image captioner that generates captions in different languages. We verify the effectiveness of our proposed method on Chinese image caption generation, and comparisons against several baseline methods demonstrate the effectiveness of our approach.
    Comment: 8 pages
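
    As a rough illustration of the cross-lingual auto-encoding signal described above, the sketch below encodes a source-language scene-graph sequence, passes it through a learned mapping function, and decodes a target-language sentence with a reconstruction loss on machine-translation pairs. This is a hedged simplification: the module choices (GRUs, a linear mapping) and dimensions are assumptions, not the paper's architecture.

```python
# Minimal sketch (assumed architecture) of the cross-lingual auto-encoding loss.
import torch
import torch.nn as nn

class CrossLingualCaptioner(nn.Module):
    def __init__(self, dim=512, vocab_size=10000):
        super().__init__()
        self.src_encoder = nn.GRU(dim, dim, batch_first=True)   # e.g. English scene-graph encoder
        self.mapping = nn.Linear(dim, dim)                       # scene-graph mapping function
        self.tgt_decoder = nn.GRU(dim, dim, batch_first=True)    # e.g. Chinese sentence decoder
        self.out = nn.Linear(dim, vocab_size)

    def autoencode_loss(self, src_embeds, tgt_embeds, tgt_labels):
        """Reconstruct the target-language sentence from mapped source features."""
        _, h = self.src_encoder(src_embeds)           # encode source-language scene graph
        h = self.mapping(h)                           # map into the target-language space
        dec_out, _ = self.tgt_decoder(tgt_embeds, h)  # teacher-forced decoding
        logits = self.out(dec_out)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tgt_labels.reshape(-1))
```

    The unsupervised image-to-sentence feature mapping would then be trained adversarially, analogous to the sketch under the previous abstract.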

    ์ด๋ฏธ์ง€ ์บก์…”๋‹์„ ์œ„ํ•œ ์Šค๋งˆํŠธ ๋žœ๋ค์ด๋ ˆ์ด์ง• ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• ๊ธฐ๋ฒ•

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021. 2. ์ด์ƒ๊ตฌ.Image captioning is a task in machine learning that aims to automatically generate a natural language description of a given image. It is considered a crucial task because of its broad applications and the fact that it is a bridge between computer vision and natural language processing. However, image-caption paired dataset is restricted in both quantity and diversity, which is essential when training a supervised model. Various approaches have been made including semi-supervised and unsupervised learning, but the result is still far from that of supervised approach. While data augmentation can be the solution for data deficiency in the field, existing data augmentation techniques are often designed for image classification tasks and are not suitable for image captioning tasks. Thus, in this paper, we introduce a new data augmentation technique designed for image captioning. The proposed Smart Random Erasing (SRE) is inspired from the Random Erasing augmentation technique, and it complements the drawbacks of Random Erasing to achieve the best performance boost when applied to image captioning. We also derive idea from AutoAugment to automatically search optimal hyperparameters via reinforcement learning. This study shows better results than the traditional augmentation techniques and the state-of-the-art augmentation technique RandAugment when applied to image captioning tasks.์ด๋ฏธ์ง€ ์บก์…”๋‹์ด๋ž€ ์ž…๋ ฅ์ด ์ด๋ฏธ์ง€๋กœ ์ฃผ์–ด์กŒ์„ ๋•Œ, ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ์ž์—ฐ์–ด ๋ฌ˜์‚ฌ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋จธ์‹ ๋Ÿฌ๋‹์˜ ํ•œ ๊ณผ์ œ์ด๋‹ค. ์ด๋ฏธ์ง€ ์บก์…”๋‹์€ ์‹œ๊ฐ์žฅ์• ์ธ์„ ์œ„ํ•œ ๋ณด์กฐ์ž๋ง‰ ์ƒ์„ฑ, ์บก์…˜ ์ƒ์„ฑ์„ ํ†ตํ•œ ๊ฒ€์ƒ‰์—”์ง„ ์„ฑ๋Šฅ ํ–ฅ์ƒ ๋“ฑ ๋ฐฉ๋Œ€ํ•œ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ๊ฐ€์งˆ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์™€ ์ปดํ“จํ„ฐ ๋น„์ „ ๋ถ„์•ผ๋ฅผ ์—ฐ๊ฒฐํ•˜๋Š” ๊ณผ์ œ๋กœ์„œ ์ค‘์š”์„ฑ์„ ์ง€๋‹ˆ๊ณ  ์žˆ๋‹ค. ํ•˜์ง€๋งŒ, ์ด๋ฏธ์ง€ ์บก์…”๋‹ ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋Š”๋ฐ ํ•„์š”ํ•œ ์ด๋ฏธ์ง€-์บก์…˜์˜ ์Œ์œผ๋กœ๋œ ๋ฐ์ดํ„ฐ์…‹์€ ๋งค์šฐ ํ•œ์ •๋˜์–ด ์žˆ๊ณ , ํ˜„์กดํ•˜๋Š” ๋ฐ์ดํ„ฐ์…‹๋“ค ๋˜ํ•œ ์ƒ์„ฑ๋˜๋Š” ๋ฌธ์žฅ๋“ค์˜ ๋‹ค์–‘์„ฑ์ด ๋ถ€์กฑํ•˜๋ฉฐ ์ด๋ฏธ์ง€ ๋ถ„์•ผ๋„ ๋งค์šฐ ์ œํ•œ์ ์ด๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ตœ๊ทผ์—” ๋น„์ง€๋„ ํ•™์Šต ๋ชจ๋ธ์˜ ์—ฐ๊ตฌ๋„ ์ง„ํ–‰๋˜์—ˆ์œผ๋‚˜, ํ˜„์žฌ๋กœ์„œ๋Š” ์ง€๋„ ํ•™์Šต ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๋”ฐ๋ผ๊ฐ€๊ธฐ์—” ์•„์ง ํ•œ์ฐธ ๋ถ€์กฑํ•˜๋‹ค. ๋ฐ์ดํ„ฐ ๋ถ€์กฑ ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•œ ๋˜ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• ๊ธฐ๋ฒ•์ด ์žˆ๋‹ค. ์ตœ๊ทผ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• ๊ธฐ๋ฒ•์€ AutoAugment, RandAugment ๋“ฑ ํ™œ๋ฐœํ•˜๊ฒŒ ์—ฐ๊ตฌ๊ฐ€ ์ง„ํ–‰๋˜๊ณ  ์žˆ์œผ๋‚˜, ๋Œ€๋ถ€๋ถ„์˜ ์—ฐ๊ตฌ๋“ค์ด ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ ๋ฌธ์ œ๋ฅผ ์œ„ํ•œ ๊ธฐ๋ฒ•๋“ค์ด๊ณ , ์ด๋ฅผ ๊ทธ๋Œ€๋กœ ์ด๋ฏธ์ง€ ์บก์…”๋‹ ๋ฌธ์ œ์— ์ ์šฉํ•˜๊ธฐ์—” ์–ด๋ ค์›€์ด ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์‹คํ—˜์„ ํ†ตํ•ด ๊ธฐ์กด์˜ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• ๊ธฐ๋ฒ•์ด ๋ฌธ์ œ, ๋ชจ๋ธ, ๋ฐ์ดํ„ฐ์…‹์— ๋”ฐ๋ผ ์„ฑ๋Šฅ์ด ๋งค์šฐ ๋‹ฌ๋ผ์ง„๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ธฐ์กด์˜ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• ๊ธฐ๋ฒ•์„ ๋ฐœ์ „์‹œ์ผœ ์ด๋ฏธ์ง€ ์บก์…”๋‹ ๋ฌธ์ œ์— ์ ํ•ฉํ•œ ์ƒˆ๋กœ์šด ๊ธฐ๋ฒ•์„ ๊ฐœ๋ฐœํ•˜๊ณ , ํ•ด๋‹น ๊ธฐ๋ฒ•์˜ ์„ฑ๋Šฅ์„ ์‹คํ—˜์ ์œผ๋กœ ๊ฒ€์ฆํ•œ๋‹ค.Contents Abstract........................................................... โ…ฐ Contents ........................................................ โ…ฑi Table Contents................................................... iv Figure Contents .................................................. v Chapter 1. Introduction........................................... 1 Chapter 2. 
Related Work ........................................ 3 2.1 Image Captioning Models.......................................................... 3 2.2 Image Data Augmentation Techniques..................................................... 5 Chapter 3. Smart Random Erasing .............................. 7 3.1 Object Recognition .................................................................... 8 3.2 Object Occlusion......................................................................... 9 3.3 Automatic Hyperparameter Search....................................... 11 Chapter 4. Experiments and Results........................... 13 4.1 Experimental Settings.............................................................. 13 4.2 Evaluation Metrics.................................................................... 14 4.3 Experiment Results and Analysis........................................... 16 4.3.1 Comparison with other DA techniques........................... 17 4.3.2 Comparison with original Random Erasing.................... 21 Chapter 5. Conclusion and Future Work...................... 22 References ...................................................... 24 ์ดˆ๋ก............................................................... 26Maste
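
    The object-occlusion step outlined in Chapter 3 can be pictured with the following hypothetical sketch: instead of erasing an arbitrary rectangle as plain Random Erasing does, it erases a patch inside one detected object's bounding box. The function name, box format, and erasing policy are assumptions for illustration, not the thesis implementation.

```python
# Hypothetical sketch of an object-aware Random Erasing step: occlude part of
# a detected object rather than an arbitrary image region.
import random
import torch

def object_aware_erase(img, boxes, erase_frac=0.5, fill=0.0):
    """img: (C, H, W) tensor; boxes: list of (x1, y1, x2, y2) detected objects."""
    if not boxes:
        return img
    x1, y1, x2, y2 = random.choice(boxes)          # pick one detected object
    bw, bh = max(1, x2 - x1), max(1, y2 - y1)
    ew, eh = max(1, int(bw * erase_frac)), max(1, int(bh * erase_frac))
    ex = random.randint(x1, max(x1, x2 - ew))      # place the erased patch
    ey = random.randint(y1, max(y1, y2 - eh))      # inside the object box
    img = img.clone()
    img[:, ey:ey + eh, ex:ex + ew] = fill          # occlude part of the object
    return img
```

    Hyperparameters such as the erased fraction would, per the abstract, be searched automatically (AutoAugment-style) rather than fixed as in this sketch.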

    Object-Centric Unsupervised Image Captioning

    Image captioning is a longstanding problem in the fields of computer vision and natural language processing. To date, researchers have achieved impressive state-of-the-art performance in the age of deep learning. Most of these state-of-the-art models, however, require a large volume of annotated image-caption pairs to train. Given an image dataset of interest, a practitioner needs to annotate a caption for each image in the training set, and this process must be repeated for every newly collected image dataset. In this paper, we explore the task of unsupervised image captioning, which utilizes unpaired images and texts to train the model so that the texts can come from different sources than the images. A main school of research on this topic that has been shown to be effective is to construct pairs from the images and texts in the training set according to the overlap of their objects. Unlike in the supervised setting, however, these constructed pairings are not guaranteed to have fully overlapping sets of objects. Our work overcomes this by harvesting objects corresponding to a given sentence from the whole training set, even if they do not belong to the same image. When used as input to a transformer, such a mixture of objects enables larger, if not full, object coverage, and when supervised by the corresponding sentence, produces results that outperform current state-of-the-art unsupervised methods by a significant margin. Building upon this finding, we further show that (1) additional information on relationships between objects and attributes of objects also helps boost performance; and (2) our method extends well to non-English image captioning, which usually suffers from scarcer annotations. Our findings are supported by strong empirical results. Our code is available at https://github.com/zihangm/obj-centric-unsup-caption.
    Comment: ECCV 202
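
    The harvesting step described above can be sketched roughly as follows: build an index from object labels to detected-region features over the whole (unpaired) image set, then, for each training sentence, gather features whose labels match the objects the sentence mentions, even when they come from different images. The data structures and function names here are assumptions, not the released code at the repository above.

```python
# Minimal sketch (assumed data structures) of harvesting objects for a sentence
# from the whole image set to form a pseudo-paired visual input.
from collections import defaultdict
import random
import torch

def build_object_index(detections):
    """detections: iterable of (label, feature_tensor) over the whole image set."""
    index = defaultdict(list)
    for label, feat in detections:
        index[label].append(feat)
    return index

def harvest_objects(sentence_tokens, object_index, per_label=3):
    """Return a (num_objects, feat_dim) tensor of features matching the sentence."""
    feats = []
    for tok in sentence_tokens:
        pool = object_index.get(tok, [])
        feats.extend(random.sample(pool, min(per_label, len(pool))))
    if not feats:
        return torch.empty(0, 0)
    return torch.stack(feats)   # mixture of objects fed to the transformer
```

    The resulting mixture of object features, supervised by the sentence itself, stands in for the missing image-caption pair during training.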
    • โ€ฆ
    corecore