7 research outputs found

    A Smart Random Erasing Data Augmentation Technique for Image Captioning

    Master's thesis -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2021. Advisor: 이상구.

    Image captioning is a machine learning task that aims to automatically generate a natural language description of a given image. It is considered a crucial task because of its broad applications and because it bridges computer vision and natural language processing. However, image-caption paired datasets are limited in both quantity and diversity, which matters greatly when training a supervised model. Various approaches have been explored, including semi-supervised and unsupervised learning, but their results still fall far short of supervised approaches. While data augmentation can address this data deficiency, existing augmentation techniques are often designed for image classification and are not suitable for image captioning. Thus, in this paper, we introduce a new data augmentation technique designed for image captioning. The proposed Smart Random Erasing (SRE) is inspired by the Random Erasing augmentation technique and remedies its drawbacks to achieve the best performance boost when applied to image captioning. We also borrow from AutoAugment the idea of automatically searching for optimal hyperparameters via reinforcement learning. This study shows better results than traditional augmentation techniques and the state-of-the-art augmentation technique RandAugment when applied to image captioning tasks.

    (Korean abstract, translated:) Image captioning is a machine learning task that generates a natural language description of a given input image. It is important both for its broad range of applications, such as generating assistive captions for the visually impaired and improving search engine performance through generated captions, and as a task that connects natural language processing and computer vision. However, the image-caption paired datasets needed to train captioning models are very limited, and existing datasets lack diversity in their sentences and cover only a narrow range of image domains. Unsupervised models have recently been studied to address this, but their performance still lags far behind that of supervised models. Another way to mitigate data scarcity is data augmentation. Image data augmentation is an active research area (e.g., AutoAugment, RandAugment), but most of this work targets image classification and is difficult to apply directly to image captioning. In this study, we therefore show experimentally that the performance of existing augmentation techniques varies greatly with the task, model, and dataset. We then extend an existing augmentation technique into a new one suited to image captioning and experimentally validate its performance.

    Contents: Abstract; Contents; Table Contents; Figure Contents; Chapter 1. Introduction; Chapter 2. Related Work (2.1 Image Captioning Models, 2.2 Image Data Augmentation Techniques); Chapter 3. Smart Random Erasing (3.1 Object Recognition, 3.2 Object Occlusion, 3.3 Automatic Hyperparameter Search); Chapter 4. Experiments and Results (4.1 Experimental Settings, 4.2 Evaluation Metrics, 4.3 Experiment Results and Analysis: 4.3.1 Comparison with other DA techniques, 4.3.2 Comparison with original Random Erasing); Chapter 5. Conclusion and Future Work; References; 초록 (Korean abstract).
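    The contents above suggest that SRE pairs object recognition with targeted occlusion. As a rough illustration of what an object-aware erasing step could look like, here is a minimal Python sketch; the function name, parameter ranges, and noise fill are illustrative assumptions, and the thesis's actual procedure (including its reinforcement-learning hyperparameter search) is not reproduced here.

    ```python
    # Hypothetical sketch of an object-aware random-erasing step; names and
    # parameter values are illustrative, not the thesis's implementation.
    import random
    import numpy as np

    def smart_random_erase(image, boxes, p=0.5, max_area_ratio=0.4):
        """Partially occlude one detected object with random noise.

        image: H x W x C uint8 array.
        boxes: list of integer (x1, y1, x2, y2) boxes from any detector.
        p: probability of applying the augmentation.
        max_area_ratio: cap on the erased fraction of the chosen box, so
            the object stays recognizable and its caption remains valid.
        """
        if not boxes or random.random() > p:
            return image
        x1, y1, x2, y2 = random.choice(boxes)
        bw, bh = x2 - x1, y2 - y1
        # Erase a random sub-rectangle inside the object box only, unlike
        # vanilla Random Erasing, which samples anywhere in the image.
        area = random.uniform(0.1, max_area_ratio) * bw * bh
        aspect = random.uniform(0.5, 2.0)
        ew = int(min(bw, (area * aspect) ** 0.5))
        eh = int(min(bh, (area / aspect) ** 0.5))
        if ew < 1 or eh < 1:
            return image
        ex = random.randint(x1, x2 - ew)
        ey = random.randint(y1, y2 - eh)
        out = image.copy()
        out[ey:ey + eh, ex:ex + ew] = np.random.randint(
            0, 256, size=(eh, ew, image.shape[2]), dtype=np.uint8)
        return out
    ```

    Restricting the erased patch to a detected object, rather than sampling it anywhere in the frame, is one plausible way to keep the augmented image consistent with its caption, which vanilla Random Erasing does not guarantee.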

    Unsupervised Cross-lingual Image Captioning

    Most recent image captioning work is conducted in English, as the majority of image-caption datasets are in English. However, there is a large number of non-native English speakers worldwide, so generating image captions in different languages is worth exploring. In this paper, we present a novel unsupervised method to generate image captions without using any caption corpus. Our method relies on 1) cross-lingual auto-encoding, which learns a scene graph mapping function along with scene graph encoders and sentence decoders on machine translation parallel corpora, and 2) unsupervised feature mapping, which seeks to map encoded scene graph features from the image modality to the sentence modality. By leveraging cross-lingual auto-encoding, cross-modal feature mapping, and adversarial learning, our method can learn an image captioner that generates captions in different languages. We verify the effectiveness of our proposed method on Chinese image caption generation. Comparisons against several baseline methods demonstrate the effectiveness of our approach. Comment: 8 pages.
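    To make the second component more concrete, below is a minimal PyTorch sketch of adversarial cross-modal feature mapping in the spirit the abstract describes: a mapper pushes image-modality scene-graph features toward the sentence-modality feature space, and a discriminator tries to tell them apart. All module shapes, names, and hyperparameters are assumptions for illustration, not the paper's implementation.

    ```python
    # Illustrative sketch of adversarial cross-modal feature mapping;
    # module names and sizes are assumed, not taken from the paper.
    import torch
    import torch.nn as nn

    feat_dim = 512  # assumed scene-graph feature size

    # Maps image-modality scene-graph features into the sentence-modality
    # feature space learned by the cross-lingual auto-encoder.
    mapper = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                           nn.Linear(feat_dim, feat_dim))
    # Discriminator scores whether a feature looks like a real sentence
    # feature; the mapper is trained to fool it.
    disc = nn.Sequential(nn.Linear(feat_dim, 256), nn.LeakyReLU(0.2),
                         nn.Linear(256, 1))
    bce = nn.BCEWithLogitsLoss()
    opt_m = torch.optim.Adam(mapper.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

    def train_step(img_feats, sent_feats):
        # 1) Update discriminator: sentence features are "real" (1),
        #    mapped image features are "fake" (0).
        mapped = mapper(img_feats).detach()
        d_loss = (bce(disc(sent_feats), torch.ones(len(sent_feats), 1))
                  + bce(disc(mapped), torch.zeros(len(mapped), 1)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # 2) Update mapper: make mapped image features indistinguishable
        #    from sentence features.
        g_loss = bce(disc(mapper(img_feats)),
                     torch.ones(len(img_feats), 1))
        opt_m.zero_grad(); g_loss.backward(); opt_m.step()
        return d_loss.item(), g_loss.item()
    ```

    Once such a mapping is learned, image-derived scene graph features can, in principle, be fed to the sentence decoder trained on parallel corpora, which is how the abstract's pipeline avoids needing any image-caption pairs.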

    Text-image synergy for multimodal retrieval and annotation

    Text and images are the two most common data modalities found on the Internet. Understanding the synergy between text and images, that is, seamlessly analyzing information from these modalities, may be trivial for humans, but it is challenging for software systems. In this dissertation we study problems where deciphering text-image synergy is crucial for finding solutions. We propose methods and ideas that establish semantic connections between text and images in multimodal content, and we empirically show their effectiveness in four interconnected problems: Image Retrieval, Image Tag Refinement, Image-Text Alignment, and Image Captioning. Our promising results and observations open up interesting avenues for future research involving text-image data understanding.

    (German abstract, translated:) Text and images are the two most common kinds of content on the Internet. While it is easy for humans to extract information precisely from the interplay of text and image content, this combined presentation of content poses great challenges for software systems. This dissertation studies problems whose solution requires understanding the interplay of text and image content. Methods and proposals that establish semantic connections between text and images in multimodal data are presented and empirically evaluated. We present four interconnected text-image problems in this dissertation:
    • Image retrieval. Whether images are found via text-based search queries depends strongly on whether the text near the image matches the query. Images without textual context, or even with thematically fitting context but without direct matches between the available keywords and the query, often cannot be found. As a remedy, we propose using three kinds of information in combination: visual information (in the form of automatically generated image descriptions), textual information (keywords from previous search queries), and commonsense knowledge.
    • Improved image annotations. Object detection by computer vision frequently produces misdetections and incoherences. Correct identification of image content, however, is an important prerequisite for retrieving images via textual queries. To reduce the error-proneness of object detection, we propose incorporating commonsense knowledge. Through additional image annotations that common sense deems thematically fitting, many erroneous and incoherent detections can be avoided.
    • Image-text placement. On web pages with text and image content (such as news sites, blog posts, and social media articles), images are usually placed at semantically meaningful positions in the flow of the text. We exploit this to propose a framework in which relevant images are selected and associated with the matching sections of a text.
    • Image captions. Images that serve as part of multimodal content to improve the readability of texts typically have captions that fit the context of the surrounding text. We propose incorporating this context when automatically generating image captions as well; usually, only the image itself is analyzed for this purpose. We introduce context-aware image caption generation.
    Our promising observations and results open up interesting opportunities for further research on the computational understanding of the interplay between text and image content.
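    As a toy illustration of the retrieval idea in the first bullet above (combining textual context, generated captions, and commonsense expansions into one ranking signal), here is a hypothetical scoring sketch. The functions, weights, and term sources are invented for illustration and are not the dissertation's method.

    ```python
    # Toy sketch of combining three retrieval signals; the scoring
    # functions and weights are illustrative assumptions.
    def keyword_overlap(query, words):
        q = set(query.lower().split())
        return len(q & set(w.lower() for w in words)) / max(len(q), 1)

    def score_image(query, context_terms, generated_caption,
                    commonsense_terms, w=(0.4, 0.4, 0.2)):
        """Rank an image for a text query using:
        - textual context (e.g., keywords from earlier queries),
        - visual evidence (an automatically generated caption),
        - commonsense expansions of the image's tags."""
        s_text = keyword_overlap(query, context_terms)
        s_visual = keyword_overlap(query, generated_caption.split())
        s_common = keyword_overlap(query, commonsense_terms)
        return w[0] * s_text + w[1] * s_visual + w[2] * s_common

    # Usage: an image with no matching context terms can still be found
    # through its generated caption or commonsense-related tags.
    print(score_image("dog playing fetch",
                      context_terms=["park", "outdoor"],
                      generated_caption="a dog catching a ball in a park",
                      commonsense_terms=["pet", "fetch", "ball"]))
    ```

    The point of the combination is the fallback behavior: when the surrounding text carries no matching keywords, the visual and commonsense signals can still surface the image for the query.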