An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild
Zero-shot learning (ZSL) methods have been studied in the unrealistic setting
where test data are assumed to come from unseen classes only. In this paper, we
advocate studying the problem of generalized zero-shot learning (GZSL) where
the test data's class memberships are unconstrained. We show empirically that
naively using the classifiers constructed by ZSL approaches does not perform
well in the generalized setting. Motivated by this, we propose a simple but
effective calibration method that can be used to balance two conflicting
forces: recognizing data from seen classes versus those from unseen ones. We
develop a performance metric to characterize such a trade-off and examine the
utility of this metric in evaluating various ZSL approaches. Our analysis
further shows that there is a large gap between the performance of existing
approaches and an upper bound established via idealized semantic embeddings,
suggesting that improving class semantic embeddings is vital to GZSL.
Comment: ECCV 2016 camera-ready
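The calibration idea described above can be pictured in a few lines: subtract a constant from the scores of seen classes before taking the argmax, then sweep that constant to trade seen-class accuracy against unseen-class accuracy. The NumPy sketch below is one plausible reading of such a calibration rule, with an illustrative score matrix, seen-class mask, and `gamma` range, not necessarily the paper's exact formulation:

```python
import numpy as np

def calibrated_predict(scores, seen_mask, gamma):
    """Pick a class after penalizing seen-class scores by a constant.

    scores:    (num_samples, num_classes) compatibility scores from any ZSL model
    seen_mask: (num_classes,) boolean array, True for seen classes
    gamma:     calibration constant; larger values favor unseen classes
    """
    adjusted = scores - gamma * seen_mask.astype(scores.dtype)
    return adjusted.argmax(axis=1)

# Sweeping gamma traces out the seen-vs-unseen accuracy trade-off; the area
# under that curve is one way to summarize GZSL performance in a single number.
scores = np.random.randn(4, 10)          # illustrative scores for 4 samples
seen_mask = np.arange(10) < 6            # classes 0-5 assumed seen
for gamma in np.linspace(-1.0, 1.0, 5):
    preds = calibrated_predict(scores, seen_mask, gamma)
```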
Weakly Supervised Content Selection for Improved Image Captioning
Image captioning involves identifying semantic concepts in the scene and
describing them in fluent natural language. Recent approaches do not explicitly
model the semantic concepts and train the model only for the end goal of
caption generation. Such models lack interpretability and controllability,
primarily due to sub-optimal content selection. We address this problem by
breaking down the captioning task into two simpler, manageable and more
controllable tasks -- skeleton prediction and skeleton-based caption
generation. We approach the former as a weakly supervised task, using a simple
off-the-shelf language syntax parser and avoiding the need for additional human
annotations; the latter uses a supervised-learning approach. We investigate
three methods of conditioning the caption on the skeleton: in the encoder, in
the decoder, and in both. Our compositional model generates significantly
higher-quality captions on out-of-domain test images, as judged by human
annotators.
Additionally, we demonstrate that English skeletons transfer effectively to
caption generation in other languages, including French, Italian, German,
Spanish, and Hindi. This compositional formulation also points toward unpaired
image captioning, reducing the dependence on expensive image-caption pairs.
Furthermore, we investigate the use of skeletons as a knob to control certain
properties of the generated caption, such as length, content, and gender
expression.
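As a rough illustration of the weakly supervised skeleton step, an off-the-shelf parser can reduce a caption to its content words without any extra annotation. The sketch below assumes the skeleton is the sequence of noun-chunk head lemmas, which may differ from the paper's definition; spaCy and its `en_core_web_sm` model stand in for the unspecified syntax parser:

```python
import spacy  # off-the-shelf syntax parser; assumes `en_core_web_sm` is installed

nlp = spacy.load("en_core_web_sm")

def caption_skeleton(caption):
    """Reduce a caption to a content-word skeleton (here: noun-chunk heads).

    The paper's actual skeleton definition may differ; this only illustrates
    how weak supervision can come from a parser instead of human annotators.
    """
    doc = nlp(caption)
    return [chunk.root.lemma_ for chunk in doc.noun_chunks]

print(caption_skeleton("A young girl is playing with a brown dog in the park"))
# e.g. -> ['girl', 'dog', 'park']
```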
What You See is What You Read? Improving Text-Image Alignment Evaluation
Automatically determining whether a text and a corresponding image are
semantically aligned is a significant challenge for vision-language models,
with applications in generative text-to-image and image-to-text tasks. In this
work, we study methods for automatic text-image alignment evaluation. We first
introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets
from both text-to-image and image-to-text generation tasks, with human
judgements of whether a given text-image pair is semantically aligned. We then
describe two automatic methods to determine alignment: the first involving a
pipeline based on question generation and visual question answering models, and
the second employing an end-to-end classification approach by finetuning
multimodal pretrained models. Both methods surpass prior approaches in various
text-image alignment tasks, with significant improvements in challenging cases
that involve complex composition or unnatural images. Finally, we demonstrate
how our approaches can localize specific misalignments between an image and a
given text, and how they can be used to automatically re-rank candidates in
text-to-image generation.
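The first method can be pictured as a small pipeline: generate questions from the text, answer each question once from the text and once from the image, and score alignment by agreement. A hedged sketch, where `gen_questions`, `answer_from_text`, and `answer_from_image` are placeholder callables standing in for pretrained QG, QA, and VQA models, and mean agreement is one plausible aggregation rather than the paper's exact scoring:

```python
def qg_vqa_alignment(text, image, gen_questions, answer_from_text, answer_from_image):
    """Score text-image alignment via question generation + VQA.

    gen_questions(text)         -> list of questions about the text (QG model)
    answer_from_text(q, text)   -> answer grounded in the text (QA model)
    answer_from_image(q, image) -> answer grounded in the image (VQA model)
    All three callables are placeholders for pretrained models.
    """
    questions = gen_questions(text)
    if not questions:
        return 0.0
    agreements = sum(
        answer_from_text(q, text).strip().lower()
        == answer_from_image(q, image).strip().lower()
        for q in questions
    )
    return agreements / len(questions)
```

Because each question is answered independently, disagreements also localize which part of the text the image fails to depict, matching the misalignment-localization use described above.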
PreSTU: Pre-Training for Scene-Text Understanding
The ability to recognize and reason about text embedded in visual inputs is
often lacking in vision-and-language (V&L) models, perhaps because V&L
pre-training objectives rarely target this ability explicitly. In this paper,
we propose PreSTU, a novel pre-training
recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware
pre-training objectives that encourage the model to recognize text from an
image and connect it to the rest of the image content. We implement PreSTU
using a simple transformer-based encoder-decoder architecture, combined with
large-scale image-text datasets with scene text obtained from an off-the-shelf
OCR system. We empirically demonstrate the effectiveness of this pre-training
approach on eight visual question answering and four image captioning
benchmarks.
Comment: Accepted to ICCV 2023
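One way to picture an OCR-aware objective is a split-and-predict task: the model is shown the image together with a prefix of its OCR transcript and must generate the remaining scene text. The sketch below is an assumption-laden illustration of such an example constructor, with hypothetical field names and prompt formatting, not the paper's exact recipe:

```python
import random

def make_ocr_pretraining_example(image_id, ocr_tokens, rng=random):
    """Build one OCR-aware pre-training pair: the model sees the image plus a
    prefix of its OCR transcript and must generate the remaining scene text.

    The split-and-predict formulation is one plausible reading of an
    'OCR-aware objective'; the exact recipe in the paper may differ.
    """
    split = rng.randrange(len(ocr_tokens) + 1)
    prompt = "read remaining text: " + " ".join(ocr_tokens[:split])
    target = " ".join(ocr_tokens[split:])
    return {"image": image_id, "input_text": prompt, "target_text": target}

# OCR tokens would come from an off-the-shelf OCR system run over the image.
example = make_ocr_pretraining_example("img_0042", ["OPEN", "24", "HOURS"])
```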