320 research outputs found

FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions

Authors: Bensaid David, Brody Shaked, Ganz Roy, Kimmel Ron, Rotstein Noam
Publication date: 28/05/2023

Image captioning is a central task in computer vision which has experienced substantial progress following the advent of vision-language pre-training techniques. In this paper, we highlight a frequently overlooked limitation of captioning models: they often fail to capture semantically significant elements. This drawback can be traced back to the text-image datasets; while their captions typically offer a general depiction of image content, they frequently omit salient details. To mitigate this limitation, we propose FuseCap, a novel method for enriching captions with additional visual information obtained from vision experts such as object detectors, attribute recognizers, and Optical Character Recognizers (OCR). Our approach fuses the outputs of such vision experts with the original caption using a large language model (LLM), yielding enriched captions that present a comprehensive image description. We validate the effectiveness of the proposed caption enrichment method through both quantitative and qualitative analysis. Our method is then used to curate the training set of a captioning model based on BLIP, which surpasses current state-of-the-art approaches in generating accurate and detailed captions while using significantly fewer parameters and less training data. As additional contributions, we provide a dataset comprising 12M image-enriched caption pairs and show that the proposed method substantially improves image-text retrieval.

arXiv.org e-Print Archive

Large Scale Retrieval and Generation of Image Descriptions

Authors: Berg Alexander C., Berg Tamara L., Choi Yejin, Daumé Hal, Dodge Jesse, Goyal Amit, Han Xufeng, Kulkarni Girish, Kuznetsova Polina, Mensch Alyssa, Mitchell Margaret, Ordonez Vicente, Stratos Karl, Yamaguchi Kota
Publication date: 01/01/2016

What is the story of an image? What is the relationship between pictures, language, and information we can extract using state-of-the-art computational recognition systems? In an attempt to address both of these questions, we explore methods for retrieving and generating natural language descriptions for images. Ideally, we would like our generated textual descriptions (captions) both to sound as if a person wrote them and to remain true to the image content. To do this, we develop data-driven approaches for image description generation, using retrieval-based techniques to gather either (a) whole captions associated with a visually similar image, or (b) relevant bits of text (phrases) from a large collection of image + description pairs. In the case of (b), we develop optimization algorithms to merge the retrieved phrases into valid natural language sentences. The end result is two simple but effective methods for harnessing the power of big data to produce image captions that are altogether more general, relevant, and human-like than previous attempts.

Carolina Digital Repository