FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions
Image captioning is a central task in computer vision that has experienced
substantial progress following the advent of vision-language pre-training
techniques. In this paper, we highlight a frequently overlooked limitation of
captioning models: they often fail to capture semantically significant elements.
This drawback can be traced back to the text-image datasets; while their
captions typically offer a general depiction of image content, they frequently
omit salient details. To mitigate this limitation, we propose FuseCap - a novel
method for enriching captions with additional visual information, obtained from
vision experts, such as object detectors, attribute recognizers, and Optical
Character Recognizers (OCR). Our approach fuses the outputs of such vision
experts with the original caption using a large language model (LLM), yielding
enriched captions that present a comprehensive image description. We validate
the effectiveness of the proposed caption enrichment method through both
quantitative and qualitative analysis. Our method is then used to curate the
training set of a BLIP-based captioning model, which surpasses current
state-of-the-art approaches in generating accurate and detailed captions while
using significantly fewer parameters and less training data. As additional
contributions, we provide a dataset comprising 12M images paired with enriched
captions and show that the proposed method largely improves image-text retrieval.
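The fusion step described in the abstract lends itself to a simple prompting pipeline. The sketch below is a minimal illustration, not the authors' implementation: the data structure, the prompt wording, and the placeholder `generate_with_llm` function are all assumptions standing in for whichever vision experts and LLM are actually used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VisionExpertOutputs:
    """Hypothetical container for the outputs of the vision experts."""
    detected_objects: List[str]   # e.g. labels from an object detector
    attributes: List[str]         # e.g. "yellow bus", "wooden table"
    ocr_text: List[str]           # strings read by an OCR engine


def build_fusion_prompt(original_caption: str, experts: VisionExpertOutputs) -> str:
    """Assemble a prompt asking an LLM to fuse expert outputs into the caption."""
    return (
        "Rewrite the image caption so it also covers the listed visual details.\n"
        f"Original caption: {original_caption}\n"
        f"Detected objects: {', '.join(experts.detected_objects)}\n"
        f"Attributes: {', '.join(experts.attributes)}\n"
        f"Text found in the image (OCR): {', '.join(experts.ocr_text)}\n"
        "Enriched caption:"
    )


def generate_with_llm(prompt: str) -> str:
    """Placeholder: plug in a call to whichever LLM performs the fusion."""
    raise NotImplementedError


def enrich_caption(original_caption: str, experts: VisionExpertOutputs) -> str:
    """Produce an enriched caption from the original caption and expert outputs."""
    return generate_with_llm(build_fusion_prompt(original_caption, experts))


# Example usage with made-up expert outputs:
experts = VisionExpertOutputs(
    detected_objects=["bus", "street sign"],
    attributes=["yellow bus"],
    ocr_text=["Route 12"],
)
# enriched = enrich_caption("A bus on a city street.", experts)
```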
Large Scale Retrieval and Generation of Image Descriptions
What is the story of an image? What is the relationship between pictures, language, and the information we can extract using state-of-the-art computational recognition systems? In an attempt to address both of these questions, we explore methods for retrieving and generating natural language descriptions for images. Ideally, we would like our generated textual descriptions (captions) both to sound like a person wrote them and to remain true to the image content. To do this, we develop data-driven approaches for image description generation, using retrieval-based techniques to gather either: (a) whole captions associated with a visually similar image, or (b) relevant bits of text (phrases) from a large collection of image + description pairs. In the case of (b), we develop optimization algorithms to merge the retrieved phrases into valid natural language sentences. The end result is two simple but effective methods for harnessing the power of big data to produce image captions that are altogether more general, relevant, and human-like than previous attempts.
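For strategy (a), whole-caption transfer, the core operation is a nearest-neighbor search over a large captioned collection. The sketch below is a generic illustration under stated assumptions: it uses cosine similarity over precomputed feature vectors, which is not necessarily the image representation or similarity measure used in the paper.

```python
import numpy as np


def retrieve_captions(query_feature: np.ndarray,
                      database_features: np.ndarray,
                      database_captions: list[str],
                      k: int = 5) -> list[str]:
    """Return the captions of the k database images most similar to the query.

    query_feature: (d,) feature vector for the query image
    database_features: (n, d) feature vectors for the captioned collection
    database_captions: n captions, aligned row-for-row with database_features
    """
    # Cosine similarity between the query and every database image.
    q = query_feature / np.linalg.norm(query_feature)
    db = database_features / np.linalg.norm(database_features, axis=1, keepdims=True)
    similarities = db @ q

    # Indices of the k most similar images, highest similarity first.
    top_k = np.argsort(-similarities)[:k]
    return [database_captions[i] for i in top_k]
```

The retrieved captions can be used directly (strategy (a)) or mined for phrases that are then recombined into a new sentence (strategy (b)); the phrase-merging optimization is specific to the paper and is not reproduced here.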
- …