Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Berrios, William; Kiela, Douwe; Mittal, Gautam; Singh, Amanpreet; Thrush, Tristan

Publication date: 28 June 2023

Abstract

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo.

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2306.16410

Last time updated on 02/07/2023