LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
How to efficiently transform large language models (LLMs) into instruction
followers has recently become a popular research direction, while training LLMs for
multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter
demonstrates the potential to handle visual inputs with LLMs, it still cannot
generalize well to open-ended visual instructions and lags behind GPT-4. In
this paper, we present LLaMA-Adapter V2, a parameter-efficient visual
instruction model. Specifically, we first augment LLaMA-Adapter by unlocking
more learnable parameters (e.g., norm, bias and scale), which distributes the
instruction-following ability across the entire LLaMA model beyond the adapters.
Secondly, we propose an early fusion strategy to feed visual tokens only into
the early LLM layers, contributing to better visual knowledge incorporation.
Thirdly, a joint training paradigm of image-text pairs and
instruction-following data is introduced by optimizing disjoint groups of
learnable parameters. This strategy effectively alleviates the interference
between the two tasks of image-text alignment and instruction following and
achieves strong multi-modal reasoning with only a small-scale image-text and
instruction dataset. During inference, we incorporate additional expert models
(e.g. captioning/OCR systems) into LLaMA-Adapter to further enhance its image
understanding capability without incurring training costs. Compared to the
original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal
instructions by merely introducing 14M parameters over LLaMA. The newly
designed framework also exhibits stronger language-only instruction-following
capabilities and even excels in chat interactions. Our code and models are
available at https://github.com/ZrrSkywalker/LLaMA-Adapter.
A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models
Prompt engineering is a technique that involves augmenting a large
pre-trained model with task-specific hints, known as prompts, to adapt the
model to new tasks. Prompts can be created manually as natural language
instructions or generated automatically as either natural language instructions
or vector representations. Prompt engineering makes it possible to perform
predictions based solely on prompts without updating model parameters, and eases
the application of large pre-trained models to real-world tasks. In past
years, prompt engineering has been well studied in natural language processing.
Recently, it has also been intensively studied in vision-language modeling.
However, there is currently a lack of a systematic overview of prompt
engineering on pre-trained vision-language models. This paper aims to provide a
comprehensive survey of cutting-edge research in prompt engineering on three
types of vision-language models: multimodal-to-text generation models (e.g.
Flamingo), image-text matching models (e.g. CLIP), and text-to-image generation
models (e.g. Stable Diffusion). For each type of model, a brief model summary,
prompting methods, prompting-based applications, and the corresponding
responsibility and integrity issues are summarized and discussed. Furthermore,
the commonalities and differences between prompting on vision-language models,
language models, and vision models are also discussed. The challenges, future
directions, and research opportunities are summarized to foster future research
on this topic.
Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field
that aims to design computer agents with intelligent capabilities such as
understanding, reasoning, and learning through integrating multiple
communicative modalities, including linguistic, acoustic, visual, tactile, and
physiological messages. With the recent interest in video understanding,
embodied autonomous agents, text-to-image generation, and multisensor fusion in
application domains such as healthcare and robotics, multimodal machine
learning has brought unique computational and theoretical challenges to the
machine learning community given the heterogeneity of data sources and the
interconnections often found between modalities. However, the breadth of
progress in multimodal research has made it difficult to identify the common
themes and open questions in the field. By synthesizing a broad range of
application domains and theoretical frameworks from both historical and recent
perspectives, this paper is designed to provide an overview of the
computational and theoretical foundations of multimodal machine learning. We
start by defining two key principles of modality heterogeneity and
interconnections that have driven subsequent innovations, and propose a
taxonomy of 6 core technical challenges: representation, alignment, reasoning,
generation, transference, and quantification, covering historical and recent
trends. Recent technical achievements will be presented through the lens of
this taxonomy, allowing researchers to understand the similarities and
differences across new approaches. We end by motivating several open problems
for future research as identified by our taxonomy.
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns
visual concepts, words, and semantic parsing of sentences without explicit
supervision on any of them; instead, our model learns by simply looking at
images and reading paired questions and answers. Our model builds an
object-based scene representation and translates sentences into executable,
symbolic programs. To bridge the learning of two modules, we use a
neuro-symbolic reasoning module that executes these programs on the latent
scene representation. Analogous to human concept learning, the perception
module learns visual concepts based on the language description of the object
being referred to. Meanwhile, the learned visual concepts facilitate learning
new words and parsing new sentences. We use curriculum learning to guide the
search over the large compositional space of images and language. Extensive
experiments demonstrate the accuracy and efficiency of our model on learning
visual concepts, word representations, and semantic parsing of sentences.
Further, our method allows easy generalization to new object attributes,
compositions, language concepts, scenes and questions, and even new program
domains. It also empowers applications including visual question answering and
bidirectional image-text retrieval.
Comment: ICLR 2019 (Oral). Project page: http://nscl.csail.mit.edu