Retrieval-Augmented Multimodal Language Modeling
Recent multimodal models such as DALL-E and CM3 have achieved remarkable
progress in text-to-image and image-to-text generation. However, these models
store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the
model parameters, requiring increasingly larger models and training data to
capture more knowledge. To integrate knowledge in a more scalable and modular
way, we propose a retrieval-augmented multimodal model, which enables a base
multimodal model (generator) to refer to relevant knowledge fetched by a
retriever from external memory (e.g., multimodal documents on the web).
Specifically, we implement a retriever using the pretrained CLIP model and a
generator using the CM3 Transformer architecture, and train this model using
the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3),
is the first multimodal model that can retrieve and generate mixtures of text
and images. We show that RA-CM3 significantly outperforms baseline multimodal
models such as DALL-E and CM3 on both image and caption generation tasks (12
FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute
for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel
capabilities such as knowledge-intensive image generation and multimodal
in-context learning.
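To make the retriever–generator split concrete, below is a minimal Python sketch of the dense retrieval step, using the public CLIP weights through the Hugging Face transformers library. The toy text-only memory and the placeholder cm3_generate call are illustrative assumptions, not the paper's actual pipeline; RA-CM3 retrieves multimodal web documents and conditions a CM3-style generator on them.

# Sketch of the retrieval step in a retrieval-augmented multimodal model.
# Uses public CLIP weights via Hugging Face `transformers`; the generator
# call at the end is a hypothetical placeholder, not RA-CM3's actual API.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# External memory: here just a toy list of text documents (RA-CM3's memory
# holds multimodal documents, e.g. image-caption pairs from LAION).
memory = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "The Colosseum is an ancient amphitheatre in Rome.",
    "Mount Fuji is an active stratovolcano in Japan.",
]

def encode_texts(texts):
    """Embed texts with the CLIP text encoder and L2-normalize."""
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = clip.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def retrieve(query, k=2):
    """Return the k memory documents most similar to the query (dense retrieval)."""
    q = encode_texts([query])          # (1, d)
    docs = encode_texts(memory)        # (n, d)
    scores = (q @ docs.T).squeeze(0)   # cosine similarities
    topk = scores.topk(k).indices.tolist()
    return [memory[i] for i in topk]

prompt = "Generate an image of the Eiffel Tower at night."
retrieved = retrieve(prompt)
# The generator conditions on the retrieved documents prepended to the prompt;
# `cm3_generate` is a hypothetical stand-in for the CM3-style decoder.
generator_input = " ".join(retrieved) + " " + prompt
# image = cm3_generate(generator_input)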
InCoder: A Generative Model for Code Infilling and Synthesis
Code is seldom written in a single left-to-right pass and is instead
repeatedly edited and refined. We introduce InCoder, a unified generative model
that can perform program synthesis (via left-to-right generation) as well as
editing (via infilling). InCoder is trained to generate code files from a large
corpus of permissively licensed code, where regions of code have been randomly
masked and moved to the end of each file, allowing code infilling with
bidirectional context. Our model is the first generative model that is able to
directly perform zero-shot code infilling, which we evaluate on challenging
tasks such as type inference, comment generation, and variable re-naming. We
find that the ability to condition on bidirectional context substantially
improves performance on these tasks, while still performing comparably on
standard program synthesis benchmarks in comparison to left-to-right only
models pretrained at similar scale. The InCoder models and code are publicly
released at https://sites.google.com/view/incoder-code-models.
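The infilling objective described above can be illustrated with a toy sketch of the causal-masking transformation: a random span is cut out of a file, replaced by a sentinel, and appended after a matching sentinel at the end, so the model learns to regenerate the span given both left and right context. The sentinel strings and the causal_mask helper below are illustrative, not the exact tokens or preprocessing code of the released models.

# Toy illustration of a causal-masking transformation for infilling training:
# one random span is cut out, replaced by a sentinel, and moved to the end of
# the file. Sentinel strings here are illustrative stand-ins only.
import random

MASK = "<mask:0>"
EOM = "<eom>"

def causal_mask(source: str, rng: random.Random) -> str:
    """Cut one random span out of `source` and move it to the end."""
    if len(source) < 2:
        return source
    start = rng.randrange(0, len(source) - 1)
    end = rng.randrange(start + 1, len(source))
    span = source[start:end]
    # Left context + sentinel + right context, then the span to be regenerated.
    return source[:start] + MASK + source[end:] + MASK + span + EOM

rng = random.Random(0)
code = "def add(a, b):\n    return a + b\n"
print(causal_mask(code, rng))

At inference time the same format supports zero-shot infilling: place the sentinel where code is missing, append it again at the end of the prompt, and sample until an end-of-mask token is produced.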
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented,
token-based, decoder-only multi-modal language model capable of generating and
infilling both text and images. CM3Leon uses the CM3 multi-modal architecture
but additionally shows the extreme benefits of scaling up and tuning on more
diverse instruction-style data. It is the first multi-modal model trained with
a recipe adapted from text-only language models, including a large-scale
retrieval-augmented pre-training stage and a second multi-task supervised
fine-tuning (SFT) stage. It is also a general-purpose model that can do both
text-to-image and image-to-text generation, allowing us to introduce
self-contained contrastive decoding methods that produce high-quality outputs.
Extensive experiments demonstrate that this recipe is highly effective for
multi-modal models. CM3Leon achieves state-of-the-art performance in
text-to-image generation with 5x less training compute than comparable methods
(zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate
unprecedented levels of controllability in tasks ranging from language-guided
image editing to image-controlled generation and segmentation.
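The "self-contained" contrastive decoding mentioned above can be sketched in general terms: the same model scores the next token with and without the conditioning prompt, and tokens whose probability rises under conditioning are preferred. The thresholding and mixing weight below follow common contrastive-decoding practice rather than the paper's exact formulation, and the model(...) call assumes a hypothetical Hugging Face-style interface.

# Minimal sketch of contrastive decoding with a single model: contrast the
# next-token distribution under the full conditioning prompt against the one
# under an impoverished prompt. Hyperparameters and interface are assumptions.
import math
import torch
import torch.nn.functional as F

def contrastive_logits(model, cond_ids, uncond_ids, alpha=0.1, beta=1.0):
    """Contrast conditional and unconditional next-token distributions.

    `model` is any autoregressive LM returning logits of shape
    (batch, seq, vocab); `cond_ids` / `uncond_ids` are token-id tensors for
    the conditioned and unconditioned prompts (hypothetical interface).
    """
    with torch.no_grad():
        cond = model(cond_ids).logits[:, -1, :]
        uncond = model(uncond_ids).logits[:, -1, :]
    logp_c = F.log_softmax(cond, dim=-1)
    logp_u = F.log_softmax(uncond, dim=-1)
    # Plausibility constraint: keep only tokens the conditional model itself
    # rates as likely, then reward tokens whose probability rises with context.
    keep = logp_c >= logp_c.max(dim=-1, keepdim=True).values + math.log(alpha)
    scores = logp_c - beta * logp_u
    return scores.masked_fill(~keep, float("-inf"))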