Retrieval-Augmented Multimodal Language Modeling
Recent multimodal models such as DALL-E and CM3 have achieved remarkable
progress in text-to-image and image-to-text generation. However, these models
store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the
model parameters, requiring increasingly larger models and training data to
capture more knowledge. To integrate knowledge in a more scalable and modular
way, we propose a retrieval-augmented multimodal model, which enables a base
multimodal model (generator) to refer to relevant knowledge fetched by a
retriever from external memory (e.g., multimodal documents on the web).
Specifically, we implement a retriever using the pretrained CLIP model and a
generator using the CM3 Transformer architecture, and train this model using
the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3),
is the first multimodal model that can retrieve and generate mixtures of text
and images. We show that RA-CM3 significantly outperforms baseline multimodal
models such as DALL-E and CM3 on both image and caption generation tasks (12
FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute
for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel
capabilities such as knowledge-intensive image generation and multimodal
in-context learning.
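To make the retriever–generator split concrete, below is a minimal Python sketch of the dense retrieval step, using the public CLIP weights through the Hugging Face transformers library. The toy text-only memory and the placeholder cm3_generate call are illustrative assumptions, not the paper's actual pipeline; RA-CM3 retrieves multimodal web documents and conditions a CM3-style generator on them.

# Sketch of the retrieval step in a retrieval-augmented multimodal model.
# Uses public CLIP weights via Hugging Face `transformers`; the generator
# call at the end is a hypothetical placeholder, not RA-CM3's actual API.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# External memory: here just a toy list of text documents (RA-CM3's memory
# holds multimodal documents, e.g. image-caption pairs from LAION).
memory = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "The Colosseum is an ancient amphitheatre in Rome.",
    "Mount Fuji is an active stratovolcano in Japan.",
]

def encode_texts(texts):
    """Embed texts with the CLIP text encoder and L2-normalize."""
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = clip.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def retrieve(query, k=2):
    """Return the k memory documents most similar to the query (dense retrieval)."""
    q = encode_texts([query])          # (1, d)
    docs = encode_texts(memory)        # (n, d)
    scores = (q @ docs.T).squeeze(0)   # cosine similarities
    topk = scores.topk(k).indices.tolist()
    return [memory[i] for i in topk]

prompt = "Generate an image of the Eiffel Tower at night."
retrieved = retrieve(prompt)
# The generator conditions on the retrieved documents prepended to the prompt;
# `cm3_generate` is a hypothetical stand-in for the CM3-style decoder.
generator_input = " ".join(retrieved) + " " + prompt
# image = cm3_generate(generator_input)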
InCoder: A Generative Model for Code Infilling and Synthesis
Code is seldom written in a single left-to-right pass and is instead
repeatedly edited and refined. We introduce InCoder, a unified generative model
that can perform program synthesis (via left-to-right generation) as well as
editing (via infilling). InCoder is trained to generate code files from a large
corpus of permissively licensed code, where regions of code have been randomly
masked and moved to the end of each file, allowing code infilling with
bidirectional context. Our model is the first generative model that is able to
directly perform zero-shot code infilling, which we evaluate on challenging
tasks such as type inference, comment generation, and variable re-naming. We
find that the ability to condition on bidirectional context substantially
improves performance on these tasks, while still performing comparably on
standard program synthesis benchmarks in comparison to left-to-right only
models pretrained at similar scale. The InCoder models and code are publicly
released at https://sites.google.com/view/incoder-code-models.
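The infilling objective described above can be illustrated with a toy sketch of the causal-masking transformation: a random span is cut out of a file, replaced by a sentinel, and appended after a matching sentinel at the end, so the model learns to regenerate the span given both left and right context. The sentinel strings and the causal_mask helper below are illustrative, not the exact tokens or preprocessing code of the released models.

# Toy illustration of a causal-masking transformation for infilling training:
# one random span is cut out, replaced by a sentinel, and moved to the end of
# the file. Sentinel strings here are illustrative stand-ins only.
import random

MASK = "<mask:0>"
EOM = "<eom>"

def causal_mask(source: str, rng: random.Random) -> str:
    """Cut one random span out of `source` and move it to the end."""
    if len(source) < 2:
        return source
    start = rng.randrange(0, len(source) - 1)
    end = rng.randrange(start + 1, len(source))
    span = source[start:end]
    # Left context + sentinel + right context, then the span to be regenerated.
    return source[:start] + MASK + source[end:] + MASK + span + EOM

rng = random.Random(0)
code = "def add(a, b):\n    return a + b\n"
print(causal_mask(code, rng))

At inference time the same format supports zero-shot infilling: place the sentinel where code is missing, append it again at the end of the prompt, and sample until an end-of-mask token is produced.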
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented,
token-based, decoder-only multi-modal language model capable of generating and
infilling both text and images. CM3Leon uses the CM3 multi-modal architecture
but additionally shows the extreme benefits of scaling up and tuning on more
diverse instruction-style data. It is the first multi-modal model trained with
a recipe adapted from text-only language models, including a large-scale
retrieval-augmented pre-training stage and a second multi-task supervised
fine-tuning (SFT) stage. It is also a general-purpose model that can do both
text-to-image and image-to-text generation, allowing us to introduce
self-contained contrastive decoding methods that produce high-quality outputs.
Extensive experiments demonstrate that this recipe is highly effective for
multi-modal models. CM3Leon achieves state-of-the-art performance in
text-to-image generation with 5x less training compute than comparable methods
(zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate
unprecedented levels of controllability in tasks ranging from language-guided
image editing to image-controlled generation and segmentation.
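The "self-contained" contrastive decoding mentioned above can be sketched in general terms: the same model scores the next token with and without the conditioning prompt, and tokens whose probability rises under conditioning are preferred. The thresholding and mixing weight below follow common contrastive-decoding practice rather than the paper's exact formulation, and the model(...) call assumes a hypothetical Hugging Face-style interface.

# Minimal sketch of contrastive decoding with a single model: contrast the
# next-token distribution under the full conditioning prompt against the one
# under an impoverished prompt. Hyperparameters and interface are assumptions.
import math
import torch
import torch.nn.functional as F

def contrastive_logits(model, cond_ids, uncond_ids, alpha=0.1, beta=1.0):
    """Contrast conditional and unconditional next-token distributions.

    `model` is any autoregressive LM returning logits of shape
    (batch, seq, vocab); `cond_ids` / `uncond_ids` are token-id tensors for
    the conditioned and unconditioned prompts (hypothetical interface).
    """
    with torch.no_grad():
        cond = model(cond_ids).logits[:, -1, :]
        uncond = model(uncond_ids).logits[:, -1, :]
    logp_c = F.log_softmax(cond, dim=-1)
    logp_u = F.log_softmax(uncond, dim=-1)
    # Plausibility constraint: keep only tokens the conditional model itself
    # rates as likely, then reward tokens whose probability rises with context.
    keep = logp_c >= logp_c.max(dim=-1, keepdim=True).values + math.log(alpha)
    scores = logp_c - beta * logp_u
    return scores.masked_fill(~keep, float("-inf"))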