
2 research outputs found

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Author: Beaumont Romain
Coombes Theo
Jitsev Jenia
Kaczmarczyk Robert
Katta Aarush
Komatsuzaki Aran
Mullis Clayton
Schuhmann Christoph
Vencu Richard
Publication date: 01/01/2021

Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g. CLIP, DALL-E) gained a recent surge, showing remarkable capability to perform zero- or few-shot learning and transfer even in absence of per-sample labels on target image data. Despite this trend, to date there has been no publicly available datasets of sufficient scale for training such models from scratch. To address this issue, in a community effort we build and release for public LAION-400M, a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search.
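
The two mechanisms named in the abstract, similarity-based filtering of image-text pairs and nearest-neighbour lookup over embeddings, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the 2-D vectors stand in for real CLIP embeddings, the threshold of 0.3 is an arbitrary illustrative value, the function names are hypothetical, and an exact brute-force search is used in place of a prebuilt kNN index.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def clip_filter(pairs, threshold):
    """Keep (image_emb, text_emb) pairs whose similarity meets the threshold."""
    return [p for p in pairs if cosine(p[0], p[1]) >= threshold]

def knn(query, embeddings, k):
    """Exact brute-force search: indices of the k most similar embeddings."""
    order = sorted(
        range(len(embeddings)),
        key=lambda i: cosine(query, embeddings[i]),
        reverse=True,
    )
    return order[:k]

# Toy 2-D vectors standing in for image and text embeddings (assumption).
pairs = [
    ([1.0, 0.0], [0.9, 0.1]),  # image and text agree: kept
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, similarity 0: dropped
    ([0.0, 1.0], [0.2, 0.8]),  # image and text agree: kept
]

kept = clip_filter(pairs, threshold=0.3)  # illustrative threshold
image_embs = [img for img, _ in kept]
nearest = knn([0.1, 0.9], image_embs, k=1)
```

Here `kept` holds the pairs that survive filtering, and `nearest` holds the position within `image_embs` of the embedding closest to the query. At the scale described in the abstract, a linear scan like this would be too slow, which is why precomputed indices are useful.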

arXiv.org e-Print Archive

Juelich Shared Electronic Resources

LAION-5B: An open large-scale dataset for training next generation image-text models

Author: Beaumont Romain
Cherti Mehdi
Coombes Theo
Crowson Katherine
Gordon Cade
Jitsev Jenia
Kaczmarczyk Robert
Katta Aarush
Kundurthy Srivatsa
Mullis Clayton
Schmidt Ludwig
Schramowski Patrick
Schuhmann Christoph
Vencu Richard
Wightman Ross
Wortsman Mitchell
Publication venue: Curran Associates, Inc.
Publication date: 01/01/2022

Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on expensive accurate labels used in standard vision unimodal supervised learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN, BASIC, GLIDE, Flamingo and Imagen made further improvements. Studying the training and capabilities of such models requires datasets containing billions of image-text pairs. Until now, no datasets of this size have been made openly available for the broader research community. To address this problem and democratize research on large-scale multi-modal models, we present LAION-5B - a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language. We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset, and discuss further experiments enabled with an openly available dataset of this scale. Additionally we provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation, and detection scores for watermark, NSFW, and toxic content detection.

Juelich Shared Electronic Resources