Co-training Submodels for Visual Recognition
We introduce submodel co-training, a regularization method related to
co-training, self-distillation and stochastic depth. Given a neural network to
be trained, for each sample we implicitly instantiate two altered networks,
"submodels", with stochastic depth: we activate only a subset of the layers.
Each network serves as a soft teacher to the other by providing a loss that
complements the regular loss provided by the one-hot label. Our approach,
dubbed cosub, uses a single set of weights, and does not involve a pre-trained
external model or temporal averaging.
Experimentally, we show that submodel co-training is effective for training
backbones for recognition tasks such as image classification and semantic
segmentation. Our approach is compatible with multiple architectures, including
RegNet, ViT, PiT, XCiT, Swin and ConvNeXt, and our training strategy improves
their results in comparable settings. For instance, a ViT-B pretrained with
cosub on ImageNet-21k obtains 87.4% top-1 accuracy at resolution 448 on
ImageNet-val.
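
The mechanism described above can be summarized in a short sketch. This is an
illustration only, not the authors' code: the layer_mask argument, the 0.5 drop
rate and the KL-based soft-teacher loss below are assumptions made for the
example.

    # Illustrative cosub-style training step (sketch, not the authors' code).
    import torch
    import torch.nn.functional as F

    def cosub_step(model, x, y, drop_rate=0.5, lam=1.0):
        # Two random layer subsets define two implicit "submodels" that share
        # a single set of weights (stochastic depth).
        n_layers = len(model.blocks)            # assumes a ViT-style block list
        mask_a = torch.rand(n_layers) > drop_rate
        mask_b = torch.rand(n_layers) > drop_rate
        logits_a = model(x, layer_mask=mask_a)  # hypothetical layer_mask argument
        logits_b = model(x, layer_mask=mask_b)
        # Regular loss from the one-hot labels, for both submodels.
        ce = F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y)
        # Each submodel is a soft teacher for the other (teacher side detached).
        kl_ab = F.kl_div(F.log_softmax(logits_a, dim=-1),
                         F.softmax(logits_b.detach(), dim=-1),
                         reduction="batchmean")
        kl_ba = F.kl_div(F.log_softmax(logits_b, dim=-1),
                         F.softmax(logits_a.detach(), dim=-1),
                         reduction="batchmean")
        return ce + lam * (kl_ab + kl_ba)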
Code Llama: Open Foundation Models for Code
We release Code Llama, a family of large language models for code based on
Llama 2, providing state-of-the-art performance among open models, infilling
capabilities, support for large input contexts, and zero-shot
instruction-following ability for programming tasks. We provide multiple
flavors to cover a
wide range of applications: foundation models (Code Llama), Python
specializations (Code Llama - Python), and instruction-following models (Code
Llama - Instruct) with 7B, 13B and 34B parameters each. All models are trained
on sequences of 16k tokens and show improvements on inputs with up to 100k
tokens. The 7B and 13B Code Llama and Code Llama - Instruct variants support
infilling based on surrounding content. Code Llama reaches state-of-the-art
performance among open models on several code benchmarks, with scores of up to
53% and 55% on HumanEval and MBPP, respectively. Notably, Code Llama - Python
7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform
every other publicly available model on MultiPL-E. We release Code Llama under
a permissive license that allows for both research and commercial use.
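
As a usage illustration of the infilling capability, the sketch below assumes
the Hugging Face transformers library, the codellama/CodeLlama-7b-hf checkpoint
and its <FILL_ME> infilling convention; these specifics are assumptions of the
example, not part of the abstract.

    # Hedged sketch: fill-in-the-middle with a 7B Code Llama checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
    model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

    # The model completes the gap between the surrounding prefix and suffix;
    # the <FILL_ME> marker is assumed to be expanded by the tokenizer.
    prompt = 'def remove_non_ascii(s: str) -> str:\n    """ <FILL_ME>\n    return result\n'
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output[0], skip_special_tokens=True))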
Are Large-scale Datasets Necessary for Self-Supervised Pre-training?
Pre-training models on large-scale datasets, like ImageNet, is a standard
practice in computer vision. This paradigm is especially effective for tasks
with small training sets, for which high-capacity models tend to overfit. In
this work, we consider a self-supervised pre-training scenario that only
leverages the target task data. We consider datasets, like Stanford Cars,
Sketch or COCO, that are one or more orders of magnitude smaller than ImageNet.
Our study shows that denoising autoencoders, such as BEiT or a variant that we
introduce in this paper, are more robust to the type and size of the
pre-training data than popular self-supervised methods trained by comparing
image embeddings. We obtain competitive performance compared to ImageNet
pre-training on a variety of classification datasets from different domains. On
COCO, when pre-training solely using COCO images, the detection and instance
segmentation performance surpasses that of supervised ImageNet pre-training in
a comparable setting.
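
The denoising-autoencoder pre-training discussed above can be sketched as a
masked-patch reconstruction objective. The 75% masking ratio, the generic
encoder and decoder callables and the pixel-regression target below are
illustrative assumptions; BEiT itself predicts discrete visual tokens rather
than raw pixels.

    # Minimal sketch of a denoising-autoencoder (masked-patch) pre-training loss.
    import torch
    import torch.nn.functional as F

    def masked_reconstruction_loss(encoder, decoder, patches, mask_ratio=0.75):
        # patches: (batch, num_patches, patch_dim), e.g. flattened 16x16 crops.
        b, n, d = patches.shape
        mask = torch.rand(b, n, device=patches.device) < mask_ratio
        corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)  # corrupt input
        reconstruction = decoder(encoder(corrupted))  # predict all patches
        # The loss is computed only on the masked (corrupted) positions.
        return F.mse_loss(reconstruction[mask], patches[mask])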