Scaling MLPs: A Tale of Inductive Bias
In this work we revisit the most fundamental building block in deep learning,
the multi-layer perceptron (MLP), and study the limits of its performance on
vision tasks. Empirical insights into MLPs are important for multiple reasons.
(1) Given the recent narrative "less inductive bias is better", popularized by
transformers eclipsing convolutional models, it is natural to explore the
limits of this hypothesis. To that end, MLPs offer an ideal test bed, being
completely free of any inductive bias. (2) MLPs have almost exclusively been
the main protagonist in the deep learning theory literature due to their
mathematical simplicity, serving as a proxy to explain empirical phenomena
observed for more complex architectures. Surprisingly, experimental datapoints
for MLPs are very difficult to find in the literature, especially when coupled
with large pre-training protocols. This discrepancy between practice and theory
is worrying: Do MLPs reflect the empirical advances exhibited by practical
models? Or do theorists need to rethink the role of MLPs as a proxy? We provide
insights into both these aspects. We show that the performance of MLPs
drastically improves with scale (93% on CIFAR10, 79% on CIFAR100, 69% on
TinyImageNet), highlighting that lack of inductive bias can indeed be
compensated for. We observe that MLPs faithfully mimic the behaviour of their
modern counterparts, although some components of the learning setting
surprisingly exhibit stronger or unexpected behaviours. Thanks to the inherent
computational efficiency of MLPs, large pre-training experiments become more
accessible to academic researchers: all of our experiments were run on a
single GPU.
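
To make the setting concrete, here is a minimal PyTorch sketch of the kind of architecture studied: a plain MLP acting on flattened images, with no convolutions or attention. Widths, depth, and activation are illustrative choices, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# A plain MLP on flattened 32x32x3 images (e.g. CIFAR10).
# Sizes are illustrative, not the paper's configuration.
class MLP(nn.Module):
    def __init__(self, in_dim=32 * 32 * 3, width=1024, depth=6, num_classes=10):
        super().__init__()
        layers = [nn.Flatten()]  # discard all spatial structure up front
        dims = [in_dim] + [width] * depth
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.GELU()]
        layers.append(nn.Linear(width, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MLP()
logits = model(torch.randn(8, 3, 32, 32))  # batch of 8 images
print(logits.shape)  # torch.Size([8, 10])
```

Flattening the input up front is what removes the spatial inductive bias: the model receives pixels as a flat vector and must learn any locality structure from the data alone.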
Random Teachers are Good Teachers
In this work, we investigate the implicit regularization induced by
teacher-student learning dynamics in self-distillation. To isolate its effect,
we describe a simple experiment where we consider teachers at random
initialization instead of trained teachers. Surprisingly, when distilling a
student into such a random teacher, we observe that the resulting model and its
representations already possess very interesting characteristics: (1) we
observe a strong improvement of the distilled student over its teacher in terms
of probing accuracy. (2) The learned representations are data-dependent and
transferable between different tasks but deteriorate strongly if trained on
random inputs. (3) The student checkpoint contains sparse subnetworks,
so-called lottery tickets, and lies on the border of linear basins in the
supervised loss landscape. These observations have interesting consequences for
several important areas in machine learning: (1) Self-distillation can work
solely based on the implicit regularization present in the gradient dynamics
without relying on any dark knowledge, (2) self-supervised learning can learn
features even in the absence of data augmentation and (3) training dynamics
during the early phase of supervised training do not necessarily require label
information. Finally, we shed light on an intriguing local property of the loss
landscape: the process of feature learning is strongly amplified if the student
is initialized closely to the teacher. These results raise interesting
questions about the nature of the landscape that have remained unexplored so
far. Code is available at https://github.com/safelix/dinopl.
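
A minimal sketch of the core experiment, under the assumption of a simple encoder and an MSE objective (the paper's actual setup uses DINO-style projection heads and losses): the teacher is frozen at random initialization and the student is trained to match its outputs on data.

```python
import torch
import torch.nn.functional as F

def make_net():
    # Stand-in encoder; the paper uses larger backbones with projection heads.
    return torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(3 * 32 * 32, 512), torch.nn.ReLU(),
        torch.nn.Linear(512, 256),
    )

teacher, student = make_net(), make_net()   # two independent random inits
for p in teacher.parameters():
    p.requires_grad_(False)                 # teacher is never trained

opt = torch.optim.SGD(student.parameters(), lr=0.1)
for _ in range(100):                        # stand-in for a loader over real images
    x = torch.randn(64, 3, 32, 32)
    loss = F.mse_loss(student(x), teacher(x))  # match the frozen random targets
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The abstract's final observation corresponds to initializing `student` as a small perturbation of `teacher` rather than independently, which reportedly amplifies feature learning.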
Harnessing Synthetic Datasets: The Role of Shape Bias in Deep Neural Network Generalization
Recent advancements in deep learning have been primarily driven by the use of
large models trained on increasingly vast datasets. While neural scaling laws
have emerged to predict network performance given a specific level of
computational resources, the growing demand for expansive datasets raises
concerns. To address this, a new research direction has emerged, focusing on
the creation of synthetic data as a substitute. In this study, we investigate
how neural networks exhibit shape bias during training on synthetic datasets,
serving as an indicator of the synthetic data quality. Specifically, our
findings indicate three key points: (1) Shape bias varies across network
architectures and types of supervision, casting doubt on its reliability as a
predictor for generalization and its ability to explain differences in model
recognition compared to human capabilities. (2) Relying solely on shape bias to
estimate generalization is unreliable, as it is entangled with diversity and
naturalism. (3) We propose a novel interpretation of shape bias as a tool for
estimating the diversity of samples within a dataset. Our research aims to
clarify the implications of using synthetic data and its associated shape bias
in deep learning, addressing concerns regarding generalization and dataset
quality.
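
For context, shape bias is typically measured on cue-conflict images whose shape and texture come from different classes. Below is a hedged sketch of that standard metric; `model` and `cue_conflict_loader` are hypothetical stand-ins, and the paper's exact evaluation protocol may differ.

```python
import torch

# Shape bias: among trials where the model picks either the shape label or
# the texture label of a cue-conflict image, the fraction of shape picks.
@torch.no_grad()
def shape_bias(model, cue_conflict_loader):
    shape_hits, cue_hits = 0, 0
    for images, shape_labels, texture_labels in cue_conflict_loader:
        preds = model(images).argmax(dim=1)
        shape_hits += (preds == shape_labels).sum().item()
        cue_hits += ((preds == shape_labels) | (preds == texture_labels)).sum().item()
    return shape_hits / max(cue_hits, 1)  # 1.0 = pure shape, 0.0 = pure texture
```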
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard
to scale to long sequences. Despite several works trying to reduce their
computational cost, most LLMs still apply attention between all pairs of
tokens in the sequence, thus incurring a quadratic cost. In this study, we
present a novel approach that dynamically prunes contextual information while
preserving the model's expressiveness, resulting in reduced memory and
computational requirements during inference. Our method employs a learnable
mechanism that determines which uninformative tokens can be dropped from the
context at any point across the generation process. By doing so, our approach
not only addresses performance concerns but also enhances interpretability,
providing valuable insight into the model's decision-making process. Our
technique can be applied to existing pre-trained models through a
straightforward fine-tuning process, and the pruning strength can be specified
by a sparsity parameter. Notably, our empirical findings demonstrate that we
can effectively prune up to 80% of the context without significant performance
degradation on downstream tasks, offering a valuable tool for mitigating
inference costs. Our reference implementation achieves a substantial increase
in inference throughput and even greater memory savings.
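
As a rough illustration of the idea (not the paper's actual mechanism, which is learned jointly with the model and controlled by a dedicated sparsity objective), one can picture a learned gate that scores cached tokens and evicts low-scoring ones from the attention context:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a learned per-token "keep" score over the
# KV cache; tokens below the threshold are dropped from future attention.
class ContextPruner(nn.Module):
    def __init__(self, d_model, keep_threshold=0.5):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)      # scores each cached token
        self.keep_threshold = keep_threshold   # plays the role of a sparsity knob

    def forward(self, kv_cache):               # kv_cache: (seq_len, d_model)
        keep_prob = torch.sigmoid(self.gate(kv_cache)).squeeze(-1)
        mask = keep_prob > self.keep_threshold
        return kv_cache[mask]                  # pruned context for later steps

pruner = ContextPruner(d_model=64)
cache = torch.randn(128, 64)
print(pruner(cache).shape)  # with an untrained gate, roughly half survive
```

Dropping entries from the cache is what converts sparsity into real savings: both the memory footprint and the per-step attention cost shrink with the retained context length.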
Cosmology from Galaxy Redshift Surveys with PointNet
In recent years, deep learning approaches have achieved state-of-the-art
results in the analysis of point cloud data. In cosmology, galaxy redshift
surveys resemble such a permutation invariant collection of positions in space.
These surveys have so far mostly been analysed with two-point statistics, such
as power spectra and correlation functions. The usage of these summary
statistics is best justified on large scales, where the density field is linear
and Gaussian. However, in light of the increased precision expected from
upcoming surveys, the analysis of -- intrinsically non-Gaussian -- small
angular separations represents an appealing avenue to better constrain
cosmological parameters. In this work, we aim to improve upon two-point
statistics by employing a PointNet-like neural network to regress the
values of the cosmological parameters directly from point cloud data. Our
implementation of PointNets can analyse inputs containing far more galaxies at
a time, improving upon earlier work for this application by roughly two orders
of magnitude. Additionally, we demonstrate the ability to analyse galaxy
redshift survey data on the lightcone, as opposed to the static simulation
boxes at a fixed redshift used in previous work.
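
The key property a PointNet-style regressor needs is permutation invariance over the galaxies, obtained by pooling a shared per-point encoder with a symmetric function. A minimal sketch with illustrative sizes (the paper's architecture and parameter set differ):

```python
import torch
import torch.nn as nn

# Shared per-point MLP + permutation-invariant max-pool + regression head
# producing cosmological parameters (e.g. Omega_m, sigma_8).
class PointNetRegressor(nn.Module):
    def __init__(self, point_dim=3, n_params=2):
        super().__init__()
        self.per_point = nn.Sequential(
            nn.Linear(point_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, n_params),
        )

    def forward(self, points):             # points: (batch, n_galaxies, 3)
        feats = self.per_point(points)     # weights shared across galaxies
        pooled = feats.max(dim=1).values   # symmetric: order of galaxies is irrelevant
        return self.head(pooled)

model = PointNetRegressor()
print(model(torch.randn(4, 10000, 3)).shape)  # torch.Size([4, 2])
```

Because the max-pool is symmetric, reordering the galaxies leaves the prediction unchanged, matching the permutation invariance of a redshift survey catalogue.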
OpenAssistant Conversations -- Democratizing Large Language Model Alignment
Aligning large language models (LLMs) with human preferences has proven to
drastically improve usability and has driven rapid adoption as demonstrated by
ChatGPT. Alignment techniques such as supervised fine-tuning (SFT) and
reinforcement learning from human feedback (RLHF) greatly reduce the required
skill and domain knowledge to effectively harness the capabilities of LLMs,
increasing their accessibility and utility across various domains. However,
state-of-the-art alignment techniques like RLHF rely on high-quality human
feedback data, which is expensive to create and often remains proprietary. In
an effort to democratize research on large-scale alignment, we release
OpenAssistant Conversations, a human-generated, human-annotated assistant-style
conversation corpus consisting of 161,443 messages in 35 different languages,
annotated with 461,292 quality ratings, resulting in over 10,000 complete and
fully annotated conversation trees. The corpus is a product of a worldwide
crowd-sourcing effort involving over 13,500 volunteers. Models trained on
OpenAssistant Conversations show consistent improvements on standard benchmarks
over respective base models. We release our code and data under a fully
permissive licence. Published in the NeurIPS 2023 Datasets and Benchmarks Track.
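
For readers who want to inspect the corpus, it is distributed via the Hugging Face Hub; the dataset id and field names below reflect our best understanding at the time of writing and should be checked against the official release.

```python
from datasets import load_dataset

# Messages carry tree ids, parent pointers, roles, and quality labels,
# from which the full conversation trees can be reconstructed.
ds = load_dataset("OpenAssistant/oasst1", split="train")
msg = ds[0]
print(msg["message_id"], msg["parent_id"], msg["role"], msg["lang"])
```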
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse
Transformers have achieved remarkable success in several domains, ranging
from natural language processing to computer vision. Nevertheless, it has been
recently shown that stacking self-attention layers - the distinctive
architectural component of Transformers - can result in rank collapse of the
tokens' representations at initialization. Whether and how rank collapse
affects training is still largely unanswered, and its investigation is
necessary for a more comprehensive understanding of this architecture. In this
work, we shed new light on the causes and the effects of this phenomenon.
First, we show that rank collapse of the tokens' representations hinders
training by causing the gradients of the queries and keys to vanish at
initialization. Furthermore, we provide a thorough description of the origin of
rank collapse and discuss how to prevent it via an appropriate depth-dependent
scaling of the residual branches. Finally, our analysis unveils that specific
architectural hyperparameters affect the gradients of queries and values
differently, leading to disproportionate gradient norms. This suggests an
explanation for the widespread use of adaptive methods for Transformers'
optimization.
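
The rank-collapse phenomenon is easy to reproduce numerically: stacking softmax self-attention layers without residual branches at random initialization drives the token matrix toward rank one. A small, self-contained sketch (illustrative dimensions; no LayerNorm or MLP blocks):

```python
import torch

# Track how close the token matrix X gets to its best rank-one
# approximation as attention-only layers are stacked at initialization.
torch.manual_seed(0)
n_tokens, d = 32, 64
X = torch.randn(n_tokens, d)

def attn(X, d):
    Wq, Wk, Wv = (torch.randn(d, d) / d**0.5 for _ in range(3))
    A = torch.softmax((X @ Wq) @ (X @ Wk).T / d**0.5, dim=-1)
    return A @ (X @ Wv)

for layer in range(12):
    X = attn(X, d)
    U, S, Vh = torch.linalg.svd(X)
    rank1 = S[0] * torch.outer(U[:, 0], Vh[0])
    res = torch.linalg.norm(X - rank1) / torch.linalg.norm(X)
    if layer % 3 == 2:
        print(f"layer {layer + 1}: relative residual {res:.3f}")  # shrinks with depth
```

The depth-dependent remedy mentioned in the abstract amounts to downscaling the residual branches as the network gets deeper, keeping the residual stream close to the identity at initialization.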