3 research outputs found
Provable Advantage of Curriculum Learning on Parity Targets with Mixed Inputs
Experimental results have shown that curriculum learning, i.e., presenting
simpler examples before more complex ones, can improve the efficiency of
learning. Some recent theoretical results also showed that changing the
sampling distribution can help neural networks learn parities, with formal
results only for large learning rates and one-step arguments. Here we show a
separation result in the number of training steps with standard (bounded)
learning rates on a common sample distribution: if the data distribution is a
mixture of sparse and dense inputs, there exists a regime in which a 2-layer
ReLU neural network trained by a curriculum noisy-GD (or SGD) algorithm that
uses sparse examples first can learn parities of sufficiently large degree,
while any fully connected neural network of possibly larger width or depth
trained by noisy-GD on the unordered samples cannot learn without additional
steps. We also provide experimental results supporting the qualitative
separation beyond the specific regime of the theoretical results.
Comment: 34 pages, 8 figures
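As a toy illustration of the setup described above (not the paper's construction), the mixed input distribution and the sparse-examples-first ordering can be sketched as follows; the mixture weight `p_sparse` and the bias `sparse_bias` are hypothetical parameters chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def parity(x, S):
    # Parity target chi_S(x) = prod_{i in S} x_i on {-1,+1}-valued inputs.
    return np.prod(x[:, S], axis=1)

def mixed_inputs(n, d, p_sparse=0.5, sparse_bias=0.9):
    # Mixture: with probability p_sparse draw a "sparse" input whose
    # coordinates are -1 only rarely (prob. 1 - sparse_bias); otherwise
    # draw uniformly from the Boolean cube {-1,+1}^d.
    is_sparse = rng.random(n) < p_sparse
    p_minus = np.where(is_sparse[:, None], 1.0 - sparse_bias, 0.5)
    x = np.where(rng.random((n, d)) < p_minus, -1, 1)
    return x, is_sparse

def curriculum_order(x, is_sparse):
    # Curriculum: stable sort so that sparse examples are presented first.
    order = np.argsort(~is_sparse, kind="stable")
    return x[order], is_sparse[order]
```

A training loop would then feed the reordered samples to noisy-GD/SGD in sequence, whereas the lower bound in the paper concerns training on the unordered mixture.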
Generalization on the Unseen, Logic Reasoning and Degree Curriculum
This paper considers the learning of logical (Boolean) functions with a focus
on the generalization on the unseen (GOTU) setting, a strong case of
out-of-distribution generalization. This is motivated by the fact that the rich
combinatorial nature of data in certain reasoning tasks (e.g.,
arithmetic/logic) makes representative data sampling challenging, and learning
successfully under GOTU gives a first vignette of an 'extrapolating' or
'reasoning' learner. We then study how different network architectures trained
by (S)GD perform under GOTU and provide both theoretical and experimental
evidence that for a class of network models including instances of
Transformers, random features models, and diagonal linear networks, a
min-degree-interpolator is learned on the unseen. We also provide evidence that
other instances with larger learning rates or mean-field networks reach leaky
min-degree solutions. These findings lead to two implications: (1) we provide
an explanation of the length generalization problem (e.g., Anil et al. 2022);
(2) we introduce a curriculum learning algorithm called Degree-Curriculum that
learns monomials more efficiently by incrementing supports.
Comment: To appear in ICML 202
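A minimal sketch of the incrementing-supports idea behind Degree-Curriculum: train in stages, where each stage admits only inputs whose Hamming weight (the size of their support on the Boolean cube) is at most the current cutoff. The cutoff schedule and the choice of weight measure here are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def degree_curriculum_batches(X, y, weight_cutoffs):
    # Stage t keeps only inputs whose Hamming weight (number of -1
    # coordinates on {-1,+1}^d, i.e. their support size) is <= cutoff t.
    weight = (X == -1).sum(axis=1)
    for w in weight_cutoffs:
        mask = weight <= w
        yield X[mask], y[mask]

# usage sketch: successive stages see progressively larger supports
X = np.array([[1, 1, 1], [-1, 1, 1], [-1, -1, 1], [-1, -1, -1]])
y = np.prod(X, axis=1)  # full parity target
stages = list(degree_curriculum_batches(X, y, [0, 1, 2, 3]))
```

Each stage's batch would be used for some number of (S)GD steps before the cutoff is incremented, so that low-degree structure is fitted before higher-degree monomials.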
Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures
This paper considers the Pointer Value Retrieval (PVR) benchmark introduced
in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce
the label. More generally, the paper considers the learning of logical
functions with gradient descent (GD) on neural networks. It is first shown that
when learning logical functions with GD on symmetric neural networks, the
generalization error can be lower-bounded in terms of the
noise-stability of the target function, supporting a conjecture made in
[ZRKB21]. It is then shown that in the distribution shift setting, when the
data withholding corresponds to freezing a single feature (referred to as
canonical holdout), the generalization error of gradient descent admits a tight
characterization in terms of the Boolean influence for several relevant
architectures. This is shown on linear models and supported experimentally on
other models such as MLPs and Transformers. In particular, this puts forward
the hypothesis that for such architectures and for learning logical functions
such as PVR functions, GD tends to have an implicit bias towards low-degree
representations, which in turn gives the Boolean influence for the
generalization error under quadratic loss.
Comment: 28 pages, 8 figures
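The Boolean influence used in the characterization above admits a simple Monte Carlo estimator: Inf_i(f) is the probability, over a uniform input, that flipping coordinate i flips the output. The target function and dimensions below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def boolean_influence(f, d, i, n_samples=20000):
    # Monte Carlo estimate of Inf_i(f) = Pr_x[f(x) != f(x^(i))], where x is
    # uniform on {-1,+1}^d and x^(i) is x with coordinate i flipped.
    x = rng.choice([-1, 1], size=(n_samples, d))
    x_flip = x.copy()
    x_flip[:, i] *= -1
    return float(np.mean(f(x) != f(x_flip)))

# example: parity over coordinates {0, 1, 2} in dimension d = 5.
# Flipping any relevant coordinate always flips a parity, so its influence
# is exactly 1; an irrelevant coordinate has influence exactly 0.
f = lambda x: np.prod(x[:, :3], axis=1)
```

Under the canonical-holdout setting described above, the paper relates the generalization error of GD on the frozen feature to precisely this quantity.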