Feature learning, i.e., extracting meaningful representations of data, is
quintessential to the practical success of neural networks trained with
gradient descent, yet it is notoriously difficult to explain how and why it
occurs. Recent theoretical studies have shown that shallow neural networks
optimized on a single task with gradient-based methods can learn meaningful
features, extending our understanding beyond the neural tangent kernel or
random feature regime in which negligible feature learning occurs. But in
practice, neural networks are increasingly often trained on {\em many} tasks
simultaneously with differing loss functions, and these prior analyses do not
generalize to such settings. In the multi-task learning setting, a variety of
studies have shown effective feature learning by simple linear models. However,
multi-task learning via {\em nonlinear} models, arguably the most common
learning paradigm in practice, remains largely mysterious. In this work, we
present the first results proving that feature learning occurs in a multi-task
setting with a nonlinear model. We show that when the tasks are binary
classification problems with labels depending on only $r$ directions within the
ambient $d$-dimensional input space, where $d \gg r$, executing a simple
gradient-based multi-task learning algorithm on a two-layer ReLU neural network
learns the ground-truth $r$ directions. In particular, any downstream task on
the $r$ ground-truth coordinates can be solved by learning a linear classifier
with sample and neuron complexity independent of the ambient dimension $d$,
whereas a random feature model requires complexity exponential in $d$ for such a
guarantee.
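As one illustrative formalization of the shared structure described above (the notation $W^\star$ and $g_t$ is ours, not necessarily the paper's), each task $t$ can be viewed as labeling an input $x \in \mathbb{R}^d$ via
\[
  y_t(x) = \mathrm{sign}\big(g_t(W^\star x)\big), \qquad W^\star \in \mathbb{R}^{r \times d}, \quad r \ll d,
\]
so that every task depends on the same $r$-dimensional subspace spanned by the rows of $W^\star$; the stated guarantee is that gradient-based multi-task training of the two-layer ReLU network recovers these $r$ directions, after which any downstream task of this form reduces to fitting a linear classifier on the learned features.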