The implicit bias towards solutions with favorable properties is believed to
be a key reason why neural networks trained by gradient-based optimization can
generalize well. While the implicit bias of gradient flow has been widely
studied for homogeneous neural networks (including ReLU and leaky ReLU
networks), the implicit bias of gradient descent is currently only understood
for smooth neural networks. Therefore, the implicit bias of gradient descent
for non-smooth neural networks remains an open question. In this paper,
we aim to answer this question by studying the implicit bias of gradient
descent for training two-layer fully connected (leaky) ReLU neural networks. We
show that when the training data are nearly orthogonal, for the leaky ReLU
activation function, gradient descent finds a network whose stable rank
converges to 1, whereas for the ReLU activation function, gradient descent
finds a network whose stable rank is bounded above by a constant.
Additionally, we show that gradient descent finds a neural
network such that all the training data points have the same normalized margin
asymptotically. Experiments on both synthetic and real data back up our
theoretical findings.
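
To make the stable-rank and normalized-margin quantities concrete, here is a minimal, self-contained sketch (not the paper's experimental setup): it trains the first layer of a two-layer leaky ReLU network with gradient descent on nearly-orthogonal Gaussian data, keeping the second layer fixed at random signs, and tracks the stable rank srank(W) = ||W||_F^2 / ||W||_2^2 along with the normalized margins y_i f(x_i) / ||W||_F (a sensible normalization here because, with the second layer fixed, the network is 1-homogeneous in W). All hyperparameters and the fixed-second-layer choice are illustrative assumptions.

```python
import numpy as np

def stable_rank(W):
    # srank(W) = ||W||_F^2 / ||W||_2^2, computed from singular values.
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)

rng = np.random.default_rng(0)
n, d, m, alpha, lr = 20, 500, 50, 0.1, 0.1  # d >> n => nearly-orthogonal rows

X = rng.normal(size=(n, d)) / np.sqrt(d)          # nearly-orthogonal inputs
y = rng.choice([-1.0, 1.0], size=n)               # binary labels
W = 0.01 * rng.normal(size=(m, d))                # trained first-layer weights
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed second layer (assumption)

for step in range(20001):
    Z = X @ W.T                                   # pre-activations, shape (n, m)
    H = np.where(Z > 0, Z, alpha * Z)             # leaky ReLU
    f = H @ a                                     # network outputs, shape (n,)
    # Logistic loss L = mean log(1 + exp(-y f)); dL/df via a stable sigmoid,
    # using 1 / (1 + exp(z)) = 0.5 * (1 - tanh(z / 2)).
    g = -y * 0.5 * (1.0 - np.tanh(y * f / 2.0)) / n
    dZ = np.outer(g, a) * np.where(Z > 0, 1.0, alpha)
    W -= lr * dZ.T @ X                            # gradient descent step
    if step % 5000 == 0:
        margins = y * f / np.linalg.norm(W)       # normalized margins
        print(step, stable_rank(W), margins.min(), margins.max())
```

In line with the theory, one would expect the printed stable rank to drift toward 1 and the gap between the smallest and largest normalized margins to shrink as training proceeds; setting alpha = 0 recovers the ReLU case, where the stable rank is only expected to stay bounded.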