A Deep Conditioning Treatment of Neural Networks
We study the role of depth in training randomly initialized overparameterized
neural networks. We give a general result showing that depth improves
trainability of neural networks by improving the conditioning of certain kernel
matrices of the input data. This result holds for arbitrary non-linear
activation functions under a certain normalization. We provide versions of the
result that hold for training just the top layer of the neural network, as well
as for training all layers, via the neural tangent kernel. As applications of
these general results, we provide a generalization of the results of Das et al.
(2019) showing that learnability of deep random neural networks with a large
class of non-linear activations degrades exponentially with depth. We also show
how benign overfitting can occur in deep neural networks via the results of
Bartlett et al. (2019b). Finally, we give experimental evidence that normalized
versions of ReLU are a viable alternative to more complex operations like Batch
Normalization in training deep neural networks.
Comment: In proceedings of ALT 2021
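
As a rough illustration of the quantity this abstract studies, the sketch below
builds a random deep network with a normalized ReLU (scaled by sqrt(2) so a
standard Gaussian pre-activation has unit second moment, one common choice of
normalization; the paper's exact normalization may differ) and tracks the
condition number of the empirical kernel matrix of the features at each depth.
All names, scalings, and the experimental setup here are our assumptions, not
the paper's.

```python
import numpy as np

def normalized_relu(x):
    # ReLU scaled by sqrt(2), so that E[phi(g)^2] = 1 for g ~ N(0, 1).
    # One common normalization; the paper's exact choice may differ.
    return np.sqrt(2.0) * np.maximum(x, 0.0)

def kernel_condition_by_depth(X, width=2048, depth=10, seed=0):
    # Push the data through a random network layer by layer and record the
    # condition number of the empirical kernel matrix of the features at
    # each depth. Purely illustrative of the quantity the abstract studies.
    rng = np.random.default_rng(seed)
    H = X  # n x d matrix, one input per row
    conds = []
    for _ in range(depth):
        W = rng.normal(0.0, 1.0 / np.sqrt(H.shape[1]), size=(H.shape[1], width))
        H = normalized_relu(H @ W)
        K = (H @ H.T) / width  # empirical kernel matrix at this depth
        conds.append(np.linalg.cond(K))
    return conds

# Toy usage: a handful of unit-norm inputs.
X = np.random.default_rng(1).normal(size=(8, 32))
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(kernel_condition_by_depth(X))
```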
Hardness of Learning Neural Networks with Natural Weights
Neural networks are highly successful in practice despite strong theoretical
hardness results. The existing hardness results focus on the network
architecture and assume that the network's weights are arbitrary. A natural
approach to settling this discrepancy is to assume that the network's weights
are "well-behaved" and possess some generic properties that may allow
efficient learning. This approach
is supported by the intuition that the weights in real-world networks are not
arbitrary, but exhibit some "random-like" properties with respect to some
"natural" distributions. We prove negative results in this regard, and show
that for depth-2 networks and many "natural" weight distributions, such as
the normal and the uniform distribution, most networks are hard to learn.
Namely, there is no efficient learning algorithm that is provably successful
for most weights and every input distribution. This implies that there is no
generic property that holds with high probability in such random networks and
allows efficient learning.
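
To make the setting concrete, the sketch below samples a depth-2 ReLU network
whose weights are drawn i.i.d. from one of the "natural" distributions the
abstract mentions (normal or uniform). The architecture, names, and scaling
are our illustrative assumptions rather than the paper's parameterization; the
hardness result says that no efficient algorithm can learn most such networks
under every input distribution.

```python
import numpy as np

def sample_depth2_network(d, k, dist="normal", seed=0):
    # Draw a random depth-2 ReLU network x -> v . relu(W x) with i.i.d.
    # weights from a "natural" distribution. Illustrative assumptions only;
    # the paper's precise parameterization may differ.
    rng = np.random.default_rng(seed)
    if dist == "normal":
        W, v = rng.normal(size=(k, d)), rng.normal(size=k)
    elif dist == "uniform":
        W, v = rng.uniform(-1, 1, size=(k, d)), rng.uniform(-1, 1, size=k)
    else:
        raise ValueError(f"unknown distribution: {dist}")
    return lambda x: v @ np.maximum(W @ x, 0.0)

# A "most networks" statement quantifies over random draws like this one;
# the claim is that no efficient learner succeeds on most of them across
# all input distributions.
f = sample_depth2_network(d=16, k=8, dist="uniform", seed=3)
print(f(np.ones(16)))
```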