Sorting out Lipschitz function approximation
Training neural networks under a strict Lipschitz constraint is useful for
provable adversarial robustness, generalization bounds, interpretable
gradients, and Wasserstein distance estimation. By the composition property of
Lipschitz functions, it suffices to ensure that each individual affine
transformation or nonlinear activation is 1-Lipschitz. The challenge is to do
this while maintaining the expressive power. We identify a necessary property
for such an architecture: each of the layers must preserve the gradient norm
during backpropagation. Based on this, we propose to combine a gradient norm
preserving activation function, GroupSort, with norm-constrained weight
matrices. We show that norm-constrained GroupSort architectures are universal
Lipschitz function approximators. Empirically, we show that norm-constrained
GroupSort networks achieve tighter estimates of Wasserstein distance than their
ReLU counterparts and can achieve provable adversarial robustness guarantees
with little cost to accuracy.
Comment: 8 main pages, 21 pages total, 17 figures. Accepted at ICML 2019.
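For intuition on why GroupSort is gradient norm preserving, here is a minimal PyTorch sketch (our illustration, not the authors' code): sorting within groups is an input-dependent permutation, so the Jacobian is a permutation matrix.

```python
import torch

def group_sort(x: torch.Tensor, group_size: int = 2) -> torch.Tensor:
    """Sort activations within contiguous groups along the last dim.

    Sorting only permutes coordinates, so the Jacobian is a permutation
    matrix and the gradient norm is preserved exactly.
    """
    n = x.shape[-1]
    assert n % group_size == 0, "feature dim must divide into groups"
    groups = x.reshape(*x.shape[:-1], n // group_size, group_size)
    return groups.sort(dim=-1).values.reshape(x.shape)

x = torch.randn(4, 8, requires_grad=True)
y = group_sort(x, group_size=2)  # group_size=2 is the MaxMin activation
```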
Provable Lipschitz Certification for Generative Models
We present a scalable technique for upper bounding the Lipschitz constant of
generative models. We relate this quantity to the maximal norm over the set of
attainable vector-Jacobian products of a given generative model. We approximate
this set by layerwise convex approximations using zonotopes. Our approach
generalizes and improves upon prior work using zonotope transformers, and extends to Lipschitz estimation of neural networks with large output dimensions.
This provides efficient and tight bounds on small networks and scales to generative models built on VAE and DCGAN architectures.
Comment: Accepted at ICML 2021.
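As a toy illustration of the layerwise zonotope abstraction this line of work builds on (a simplified stand-in for the paper's method, using the standard single-generator ReLU relaxation):

```python
import numpy as np

class Zonotope:
    """Set {c + G @ eps : eps in [-1, 1]^k}."""

    def __init__(self, c, G):
        self.c, self.G = c, G  # c: (n,), G: (n, k)

    def bounds(self):
        r = np.abs(self.G).sum(axis=1)
        return self.c - r, self.c + r

    def affine(self, W, b):
        # Affine maps act exactly on zonotopes.
        return Zonotope(W @ self.c + b, W @ self.G)

    def relu(self):
        # Standard convex ReLU relaxation: one fresh generator per
        # neuron whose pre-activation interval [l, u] crosses zero.
        l, u = self.bounds()
        denom = np.where(u - l > 0, u - l, 1.0)
        lam = np.where(u <= 0, 0.0, np.where(l >= 0, 1.0, u / denom))
        mu = np.where((l < 0) & (u > 0), -lam * l / 2.0, 0.0)
        return Zonotope(lam * self.c + mu,
                        np.hstack([lam[:, None] * self.G, np.diag(mu)]))

z = Zonotope(np.zeros(3), 0.1 * np.eye(3))  # small box around the origin
z = z.affine(np.random.randn(4, 3), np.zeros(4)).relu()
print(z.bounds())
```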
Preventing Gradient Attenuation in Lipschitz Constrained Convolutional Networks
Lipschitz constraints under L2 norm on deep neural networks are useful for
provable adversarial robustness bounds, stable training, and Wasserstein
distance estimation. While heuristic approaches such as the gradient penalty
have seen much practical success, it is challenging to achieve similar
practical performance while provably enforcing a Lipschitz constraint. In
principle, one can design Lipschitz constrained architectures using the
composition property of Lipschitz functions, but Anil et al. recently
identified a key obstacle to this approach: gradient norm attenuation. They
showed how to circumvent this problem in the case of fully connected networks
by designing each layer to be gradient norm preserving. We extend their
approach to train scalable, expressive, provably Lipschitz convolutional
networks. In particular, we present the Block Convolution Orthogonal
Parameterization (BCOP), an expressive parameterization of orthogonal
convolution operations. We show that even though the space of orthogonal
convolutions is disconnected, the largest connected component of BCOP with 2n
channels can represent arbitrary BCOP convolutions over n channels. Our BCOP
parameterization allows us to train large convolutional networks with provable
Lipschitz bounds. Empirically, we find that it is competitive with existing
approaches to provable adversarial robustness and Wasserstein distance
estimation.
Comment: 9 main pages, 31 pages total, 3 figures. Accepted at the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019).
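A full BCOP implementation is involved; as a flavor of the gradient-norm-preserving toolbox this line of work uses, here is a minimal sketch of Björck orthonormalization for keeping a weight matrix approximately orthogonal (a sketch under that assumption, not the BCOP algorithm itself):

```python
import torch

def bjorck_orthonormalize(W: torch.Tensor, iters: int = 25) -> torch.Tensor:
    """Push W toward a nearby orthonormal matrix via the iteration
    W <- W (I + 0.5 (I - W^T W)).

    Convergence needs singular values in (0, sqrt(3)), so W is first
    rescaled by its spectral norm.
    """
    W = W / torch.linalg.matrix_norm(W, ord=2)
    I = torch.eye(W.shape[1], dtype=W.dtype, device=W.device)
    for _ in range(iters):
        W = W @ (I + 0.5 * (I - W.T @ W))
    return W

W = bjorck_orthonormalize(torch.randn(64, 64))
print(torch.dist(W.T @ W, torch.eye(64)))  # ~0
```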
Lipschitz regularity of deep neural networks: analysis and efficient estimation
Deep neural networks are notorious for being sensitive to small well-chosen
perturbations, and estimating the regularity of such architectures is of utmost
importance for safe and robust practical applications. In this paper, we
investigate one of the key characteristics to assess the regularity of such
methods: the Lipschitz constant of deep learning architectures. First, we show
that, even for two-layer neural networks, the exact computation of this quantity is NP-hard and that state-of-the-art methods may significantly overestimate it.
Then, we both extend and improve previous estimation methods by providing
AutoLip, the first generic algorithm for upper bounding the Lipschitz constant
of any automatically differentiable function. We provide a power method
algorithm working with automatic differentiation, allowing efficient
computations even on large convolutions. Second, for sequential neural
networks, we propose an improved algorithm named SeqLip that takes advantage of
the linear computation graph to split the computation per pair of consecutive
layers. Third, we propose heuristics on top of SeqLip to tackle very large
networks. Our experiments show that SeqLip can significantly improve on the
existing upper bounds. Finally, we provide an implementation of AutoLip in the
PyTorch environment that may be used to better estimate the robustness of a
given neural network to small perturbations or regularize it using more precise
Lipschitz estimates.
Comment: 12 pages, 3 figures.
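The power-method-with-autodiff idea can be sketched in a few lines of PyTorch (a simplified illustration, not the released AutoLip code; it assumes a bias-free, purely linear layer): for linear L, the gradient of <L(x), u> with respect to x is L^T u, so a forward and a backward pass give the two half-steps of a singular-value power iteration.

```python
import torch

def operator_norm(layer, in_shape, iters=100):
    """Estimate the spectral norm of a linear layer via power iteration,
    with autodiff supplying the transpose operator."""
    x = torch.randn(in_shape)
    sigma = 0.0
    for _ in range(iters):
        x = (x / x.norm()).detach().requires_grad_(True)
        out = layer(x)                                   # forward: L x
        u = (out / out.norm()).detach()
        (g,) = torch.autograd.grad((out * u).sum(), x)   # backward: L^T u
        sigma, x = g.norm().item(), g
    return sigma

conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False)
print(operator_norm(conv, (1, 3, 32, 32)))
```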
Invertible Residual Networks
We show that standard ResNet architectures can be made invertible, allowing
the same model to be used for classification, density estimation, and
generation. Typically, enforcing invertibility requires partitioning dimensions
or restricting network architectures. In contrast, our approach only requires
adding a simple normalization step during training, already available in
standard frameworks. Invertible ResNets define a generative model which can be
trained by maximum likelihood on unlabeled data. To compute likelihoods, we
introduce a tractable approximation to the Jacobian log-determinant of a
residual block. Our empirical evaluation shows that invertible ResNets perform
competitively with both state-of-the-art image classifiers and flow-based
generative models, something that has not been previously achieved with a
single architecture.
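For intuition on the invertibility claim, a minimal sketch (assuming the residual branch g has been constrained, e.g. via spectral normalization, to Lip(g) < 1): the Banach fixed-point theorem then guarantees y = x + g(x) can be inverted by iteration.

```python
import torch

def invert_residual(g, y, iters=100):
    """Invert y = x + g(x) via the fixed-point iteration x <- y - g(x),
    a contraction whenever Lip(g) < 1."""
    x = y.clone()
    for _ in range(iters):
        x = y - g(x)
    return x

# Toy residual branch: spectral norm bounds ||W||_2 by ~1, and the
# 0.9 factor makes the branch a strict contraction.
lin = torch.nn.utils.spectral_norm(torch.nn.Linear(8, 8))
g = lambda t: 0.9 * lin(t)
y = torch.randn(4, 8)
with torch.no_grad():
    x = invert_residual(g, y)
    print((x + g(x) - y).abs().max())  # ~0
```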
Generalization bounds for deep convolutional neural networks
We prove bounds on the generalization error of convolutional networks. The
bounds are in terms of the training loss, the number of parameters, the
Lipschitz constant of the loss and the distance from the weights to the initial
weights. They are independent of the number of pixels in the input, and the
height and width of hidden feature maps. We present experiments using CIFAR-10
with varying hyperparameters of a deep convolutional network, comparing our
bounds with practical generalization gaps.
Comment: Published as a conference paper at ICLR 2020.
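Since the bound is driven by the distance from the initial weights, one might track that quantity during training; a hypothetical helper (our illustration, not from the paper):

```python
import torch

def distance_from_init(model, init_state):
    """Frobenius distance between current weights and a snapshot of the
    initialization (init_state = {name: tensor.clone()})."""
    sq = sum((p - init_state[n]).pow(2).sum()
             for n, p in model.named_parameters())
    return sq.sqrt()

model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 10))
init = {n: p.detach().clone() for n, p in model.named_parameters()}
# ... train the model ...
print(distance_from_init(model, init))
```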
On Breiman's Dilemma in Neural Networks: Phase Transitions of Margin Dynamics
Enlarging the margin over training data has been an important strategy in machine learning since the perceptron, aimed at boosting the robustness of classifiers and thereby their generalization ability. Yet Breiman (1999) exhibited a dilemma: a uniform improvement of the margin distribution \emph{does not} necessarily reduce generalization error. In this paper, we revisit
Breiman's dilemma in deep neural networks with recently proposed spectrally
normalized margins. A novel perspective is provided to explain Breiman's
dilemma based on phase transitions in dynamics of normalized margin
distributions, which reflect the trade-off between the expressive power of the model and the complexity of the data. When data complexity is comparable to model expressiveness, in the sense that training and test data share similar phase transitions in their normalized margin dynamics, two efficient ways are derived to predict the trend of the generalization (test) error via classic margin-based generalization bounds with restricted Rademacher complexities. On the other hand, over-expressive models that exhibit a uniform improvement of training margins, with a phase transition distinct from that of the test margin dynamics, may lose this predictive power and fail to prevent overfitting. Experiments are
conducted to show the validity of the proposed method with some basic
convolutional networks, AlexNet, VGG-16, and ResNet-18, on several datasets
including CIFAR-10/100 and mini-ImageNet.
Comment: 34 pages.
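For concreteness, a sketch of a spectrally normalized margin for a fully connected network (our simplified reading: the classification margin divided by a product-of-spectral-norms Lipschitz surrogate; convolutional layers would need their own operator-norm estimates):

```python
import torch

def spectrally_normalized_margins(logits, labels, weights):
    """Margins rescaled by prod_l ||W_l||_2, a simple surrogate for the
    network's Lipschitz constant."""
    correct = logits.gather(1, labels[:, None]).squeeze(1)
    others = logits.scatter(1, labels[:, None], float("-inf"))
    margin = correct - others.max(dim=1).values
    lip = torch.prod(torch.stack(
        [torch.linalg.matrix_norm(W, ord=2) for W in weights]))
    return margin / lip

net = torch.nn.Sequential(torch.nn.Linear(20, 50), torch.nn.ReLU(),
                          torch.nn.Linear(50, 10))
x, y = torch.randn(16, 20), torch.randint(0, 10, (16,))
ws = [m.weight for m in net if isinstance(m, torch.nn.Linear)]
print(spectrally_normalized_margins(net(x), y, ws))
```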
Why gradient clipping accelerates training: A theoretical justification for adaptivity
We provide a theoretical explanation for the effectiveness of gradient
clipping in training deep neural networks. The key ingredient is a new
smoothness condition derived from practical neural network training examples.
We observe that gradient smoothness, a quantity central to the analysis of first-order optimization algorithms and often assumed to be constant, in fact varies significantly along the training trajectory of deep
neural networks. Further, this smoothness positively correlates with the
gradient norm, and contrary to standard assumptions in the literature, it can
grow with the norm of the gradient. These empirical observations limit the
applicability of existing theoretical analyses of algorithms that rely on a
fixed bound on smoothness. These observations motivate us to introduce a novel
relaxation of gradient smoothness that is weaker than the commonly used
Lipschitz smoothness assumption. Under the new condition, we prove that two
popular methods, namely, \emph{gradient clipping} and \emph{normalized
gradient}, converge arbitrarily faster than gradient descent with fixed
stepsize. We further explain why such adaptively scaled gradient methods can accelerate convergence in practice, and we verify our results empirically in popular neural network training settings.
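In practice, clipping rescales the gradient to norm at most gamma, making the effective step size lr * min(1, gamma / ||g||), i.e. adaptive to the local gradient norm; a minimal training-loop sketch using PyTorch's built-in utility:

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(256, 10), torch.randn(256, 1)

for step in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Rescale the gradient to norm <= 1.0 before the SGD step, so the
    # update length is lr * min(||g||, 1.0).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
```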
Bounds for Vector-Valued Function Estimation
We present a framework to derive risk bounds for vector-valued learning with
a broad class of feature maps and loss functions. Multi-task learning and
one-vs-all multi-category learning are treated as examples. We discuss in
detail vector-valued functions with one hidden layer, and demonstrate that the
conditions under which shared representations are beneficial for multi-task learning are equally applicable to multi-category learning.
A Priori Estimates of the Population Risk for Two-layer Neural Networks
New estimates for the population risk are established for two-layer neural
networks. These estimates are nearly optimal in the sense that the error rates
scale in the same way as the Monte Carlo error rates. They are equally
effective in the over-parametrized regime when the network size is much larger
than the size of the dataset. These new estimates are a priori in nature in the
sense that the bounds depend only on some norms of the underlying functions to
be fitted, not the parameters in the model, in contrast with most existing
results which are a posteriori in nature. Using these a priori estimates, we
provide a perspective for understanding why two-layer neural networks perform
better than the related kernel methods.
Comment: Published version.