Anonymous Learning via Look-Alike Clustering: A Precise Analysis of Model Generalization
While personalized recommendation systems have become increasingly popular,
ensuring user data protection remains a top concern in the development of these
learning systems. A common approach to enhancing privacy involves training
models using anonymous data rather than individual data. In this paper, we
explore a natural technique called \emph{look-alike clustering}, which involves
replacing sensitive features of individuals with the cluster's average values.
We provide a precise analysis of how training models using anonymous cluster
centers affects their generalization capabilities. We focus on an asymptotic
regime where the size of the training set grows in proportion to the features
dimension. Our analysis is based on the Convex Gaussian Minimax Theorem (CGMT)
and allows us to theoretically understand the role of different model
components on the generalization error. In addition, we demonstrate that in
certain high-dimensional regimes, training over anonymous cluster centers acts
as a regularization and improves generalization error of the trained models.
Finally, we corroborate our asymptotic theory with finite-sample numerical
experiments, where we observe a perfect match when the sample size is only on
the order of a few hundred.
Comment: accepted at the Conference on Neural Information Processing Systems (NeurIPS 2023).
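The look-alike clustering idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact model: here all feature columns are treated as sensitive, k-means plays the role of the clustering, and plain least squares stands in for the trained model.

```python
# Sketch: anonymize features by replacing each row with its k-means cluster
# average, then train on the anonymized data (look-alike clustering).
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 10
X = rng.normal(size=(n, d))            # sensitive individual features
beta = rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

def kmeans(X, k, iters=50):
    # plain Lloyd's algorithm
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers, labels

centers, labels = kmeans(X, k=20)
X_anon = centers[labels]               # each individual -> its cluster center

# least-squares fit on raw vs anonymized features
w_raw = np.linalg.lstsq(X, y, rcond=None)[0]
w_anon = np.linalg.lstsq(X_anon, y, rcond=None)[0]

X_test = rng.normal(size=(200, d))
y_test = X_test @ beta
err_raw = np.mean((X_test @ w_raw - y_test) ** 2)
err_anon = np.mean((X_test @ w_anon - y_test) ** 2)
print(f"test MSE raw: {err_raw:.3f}, anonymized: {err_anon:.3f}")
```

The paper's asymptotic analysis (via the CGMT) characterizes exactly when training on the cluster centers, as done for `w_anon` here, acts as a regularizer.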
Shallow Univariate ReLU Networks as Splines: Initialization, Loss Surface, Hessian, and Gradient Flow Dynamics
Understanding the learning dynamics and inductive bias of neural networks (NNs) is hindered by the opacity of the relationship between NN parameters and the function they represent. In part, this is due to symmetries inherent in the NN parameterization, which allow many different parameter settings to produce an identical output function, both obscuring that relationship and leaving redundant degrees of freedom. The NN parameterization is invariant under two symmetries: permutation of the neurons and a continuous family of transformations of the scale of weight and bias parameters. We propose taking a quotient with respect to the second symmetry group and reparametrizing ReLU NNs as continuous piecewise-linear splines. Using this spline lens, we study learning dynamics in shallow univariate ReLU NNs, finding unexpected insights and explanations for several perplexing phenomena. We develop a surprisingly simple and transparent view of the structure of the loss surface, including its critical and fixed points, Hessian, and Hessian spectrum. We also show that standard weight initializations yield very flat initial functions, and that this flatness, together with overparametrization and the initial weight scale, is responsible for the strength and type of implicit regularization, consistent with previous work. Our implicit regularization results are complementary to recent work showing that initialization scale critically controls implicit regularization via a kernel-based argument. Overall, removing the weight-scale symmetry enables us to prove these results more simply, to prove new results, and to gain new insights, while offering a far more transparent and intuitive picture. Looking forward, our quotiented spline-based approach will extend naturally to the multivariate and deep settings, and alongside the kernel-based view, we believe it will play a foundational role in efforts to understand neural networks.
Videos of learning dynamics using a spline-based visualization are available at http://shorturl.at/tFWZ2
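The spline view can be made concrete with a small check. The sketch below (variable names are ours, not the paper's notation) writes a shallow univariate ReLU net f(x) = c + Σ_i v_i·relu(w_i·x + b_i) as a piecewise-linear spline with breakpoints at x_i = -b_i/w_i, and verifies that on each interval between breakpoints the network's slope equals the sum of v_i·w_i over active neurons.

```python
# A shallow univariate ReLU net is a continuous piecewise-linear spline:
# breakpoints sit at -b_i / w_i, and the slope on each interval is the sum
# of v_i * w_i over the neurons active there.
import numpy as np

rng = np.random.default_rng(1)
width = 8
w = rng.normal(size=width)   # input weights (assumed nonzero)
b = rng.normal(size=width)   # biases
v = rng.normal(size=width)   # output weights
c = 0.5                      # output bias

def net(x):
    return np.maximum(np.outer(x, w) + b, 0.0) @ v + c

def spline_slope(x):
    active = w * x + b > 0           # neurons active at x
    return np.sum(v[active] * w[active])

breaks = np.sort(-b / w)
ok = True
for lo, hi in zip(breaks[:-1], breaks[1:]):
    if hi - lo < 1e-2:               # skip intervals too small to sample safely
        continue
    xs = np.linspace(lo + 1e-3, hi - 1e-3, 5)
    ys = net(xs)
    num_slope = (ys[-1] - ys[0]) / (xs[-1] - xs[0])
    ok &= abs(num_slope - spline_slope(xs.mean())) < 1e-6
print("piecewise-linear check passed:", bool(ok))
```

Taking the quotient by the scale symmetry amounts to tracking only these breakpoints and slope changes, which are invariant under rescaling (w_i, b_i, v_i) → (λw_i, λb_i, v_i/λ).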
Geometric compression of invariant manifolds in neural nets
We study how neural networks compress uninformative input space in models
where data lie in $d$ dimensions, but whose labels vary only within a linear
manifold of dimension $d_\parallel < d$. We show that for a one-hidden-layer
network initialized with infinitesimal weights (i.e. in the \textit{feature
learning} regime) trained with gradient descent, the uninformative
$d_\perp = d - d_\parallel$ space is compressed by a factor $\lambda \sim \sqrt{p}$,
where $p$ is the size of the training set. We quantify the benefit of such a
compression on the test error $\epsilon$. For large initialization of the
weights (the \textit{lazy training} regime), no compression occurs, and for
regular boundaries separating labels we find that $\epsilon$ decays as a power
law in $p$ with an exponent $\beta_{\rm Lazy}$ that depends on the dimension $d$.
Compression improves the learning curves, raising the exponent to
$\beta_{\rm Feature} > \beta_{\rm Lazy}$, with distinct values for
$d_\parallel = 1$ and $d_\parallel > 1$. We test
these predictions for a stripe model where boundaries are parallel interfaces
($d_\parallel = 1$) as well as for a cylindrical boundary ($d_\parallel = 2$). Next
we show that compression shapes the Neural Tangent Kernel (NTK) evolution in
time, so that its top eigenvectors become more informative and display a larger
projection on the labels. Consequently, kernel learning with the frozen NTK at
the end of training outperforms the initial NTK. We confirm these predictions
both for a one-hidden-layer FC network trained on the stripe model and for a
16-layer CNN trained on MNIST, for which we also find
$\beta_{\rm Feature} > \beta_{\rm Lazy}$. The great similarities found in these
two cases support the view that compression is central to the training of MNIST, and
put forward kernel PCA on the evolving NTK as a useful diagnostic of
compression in deep nets.
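The compression effect can be observed in a toy version of the stripe model. The sketch below (our simplification, not the paper's exact setup: a single interface, labels y = sign(x_1), plain gradient descent on half-MSE) trains a one-hidden-layer net from small initialization and compares first-layer weight magnitudes along the informative coordinate versus the uninformative ones.

```python
# Toy stripe model: labels depend only on coordinate 1. With small (feature-
# learning) initialization, first-layer weights should grow preferentially
# along the informative direction, compressing the uninformative ones.
import numpy as np

rng = np.random.default_rng(2)
n, d, h = 200, 10, 50
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])                     # label varies only along coordinate 1

scale = 1e-2                             # small init -> feature-learning regime
W = scale * rng.normal(size=(h, d))
a = scale * rng.normal(size=h)
lr = 0.02

def loss(W, a):
    return np.mean((np.maximum(X @ W.T, 0.0) @ a - y) ** 2)

init_loss = loss(W, a)
for _ in range(5000):
    pre = X @ W.T                        # (n, h) preactivations
    act = np.maximum(pre, 0.0)
    r = act @ a - y                      # residuals
    # gradient steps (constant factors absorbed into the learning rate)
    grad_a = act.T @ r / n
    grad_W = a[:, None] * (((pre > 0) * r[:, None]).T @ X) / n
    a -= lr * grad_a
    W -= lr * grad_W
final_loss = loss(W, a)

# compression diagnostic: informative vs uninformative weight magnitudes
ratio = np.mean(np.abs(W[:, 0])) / np.mean(np.abs(W[:, 1:]))
print(f"loss {init_loss:.3f} -> {final_loss:.3f}, |W_par|/|W_perp| = {ratio:.1f}")
```

A ratio well above 1 is the finite-size analogue of the compression factor $\lambda$ in the abstract; repeating the run with a large initialization scale (the lazy regime) should leave the ratio near 1.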
Local Convolutions Cause an Implicit Bias towards High Frequency Adversarial Examples
Adversarial attacks are still a significant challenge for neural networks.
Recent work has shown that adversarial perturbations typically contain
high-frequency features, but the root cause of this phenomenon remains unknown.
Inspired by theoretical work on linear full-width convolutional models, we
hypothesize that the local (i.e. bounded-width) convolutional operations
commonly used in current neural networks are implicitly biased to learn high
frequency features, and that this is one of the root causes of high frequency
adversarial examples. To test this hypothesis, we analyzed the impact of
different choices of linear and nonlinear architectures on the implicit bias of
the learned features and the adversarial perturbations, in both spatial and
frequency domains. We find that the high-frequency adversarial perturbations
are critically dependent on the convolution operation because the
spatially-limited nature of local convolutions induces an implicit bias towards
high frequency features. The explanation for the latter involves the Fourier
Uncertainty Principle: a spatially-limited (local in the space domain) filter
cannot also be frequency-limited (local in the frequency domain). Furthermore,
using larger convolution kernel sizes or avoiding convolutions altogether (e.g. by
using a Vision Transformer architecture) significantly reduces this high-frequency
bias, but not the overall susceptibility to attacks. Looking forward, our work
strongly suggests that understanding and controlling the implicit bias of
architectures will be essential for achieving adversarial robustness.
Comment: 20 pages, 11 figures, 12 tables
Evaluating the Convergence Limit of Quantum Neural Tangent Kernel
Quantum variational algorithms have been one of the major applications of quantum
computing on current quantum devices. There have been recent attempts to establish
a theoretical foundation for these algorithms. A possible approach is to characterize the
training dynamics with a quantum neural tangent kernel. In this work, we
construct the kernel for two models, Quantum Ensemble and Quantum Neural
Network, and show the convergence of these models in the limit of infinitely
many qubits. We also show applications of the kernel limit in regression tasks.
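The neural tangent kernel of any parametrized model f(x; θ) is K(x, x') = ∇_θ f(x)·∇_θ f(x'); for a quantum model, f would be the expectation value of a parametrized circuit. The classical toy sketch below (our illustration, not the paper's quantum models) builds such a kernel and uses it for regression.

```python
# Empirical tangent kernel of a toy model f(x; theta) = sum_j sin(theta_j x)/sqrt(P),
# whose parameter gradient is x * cos(theta_j x) / sqrt(P).
import numpy as np

rng = np.random.default_rng(3)
P = 40                                       # number of parameters
theta = rng.normal(size=P)

def grad_f(x):
    return x * np.cos(theta * x) / np.sqrt(P)

xs = np.linspace(0.1, 1.2, 8)
J = np.stack([grad_f(x) for x in xs])        # Jacobian, shape (8, P)
K = J @ J.T                                  # tangent-kernel Gram matrix

eigs = np.linalg.eigvalsh(K)                 # PSD by construction (K = J J^T)

# regression with the frozen tangent kernel
y = xs ** 3
alpha = np.linalg.lstsq(K, y, rcond=None)[0]
fit_err = np.max(np.abs(K @ alpha - y))
print(f"min eigenvalue {eigs.min():.2e}, train fit error {fit_err:.2e}")
```

In the infinite-width (or, in the paper's setting, infinite-qubit) limit, this empirical kernel concentrates to a fixed deterministic kernel, and training reduces to kernel regression of exactly this form.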
Lazy vs hasty: linearization in deep networks impacts learning schedule based on example difficulty
Among attempts at giving a theoretical account of the success of deep neural
networks, a recent line of work has identified a so-called `lazy' regime in
which the network can be well approximated by its linearization around
initialization. Here we investigate the comparative effect of the lazy (linear)
and feature learning (non-linear) regimes on subgroups of examples based on
their difficulty. Specifically, we show that easier examples are given more
weight in feature learning mode, resulting in faster training compared to more
difficult ones. In other words, the non-linear dynamics tend to sequentialize
the learning of examples of increasing difficulty. We illustrate this
phenomenon across different ways to quantify example difficulty, including
c-score, label noise, and the presence of spurious correlations. Our results
reveal a new understanding of how deep networks prioritize resources across
levels of example difficulty.
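The "lazy" approximation at the heart of this comparison is the first-order Taylor expansion f_lin(x; θ) = f(x; θ0) + ∇_θ f(x; θ0)·(θ − θ0). The sketch below (our illustration) implements it for a tiny ReLU net and shows that it tracks the network for small parameter displacements but not for large ones, which is what separates the lazy and feature-learning regimes.

```python
# Linearization of a small ReLU net around its initialization, compared to
# the true network at small and large parameter displacements.
import numpy as np

rng = np.random.default_rng(4)
d, h = 5, 32
W0 = rng.normal(size=(h, d)) / np.sqrt(d)
a0 = rng.normal(size=h) / np.sqrt(h)

def f(x, W, a):
    return np.maximum(W @ x, 0.0) @ a

def f_lin(x, W, a):
    pre = W0 @ x
    act = np.maximum(pre, 0.0)
    mask = (pre > 0).astype(float)
    # first-order Taylor expansion around (W0, a0)
    return act @ a0 + act @ (a - a0) + ((W - W0) @ x * mask) @ a0

xs = rng.normal(size=(20, d))
errs = {}
for eps in (1e-3, 1.0):
    dW = eps * rng.normal(size=(h, d))
    da = eps * rng.normal(size=h)
    e = [abs(f(x, W0 + dW, a0 + da) - f_lin(x, W0 + dW, a0 + da)) for x in xs]
    errs[eps] = float(np.mean(e))
    print(f"step size {eps}: mean linearization error {errs[eps]:.2e}")
```

The paper's contribution is to track these two regimes per example: training the linearized model versus the full one and comparing how quickly easy and difficult subgroups are fit.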
Non-negative Least Squares via Overparametrization
In many applications, solutions of numerical problems are required to be
non-negative, e.g., when retrieving pixel intensity values or physical
densities of a substance. In this context, non-negative least squares (NNLS) is
a ubiquitous tool, e.g., when seeking sparse solutions of high-dimensional
statistical problems. Despite vast efforts since the seminal work of Lawson and
Hanson in the '70s, the non-negativity assumption is still an obstacle for the
theoretical analysis and scalability of many off-the-shelf solvers. In the
different context of deep neural networks, we have recently started to see that the
training of overparametrized models via gradient descent leads to surprising
generalization properties and the retrieval of regularized solutions. In this
paper, we prove that, by using an overparametrized formulation, NNLS solutions
can reliably be approximated via vanilla gradient flow. We furthermore
establish stability of the method against negative perturbations of the
ground-truth. Our simulations confirm that this allows the use of vanilla
gradient descent as a novel and scalable numerical solver for NNLS. From a
conceptual point of view, our work proposes a novel approach to trading
side-constraints in optimization problems against the complexity of the
optimization landscape, which does not build upon the concept of Lagrange
multipliers.
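One standard way to realize this overparametrized formulation (a sketch of the idea; the paper's exact parametrization may differ) is the Hadamard-product trick x = u ⊙ u: the squared parametrization makes x ≥ 0 automatic, so vanilla gradient descent needs no projection and no Lagrange multipliers.

```python
# NNLS via overparametrization: minimize ||A (u*u) - b||^2 over unconstrained u,
# so that x = u*u is nonnegative by construction.
import numpy as np

rng = np.random.default_rng(5)
m, n = 60, 20
A = rng.normal(size=(m, n))
x_true = np.maximum(rng.normal(size=n), 0.0)  # nonnegative ground truth
b = A @ x_true

u = np.full(n, 0.1)          # small nonzero init (u = 0 is a stationary point)
lr = 5e-4
for _ in range(40000):
    r = A @ (u * u) - b
    u -= lr * 4.0 * u * (A.T @ r)   # chain rule through x = u*u
x_hat = u * u

print("nonneg:", x_hat.min() >= 0, " residual:", np.linalg.norm(A @ x_hat - b))
```

The nonnegativity constraint is traded for a non-convex landscape in u, which is exactly the exchange the abstract describes; gradient descent nevertheless converges reliably here because the problem is convex in x = u ⊙ u.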