
    Anonymous Learning via Look-Alike Clustering: A Precise Analysis of Model Generalization

    Full text link
    While personalized recommendation systems have become increasingly popular, ensuring user data protection remains a top concern in the development of these learning systems. A common approach to enhancing privacy involves training models using anonymous data rather than individual data. In this paper, we explore a natural technique called \emph{look-alike clustering}, which involves replacing sensitive features of individuals with the cluster's average values. We provide a precise analysis of how training models using anonymous cluster centers affects their generalization capabilities. We focus on an asymptotic regime where the size of the training set grows in proportion to the feature dimension. Our analysis is based on the Convex Gaussian Minimax Theorem (CGMT) and allows us to theoretically understand the role of different model components in the generalization error. In addition, we demonstrate that in certain high-dimensional regimes, training over anonymous cluster centers acts as a regularizer and improves the generalization error of the trained models. Finally, we corroborate our asymptotic theory with finite-sample numerical experiments, where we observe a perfect match even when the sample size is only on the order of a few hundred. Comment: Accepted at the Conference on Neural Information Processing Systems (NeurIPS 2023).
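
    As a concrete illustration of the pipeline the abstract describes, the sketch below anonymizes a synthetic regression dataset by replacing each individual's sensitive features with its cluster average and then trains on the anonymized data. The cluster count, the split into sensitive and public features, and the use of KMeans and ridge regression are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of look-alike clustering on a synthetic linear-regression task:
# sensitive features are replaced by their cluster averages before training.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d_sens, d_pub = 600, 20, 10                    # sample size and feature split (assumed)
X_sens = rng.normal(size=(n, d_sens))             # sensitive features to be anonymized
X_pub = rng.normal(size=(n, d_pub))               # features kept at the individual level
w = rng.normal(size=d_sens + d_pub)
y = np.hstack([X_sens, X_pub]) @ w + 0.5 * rng.normal(size=n)

split = n // 2                                    # first half for training, second for testing
k = 30                                            # number of look-alike clusters (assumed)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_sens[:split])

# Replace each training individual's sensitive features with its cluster center.
X_sens_anon = X_sens[:split].copy()
for c in range(k):
    X_sens_anon[labels == c] = X_sens[:split][labels == c].mean(axis=0)

X_test = np.hstack([X_sens[split:], X_pub[split:]])
for name, X_train in [("individual", np.hstack([X_sens[:split], X_pub[:split]])),
                      ("look-alike", np.hstack([X_sens_anon, X_pub[:split]]))]:
    model = Ridge(alpha=1e-3).fit(X_train, y[:split])
    mse = np.mean((model.predict(X_test) - y[split:]) ** 2)
    print(f"{name:12s} test MSE: {mse:.3f}")
```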

    Shallow Univariate ReLU Networks as Splines: Initialization, Loss Surface, Hessian, and Gradient Flow Dynamics

    Get PDF
    Understanding the learning dynamics and inductive bias of neural networks (NNs) is hindered by the opacity of the relationship between NN parameters and the function they represent. This is partly due to symmetries inherent in the NN parameterization, which allow many different parameter settings to produce an identical output function, leaving the relationship unclear and the parameterization with redundant degrees of freedom. The NN parameterization is invariant under two symmetries: permutation of the neurons and a continuous family of transformations of the scale of the weight and bias parameters. We propose taking a quotient with respect to the second symmetry group and reparametrizing ReLU NNs as continuous piecewise linear splines. Using this spline lens, we study learning dynamics in shallow univariate ReLU NNs, finding unexpected insights and explanations for several perplexing phenomena. We develop a surprisingly simple and transparent view of the structure of the loss surface, including its critical and fixed points, Hessian, and Hessian spectrum. We also show that standard weight initializations yield very flat initial functions, and that this flatness, together with overparametrization and the initial weight scale, is responsible for the strength and type of implicit regularization, consistent with previous work. Our implicit regularization results are complementary to recent work showing that initialization scale critically controls implicit regularization via a kernel-based argument. Overall, removing the weight-scale symmetry enables us to prove these results more simply, to prove new results, and to gain new insights, while offering a far more transparent and intuitive picture. Looking forward, our quotiented spline-based approach will extend naturally to the multivariate and deep settings, and alongside the kernel-based view, we believe it will play a foundational role in efforts to understand neural networks. Videos of learning dynamics using a spline-based visualization are available at http://shorturl.at/tFWZ2
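
    The core reparameterization can be stated concretely: each hidden unit of a shallow univariate ReLU network contributes one knot, so the whole network is a continuous piecewise linear spline. The sketch below uses a randomly initialized toy network and only illustrates this parameters-to-spline map (the paper's exact quotient construction may differ), checking that the spline form reproduces the network output.

```python
# A minimal sketch of the spline view of a shallow univariate ReLU network:
# unit i contributes v_i * relu(w_i * x + b_i), i.e. a kink at x = -b_i / w_i.
import numpy as np

rng = np.random.default_rng(1)
H = 8                                        # number of hidden units (toy choice)
w, b, v, c = rng.normal(size=H), rng.normal(size=H), rng.normal(size=H), rng.normal()

def net(x):
    """Evaluate the network in parameter space."""
    return np.maximum(0.0, np.outer(x, w) + b) @ v + c

# Spline data: knot locations and the slope added when crossing a knot left to right.
knots = -b / w
slope_change = v * np.abs(w)                 # w_i > 0: the unit turns on; w_i < 0: it turns off

def spline(x):
    """Evaluate the same function from its knots."""
    x = np.asarray(x, dtype=float)
    active = ((w > 0) & (x[:, None] > knots)) | ((w < 0) & (x[:, None] < knots))
    contrib = (x[:, None] - knots) * v * w   # value contributed by an active unit
    return np.where(active, contrib, 0.0).sum(axis=1) + c

xs = np.linspace(-3.0, 3.0, 201)
print("parameter and spline forms agree:", np.allclose(net(xs), spline(xs)))
order = np.argsort(knots)
print("knots:", np.round(knots[order], 2))
print("slope changes at knots:", np.round(slope_change[order], 2))
```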

    Geometric compression of invariant manifolds in neural nets

    Full text link
    We study how neural networks compress uninformative input space in models where data lie in $d$ dimensions, but whose labels only vary within a linear manifold of dimension $d_\parallel < d$. We show that for a one-hidden-layer network initialized with infinitesimal weights (i.e. in the \textit{feature learning} regime) trained with gradient descent, the uninformative $d_\perp = d - d_\parallel$ space is compressed by a factor $\lambda \sim \sqrt{p}$, where $p$ is the size of the training set. We quantify the benefit of such a compression on the test error $\epsilon$. For large initialization of the weights (the \textit{lazy training} regime), no compression occurs, and for regular boundaries separating labels we find that $\epsilon \sim p^{-\beta}$, with $\beta_\text{Lazy} = d/(3d-2)$. Compression improves the learning curves so that $\beta_\text{Feature} = (2d-1)/(3d-2)$ if $d_\parallel = 1$ and $\beta_\text{Feature} = (d + d_\perp/2)/(3d-2)$ if $d_\parallel > 1$. We test these predictions for a stripe model where boundaries are parallel interfaces ($d_\parallel = 1$) as well as for a cylindrical boundary ($d_\parallel = 2$). Next we show that compression shapes the evolution of the Neural Tangent Kernel (NTK) in time, so that its top eigenvectors become more informative and display a larger projection onto the labels. Consequently, kernel learning with the NTK frozen at the end of training outperforms the initial NTK. We confirm these predictions both for a one-hidden-layer FC network trained on the stripe model and for a 16-layer CNN trained on MNIST, for which we also find $\beta_\text{Feature} > \beta_\text{Lazy}$. The great similarity found in these two cases supports the idea that compression is central to the training of MNIST, and puts forward kernel-PCA on the evolving NTK as a useful diagnostic of compression in deep nets.
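
    A toy version of the stripe-model setup above can be simulated directly. The sketch below trains a one-hidden-layer network on labels that depend only on the first coordinate and compares, for a small and a larger initialization scale, how strongly the first-layer weights concentrate on that informative direction. The width, learning rate, number of steps, and the alignment ratio used as a compression proxy are illustrative assumptions that may need tuning, and the larger initialization is only a rough stand-in for the lazy regime.

```python
# A minimal sketch of the stripe model (d_parallel = 1): labels vary only along x_1,
# and compression is proxied by how much first-layer weight mass ends up on x_1.
import torch

torch.manual_seed(0)
d, p, width = 10, 512, 128                   # input dimension, training set size, hidden width
X = torch.randn(p, d)
y = torch.sign(torch.sin(2.0 * X[:, 0]))     # stripe labels: depend on the first coordinate only

def informative_alignment(init_scale, steps=5000, lr=0.03):
    """Train a one-hidden-layer ReLU net with full-batch GD and return the ratio of
    first-layer weight norm along x_1 to the average norm along the other directions."""
    W = (init_scale * torch.randn(width, d) / d ** 0.5).requires_grad_()
    a = (init_scale * torch.randn(width) / width ** 0.5).requires_grad_()
    opt = torch.optim.SGD([W, a], lr=lr)
    for _ in range(steps):
        loss = ((torch.relu(X @ W.T) @ a - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (W[:, 0].norm() / (W[:, 1:].norm() / (d - 1) ** 0.5)).item()

print("small init (feature learning):", informative_alignment(1e-2))
print("larger init (closer to lazy) :", informative_alignment(1.0))
```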

    Local Convolutions Cause an Implicit Bias towards High Frequency Adversarial Examples

    Full text link
    Adversarial attacks remain a significant challenge for neural networks. Recent work has shown that adversarial perturbations typically contain high-frequency features, but the root cause of this phenomenon remains unknown. Inspired by theoretical work on linear full-width convolutional models, we hypothesize that the local (i.e. bounded-width) convolutional operations commonly used in current neural networks are implicitly biased to learn high-frequency features, and that this is one of the root causes of high-frequency adversarial examples. To test this hypothesis, we analyzed the impact of different choices of linear and non-linear architectures on the implicit bias of the learned features and the adversarial perturbations, in both the spatial and frequency domains. We find that the high-frequency adversarial perturbations are critically dependent on the convolution operation, because the spatially limited nature of local convolutions induces an implicit bias towards high-frequency features. The explanation for the latter involves the Fourier uncertainty principle: a spatially limited (local in the space domain) filter cannot also be frequency-limited (local in the frequency domain). Furthermore, using larger convolution kernel sizes or avoiding convolutions altogether (e.g. by using Vision Transformer architectures) significantly reduces this high-frequency bias, but not the overall susceptibility to attacks. Looking forward, our work strongly suggests that understanding and controlling the implicit bias of architectures will be essential for achieving adversarial robustness. Comment: 20 pages, 11 figures, 12 tables.
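
    The Fourier-uncertainty argument in the abstract can be illustrated numerically without any network at all: a spatially narrow filter necessarily spreads its energy across frequencies. The sketch below compares 1-D moving-average kernels of different widths; the kernel shape, signal length, and the 10% frequency cutoff used as a concentration metric are arbitrary illustrative choices.

```python
# A minimal sketch of the uncertainty-principle intuition: zero-pad box filters of
# different spatial widths to a common length and compare how much of their spectral
# energy lies outside the lowest 10% of frequencies.
import numpy as np

N = 128                                       # common signal length

def high_freq_energy(width):
    kernel = np.zeros(N)
    kernel[:width] = 1.0 / width              # box (moving-average) filter of the given width
    spectrum = np.abs(np.fft.rfft(kernel)) ** 2
    cutoff = max(1, len(spectrum) // 10)      # "low frequency" band: lowest 10% of bins
    return spectrum[cutoff:].sum() / spectrum.sum()

for width in (3, 7, 15, 63):
    print(f"kernel width {width:2d}: fraction of energy at high frequencies = "
          f"{high_freq_energy(width):.3f}")
```

    Narrower kernels leak substantially more energy into the high-frequency band, mirroring the claim that bounded-width convolutions favor high-frequency features.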

    Non-negative Least Squares via Overparametrization

    Full text link
    In many applications, solutions of numerical problems are required to be non-negative, e.g., when retrieving pixel intensity values or physical densities of a substance. In this context, non-negative least squares (NNLS) is a ubiquitous tool, e.g., when seeking sparse solutions of high-dimensional statistical problems. Despite vast efforts since the seminal work of Lawson and Hanson in the '70s, the non-negativity assumption is still an obstacle for the theoretical analysis and scalability of many off-the-shelf solvers. In the different context of deep neural networks, it has recently become apparent that training overparametrized models via gradient descent leads to surprising generalization properties and the retrieval of regularized solutions. In this paper, we prove that, by using an overparametrized formulation, NNLS solutions can reliably be approximated via vanilla gradient flow. We furthermore establish stability of the method against negative perturbations of the ground truth. Our simulations confirm that this allows the use of vanilla gradient descent as a novel and scalable numerical solver for NNLS. From a conceptual point of view, our work proposes a novel approach to trading side constraints in optimization problems against the complexity of the optimization landscape, which does not build upon the concept of Lagrange multipliers.
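
    One common way to realize the overparametrized idea, and the one sketched below, is to write the non-negative variable as an entrywise square, x = u ⊙ u, which removes the constraint, and then run plain gradient descent on u. The problem size, step size, iteration count, and the comparison with SciPy's active-set NNLS solver are illustrative choices rather than the paper's exact algorithm or guarantees.

```python
# A minimal sketch of NNLS via overparametrization: minimize 0.5 * ||A(u*u) - b||^2
# over an unconstrained u, so that x = u*u is non-negative by construction.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
m, n = 100, 40
A = rng.normal(size=(m, n))
x_true = np.maximum(0.0, rng.normal(size=n))   # sparse-ish non-negative ground truth
b = A @ x_true + 0.01 * rng.normal(size=m)

u = np.full(n, 1e-3)                           # small non-zero initialization
lr = 0.1 / np.linalg.norm(A, 2) ** 2           # conservative step size (may need tuning)
for _ in range(50_000):
    residual = A @ (u * u) - b
    u -= lr * 2.0 * u * (A.T @ residual)       # chain rule through x = u*u

x_gd = u * u
x_ref, _ = nnls(A, b)                          # classical active-set solver for reference
print("min entry of gradient-descent solution:", x_gd.min())   # >= 0 by construction
print("relative distance to NNLS solution:",
      np.linalg.norm(x_gd - x_ref) / np.linalg.norm(x_ref))
```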

    Lazy vs hasty: linearization in deep networks impacts learning schedule based on example difficulty

    Full text link
    Among attempts at giving a theoretical account of the success of deep neural networks, a recent line of work has identified a so-called `lazy' regime in which the network can be well approximated by its linearization around initialization. Here we investigate the comparative effect of the lazy (linear) and feature-learning (non-linear) regimes on subgroups of examples based on their difficulty. Specifically, we show that easier examples are given more weight in the feature-learning mode, resulting in faster training compared to more difficult ones. In other words, the non-linear dynamics tends to sequentialize the learning of examples of increasing difficulty. We illustrate this phenomenon across different ways of quantifying example difficulty, including the c-score, label noise, and the presence of spurious correlations. Our results reveal a new understanding of how deep networks prioritize resources across example difficulty.
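
    One simple way to probe the lazy versus feature-learning contrast described above is the standard output-scaling trick: multiply the centered network output by a large factor alpha (with the learning rate rescaled by 1/alpha^2) to push training toward the linearized regime. The sketch below does this on a synthetic task where "difficult" examples are simulated by flipping a subset of labels; the architecture, alpha values, and schedules are illustrative assumptions, and the output should be read only qualitatively.

```python
# A minimal sketch comparing a small-alpha (feature learning) and a large-alpha
# (near-lazy) run, tracking training loss separately on clean and label-flipped examples.
import torch

torch.manual_seed(0)
n, d, width = 256, 20, 512
X = torch.randn(n, d)
y = torch.sign(X @ (torch.randn(d) / d ** 0.5))
hard = torch.zeros(n, dtype=torch.bool)
hard[: n // 5] = True
y[hard] = -y[hard]                              # flipped labels play the role of hard examples

def group_losses(alpha, steps=2000, lr=0.05):
    net = torch.nn.Sequential(torch.nn.Linear(d, width), torch.nn.ReLU(),
                              torch.nn.Linear(width, 1))
    f0 = net(X).detach().squeeze(1)             # output at initialization, used for centering
    opt = torch.optim.SGD(net.parameters(), lr=lr / alpha ** 2)
    for _ in range(steps):
        out = alpha * (net(X).squeeze(1) - f0)
        loss = ((out - y) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    out = (alpha * (net(X).squeeze(1) - f0)).detach()
    easy = ((out[~hard] - y[~hard]) ** 2).mean().item()
    tough = ((out[hard] - y[hard]) ** 2).mean().item()
    return easy, tough

for alpha in (1.0, 100.0):
    easy, tough = group_losses(alpha)
    print(f"alpha={alpha:6.1f}  clean-example loss={easy:.3f}  flipped-example loss={tough:.3f}")
```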