SWAP: Sparse Entropic Wasserstein Regression for Robust Network Pruning
This study addresses the challenge of inaccurate gradients in computing the
empirical Fisher Information Matrix during neural network pruning. We introduce
SWAP, a formulation of Entropic Wasserstein regression (EWR) for pruning,
capitalizing on the geometric properties of the optimal transport problem. The
"swap" of the commonly used linear regression for EWR in the optimization is
shown analytically to mitigate noise by interpolating across neighboring data
points, at only marginal additional
computational cost. The unique strength of SWAP is its intrinsic ability to
balance noise reduction and covariance information preservation effectively.
Extensive experiments on various networks and datasets show that SWAP performs
comparably to state-of-the-art (SoTA) network pruning algorithms. Our method
outperforms the SoTA when the network size or the target sparsity is large, and
the gain grows further in the presence of noisy gradients, which may arise from
noisy data, analog memory, or adversarial attacks. Notably, our method achieves
a 6% improvement in accuracy and an 8% improvement in test loss for MobileNetV1
with less than one-fourth of the network parameters remaining.
Comment: Published as a conference paper at ICLR 202
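To make the "swap" concrete, below is a minimal, self-contained sketch of entropic Wasserstein regression replacing ordinary least squares on a toy problem. The alternating Sinkhorn/refit scheme, dimensions, and noise model are our own illustrative assumptions, not the SWAP implementation from the paper.

```python
# Toy entropic Wasserstein regression (EWR) vs. ordinary least squares.
# Illustrative sketch only; hyperparameters and scheme are assumptions.
import numpy as np

def sinkhorn_plan(a, b, C, eps=0.05, n_iter=200):
    """Entropic OT coupling between histograms a and b for cost matrix C."""
    C = C / C.max()                                   # scale costs to avoid underflow
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def ewr_fit(X, y, eps=0.05, n_outer=30):
    """Alternate a Sinkhorn coupling of predictions to targets with a
    least-squares refit against the coupling's barycentric targets."""
    n = len(y)
    a = b = np.full(n, 1.0 / n)
    w = np.linalg.lstsq(X, y, rcond=None)[0]          # ordinary least-squares warm start
    for _ in range(n_outer):
        pred = X @ w
        C = (pred[:, None] - y[None, :]) ** 2         # squared-distance transport cost
        P = sinkhorn_plan(a, b, C, eps)
        y_bar = (P @ y) / P.sum(axis=1)               # neighborhood-interpolated targets
        w = np.linalg.lstsq(X, y_bar, rcond=None)[0]  # refit against smoothed targets
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.standard_t(df=2, size=200)  # heavy-tailed noise
print("EWR estimate:", ewr_fit(X, y))
print("OLS estimate:", np.linalg.lstsq(X, y, rcond=None)[0])
```

The barycentric projection step is where the neighborhood interpolation enters: each target is replaced by a transport-weighted average of nearby targets before the refit, which damps the effect of noisy observations.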
The Gaussian equivalence of generative models for learning with shallow neural networks
Understanding the impact of data structure on the computational tractability
of learning is a key challenge for the theory of neural networks. Many
theoretical works do not explicitly model training data, or assume that inputs
are drawn component-wise independently from some simple probability
distribution. Here, we go beyond this simple paradigm by studying the
performance of neural networks trained on data drawn from pre-trained
generative models. This is possible due to a Gaussian equivalence stating that
the key metrics of interest, such as the training and test errors, can be fully
captured by an appropriately chosen Gaussian model. We provide three strands of
rigorous, analytical and numerical evidence corroborating this equivalence.
First, we establish rigorous conditions for the Gaussian equivalence to hold in
the case of single-layer generative models, as well as deterministic rates for
convergence in distribution. Second, we leverage this equivalence to derive a
closed set of equations describing the generalisation performance of two widely
studied machine learning problems: two-layer neural networks trained using
one-pass stochastic gradient descent, and full-batch pre-learned features or
kernel methods. Finally, we perform experiments demonstrating how our theory
applies to deep, pre-trained generative models. These results open a viable
path to the theoretical study of machine learning models with realistic data.
Comment: The accompanying code for this paper is available at
https://github.com/sgoldt/gaussian-equiv-2laye
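As a rough numerical illustration of the Gaussian equivalence described above, the sketch below compares a simple ridge learner trained on data from a single-layer generative model x = tanh(Wz) against the same learner trained on Gaussian data with matched mean and covariance. The dimensions, the linear teacher, and the ridge learner are hypothetical choices made for brevity; the paper's own code is at the repository linked above.

```python
# Toy check: a ridge learner on generator data vs. on a moment-matched Gaussian.
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_input, n_train, n_test = 32, 128, 2000, 2000
W = rng.normal(size=(d_input, d_latent)) / np.sqrt(d_latent)
teacher = rng.normal(size=d_input) / np.sqrt(d_input)     # hypothetical linear teacher

def generator(n):
    """Single-layer generative model: x = tanh(W z) with Gaussian latents z."""
    z = rng.normal(size=(n, d_latent))
    return np.tanh(z @ W.T)

def ridge_test_error(X_tr, X_te, lam=1e-2):
    y_tr = X_tr @ teacher + 0.1 * rng.normal(size=len(X_tr))
    y_te = X_te @ teacher
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d_input), X_tr.T @ y_tr)
    return np.mean((X_te @ w - y_te) ** 2)

# Data drawn from the generative model
Xg_tr, Xg_te = generator(n_train), generator(n_test)

# Gaussian surrogate with matched mean and covariance
X_big = generator(20000)
mu, cov = X_big.mean(axis=0), np.cov(X_big, rowvar=False)
Xn_tr = rng.multivariate_normal(mu, cov, size=n_train)
Xn_te = rng.multivariate_normal(mu, cov, size=n_test)

print("test error, generator data :", ridge_test_error(Xg_tr, Xg_te))
print("test error, Gaussian model :", ridge_test_error(Xn_tr, Xn_te))
```

Under the equivalence, the two printed test errors should be close, since the learner's performance depends on the data only through its first and second moments in this regime.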
A Framework for Statistical Inference via Randomized Algorithms
Randomized algorithms, such as randomized sketching or projections, are a
promising approach to ease the computational burden in analyzing large
datasets. However, randomized algorithms also produce non-deterministic
outputs, leading to the problem of evaluating their accuracy. In this paper, we
develop a statistical inference framework for quantifying the uncertainty of
the outputs of randomized algorithms. We develop appropriate statistical
methods -- sub-randomization, multi-run plug-in and multi-run aggregation
inference -- by using multiple runs of the same randomized algorithm, or by
estimating the unknown parameters of the limiting distribution. As an example,
we develop methods for statistical inference for least squares parameters via
random sketching using matrices with i.i.d. entries, or uniform partial
orthogonal matrices. For this, we characterize the limiting distribution of
estimators obtained via sketch-and-solve as well as partial sketching methods.
The analysis of i.i.d. sketches uses a trigonometric interpolation argument to
establish a differential equation for the limiting expected characteristic
function and find the dependence on the kurtosis of the entries of the
sketching matrix. The results are supported by a broad range of simulations.
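The sketch-and-solve step and the idea of reusing multiple runs of the same randomized algorithm can be illustrated as follows. This is a schematic toy (the Gaussian sketch, the problem sizes, and the percentile-based spread are our own simplifications), not the paper's sub-randomization or plug-in procedures.

```python
# Sketch-and-solve least squares with an i.i.d. Gaussian sketch, repeated over
# independent sketches so the spread across runs reflects the sketching randomness.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5000, 10, 300                               # data size, features, sketch size
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + rng.normal(size=n)

def sketch_and_solve(X, y, m, rng):
    """Solve least squares on the sketched pair (S X, S y) for an i.i.d. Gaussian S."""
    S = rng.normal(size=(m, X.shape[0])) / np.sqrt(m)
    return np.linalg.lstsq(S @ X, S @ y, rcond=None)[0]

# Multiple independent runs of the same randomized algorithm
runs = np.stack([sketch_and_solve(X, y, m, rng) for _ in range(100)])
lo, hi = np.percentile(runs, [2.5, 97.5], axis=0)     # spread induced by the sketch

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print("full-data OLS, coord 0  :", beta_full[0])
print("sketched mean, coord 0  :", runs.mean(axis=0)[0])
print("95% spread, coord 0     :", (lo[0], hi[0]))
```

The repeated-run spread is the raw ingredient that the multi-run methods in the paper turn into formal confidence statements about the randomized output.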
Divergence Measures
Data science, information theory, probability theory, statistical learning, and other related disciplines greatly benefit from non-negative measures of dissimilarity between pairs of probability measures. These are known as divergence measures, and exploring their mathematical foundations and diverse applications is of significant interest. The present Special Issue, entitled “Divergence Measures: Mathematical Foundations and Applications in Information-Theoretic and Statistical Problems”, includes eight original contributions, and it is focused on the study of the mathematical properties and applications of classical and generalized divergence measures from an information-theoretic perspective. It mainly deals with two key generalizations of the relative entropy: namely, the Rényi divergence and the important class of f-divergences. It is our hope that the readers will find interest in this Special Issue, which will stimulate further research in the study of the mathematical foundations and applications of divergence measures.
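For readers less familiar with the quantities mentioned above, the toy snippet below evaluates the relative entropy (KL divergence), the Rényi divergence of order alpha, and a generic f-divergence for two small discrete distributions; the particular distributions are arbitrary illustrative examples.

```python
# Relative entropy, Rényi divergence, and an f-divergence on toy discrete distributions.
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log(p / q))                              # relative entropy D(P||Q)

def renyi(p, q, alpha):
    """Rényi divergence of order alpha: log(sum p^a q^(1-a)) / (a - 1)."""
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_i q_i f(p_i / q_i)."""
    return np.sum(q * f(p / q))

print("KL(P||Q)            :", kl)
print("Renyi, alpha=0.999  :", renyi(p, q, 0.999))          # approaches KL as alpha -> 1
print("Renyi, alpha=2      :", renyi(p, q, 2.0))
print("f-div with t*log(t) :", f_divergence(p, q, lambda t: t * np.log(t)))  # recovers KL
```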