Implicit Gradient Regularization
Gradient descent can be surprisingly good at optimizing deep neural networks
without overfitting and without explicit regularization. We find that the
discrete steps of gradient descent implicitly regularize models by penalizing
gradient descent trajectories that have large loss gradients. We call this
Implicit Gradient Regularization (IGR) and we use backward error analysis to
calculate the size of this regularization. We confirm empirically that implicit
gradient regularization biases gradient descent toward flat minima, where test
errors are small and solutions are robust to noisy parameter perturbations.
Furthermore, we demonstrate that the implicit gradient regularization term can
be used as an explicit regularizer, allowing us to control this gradient
regularization directly. More broadly, our work indicates that backward error
analysis is a useful theoretical approach to the perennial question of how
learning rate, model size, and parameter regularization interact to determine
the properties of overparameterized models optimized with gradient descent.
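A minimal JAX sketch of the explicit variant of this idea: the squared norm of the loss gradient is added to the base loss as a penalty. The toy model, function names, and coefficient `mu` are illustrative assumptions; in the paper's backward error analysis the regularization strength is tied to the learning rate rather than being a free knob.

```python
# Sketch: using the implicit-gradient-regularization term as an explicit
# regularizer, i.e. minimizing  L(theta) + mu * ||grad L(theta)||^2.
import jax
import jax.numpy as jnp

def base_loss(params, x, y):
    # Toy linear model with squared error, standing in for a network.
    return jnp.mean((x @ params - y) ** 2)

def regularized_loss(params, x, y, mu=0.1):
    g = jax.grad(base_loss)(params, x, y)   # gradient of the base loss
    return base_loss(params, x, y) + mu * jnp.sum(g ** 2)

# One gradient step on the regularized loss (differentiates through g,
# so this is a second-order computation handled by JAX autodiff).
params = jnp.zeros(3)
x, y = jnp.ones((8, 3)), jnp.ones(8)
params = params - 0.01 * jax.grad(regularized_loss)(params, x, y)
```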
Why neural networks find simple solutions: the many regularizers of geometric complexity
In many contexts, simpler models are preferable to more complex models and
the control of this model complexity is the goal for many methods in machine
learning such as regularization, hyperparameter tuning and architecture design.
In deep learning, it has been difficult to understand the underlying mechanisms
of complexity control, since many traditional measures are not naturally
suitable for deep neural networks. Here we develop the notion of geometric
complexity, which is a measure of the variability of the model function,
computed using a discrete Dirichlet energy. Using a combination of theoretical
arguments and empirical results, we show that many common training heuristics
such as parameter norm regularization, spectral norm regularization, flatness
regularization, implicit gradient regularization, noise regularization and the
choice of parameter initialization all act to control geometric complexity,
providing a unifying framework in which to characterize the behavior of deep
learning models.
Comment: Accepted as a NeurIPS 2022 paper.
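A minimal JAX sketch of a discrete Dirichlet energy of this kind, computed as the batch mean of the squared Frobenius norm of the model's input Jacobian. The toy network and the name `geometric_complexity` are assumptions for illustration, not the paper's implementation.

```python
# Sketch: discrete Dirichlet energy as a measure of the variability of
# the model function f over a batch of inputs.
import jax
import jax.numpy as jnp

def f(params, x):
    # Toy one-hidden-layer network, standing in for a deep model.
    w1, w2 = params
    return jnp.tanh(x @ w1) @ w2

def geometric_complexity(params, xs):
    # Mean squared Frobenius norm of the input Jacobian over the batch.
    jac = jax.vmap(jax.jacobian(f, argnums=1), in_axes=(None, 0))(params, xs)
    return jnp.mean(jnp.sum(jac ** 2, axis=(-2, -1)))

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
params = (jax.random.normal(k1, (4, 16)), jax.random.normal(k2, (16, 2)))
xs = jax.random.normal(k3, (32, 4))
print(geometric_complexity(params, xs))  # scalar Dirichlet-energy estimate
```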
Distinct Quantum States Can Be Compatible with a Single State of Reality
Perhaps the quantum state represents information about reality, and not
reality directly. Wave function collapse is then possibly no more mysterious
than a Bayesian update of a probability distribution given new data. We
consider models for quantum systems with measurement outcomes determined by an
underlying physical state of the system but where several quantum states are
consistent with a single underlying state, i.e., probability distributions for
distinct quantum states overlap. Significantly, we demonstrate by example that
additional assumptions are always necessary to rule out such a model.
Comment: 5 pages, 2 figures.
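For concreteness, a sketch of the standard ontological-models setting in which such arguments are usually framed (the notation follows the Harrigan-Spekkens formulation and is assumed here, not spelled out in the abstract):

```latex
% A preparation of |\psi\rangle samples an ontic state \lambda from a
% distribution \mu_\psi, and measurement outcomes depend on the system
% only through \lambda via response functions \xi_M, reproducing the
% Born rule (written here for a projective measurement):
\[
  \Pr(k \mid \psi, M)
  = \int \xi_M(k \mid \lambda)\, \mu_\psi(\lambda)\, \mathrm{d}\lambda
  = |\langle k | \psi \rangle|^2 .
\]
% The models in question are those where, for some distinct states
% |\psi\rangle and |\phi\rangle, the distributions overlap on a set of
% nonzero measure:
\[
  \int \min\bigl(\mu_\psi(\lambda), \mu_\phi(\lambda)\bigr)\,
  \mathrm{d}\lambda > 0 .
\]
```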
Pseudospectral Calculation of the Wavefunction of Helium and the Negative Hydrogen Ion
We study the numerical solution of the non-relativistic Schrödinger
equation for two-electron atoms in ground and excited S-states using
pseudospectral (PS) methods of calculation. The calculation achieves
convergence rates for the energy, Cauchy error in the wavefunction, and
variance in local energy that are exponentially fast for all practical
purposes. The method requires three separate subdomains to handle the
wavefunction's cusp-like behavior near the two-particle coalescences. The use
of three subdomains is essential to maintaining exponential convergence. A
comparison of several different treatments of the cusps and the semi-infinite
domain suggests that the simplest prescription is sufficient. For many purposes
it proves unnecessary to handle the logarithmic behavior near the
three-particle coalescence in a special way. The PS method has many virtues: no
explicit assumptions need be made about the asymptotic behavior of the
wavefunction near cusps or at large distances; the local energy is exactly
equal to the calculated global energy at all collocation points; local errors
go down everywhere with increasing resolution; the effective basis using
Chebyshev polynomials is complete and simple; and the method is easily
extensible to other bound states. This study serves as a proof-of-principle of
the method for more general two- and possibly three-electron applications.
Comment: 23 pages, 20 figures, 2 tables. Final refereed version: some
references added, some stylistic changes, a paragraph added to the matrix
methods section, and a last sentence added to the abstract.