We investigate the learning dynamics of fully-connected neural networks
through the lens of gradient signal-to-noise ratio (SNR), examining the
behavior of first-order optimizers such as Adam on non-convex objectives. By
interpreting the drift/diffusion phases of information bottleneck theory with a
focus on gradient homogeneity, we identify a third phase, termed ``total
diffusion'', characterized by equilibrium in the learning rates and homogeneous
gradients. This phase is marked by an abrupt SNR increase, uniform residuals
across the sample space and the most rapid training convergence. We propose a
residual-based re-weighting scheme to accelerate this diffusion in quadratic
loss functions, enhancing generalization. We also explore the information
compression phenomenon, pinpointing a significant saturation-induced
compression of activations at the total diffusion phase, with deeper layers
experiencing negligible information loss. Supported by experimental data on
physics-informed neural networks (PINNs), which underscore the importance of
gradient homogeneity due to their PDE-based sample interdependence, our
findings suggest that recognizing phase transitions could refine ML
optimization strategies for improved generalization.
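To make the gradient-SNR diagnostic and the idea of residual-based re-weighting concrete, the sketch below applies them to a toy linear-regression problem with a quadratic (MSE) loss. The specific weighting rule (weights proportional to normalized absolute residuals), the learning rate, and all variable names are illustrative assumptions, not the exact scheme used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem with a linear model and quadratic (MSE) loss.
X = rng.normal(size=(256, 4))
true_w = rng.normal(size=4)
y = X @ true_w + 0.1 * rng.normal(size=256)

w = np.zeros(4)


def per_sample_grads(w):
    """Per-sample gradients of 0.5 * r_i^2 w.r.t. w, and the residuals r_i."""
    r = X @ w - y                     # residuals r_i = f(x_i; w) - y_i
    return r[:, None] * X, r          # grad_i = r_i * x_i


def gradient_snr(grads):
    """Gradient SNR per parameter: |mean over samples| / std over samples."""
    mu = grads.mean(axis=0)
    sigma = grads.std(axis=0) + 1e-12
    return np.abs(mu) / sigma


# Hypothetical residual-based re-weighting: emphasize samples with large
# residuals so that per-sample gradients become more homogeneous.
for step in range(2000):
    grads, r = per_sample_grads(w)
    lam = np.abs(r) / (np.abs(r).max() + 1e-12)   # weights in [0, 1]
    weighted_grad = (lam[:, None] * grads).mean(axis=0)
    w -= 0.05 * weighted_grad
    if step % 500 == 0:
        print(step, gradient_snr(grads).mean(), 0.5 * np.mean(r**2))
```

In this toy setting, high SNR indicates that per-sample gradients point in a consistent direction (homogeneous gradients), which is the signature associated above with the total-diffusion phase.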