Learning Dynamics in Linear VAE: Posterior Collapse Threshold, Superfluous Latent Space Pitfalls, and Speedup with KL Annealing
Variational autoencoders (VAEs) face a notorious problem wherein the
variational posterior often aligns closely with the prior, a phenomenon known
as posterior collapse, which hinders the quality of representation learning. To
mitigate this problem, an adjustable hyperparameter β weighting the KL term and
a strategy for annealing it, called KL annealing, have been proposed. This study
presents a theoretical analysis of the learning dynamics in a minimal VAE. It
is rigorously proved that the dynamics converge to a deterministic process
in the limit of large input dimensions, thereby enabling a detailed
dynamical analysis of the generalization error. Furthermore, the analysis shows
that the VAE initially learns entangled representations and gradually acquires
disentangled representations. A fixed-point analysis of the deterministic
process reveals that when β exceeds a certain threshold, posterior
collapse becomes inevitable regardless of the learning period. Additionally,
latent variables that are superfluous to the data-generative factors lead to
overfitting of the background noise, which adversely affects both generalization
and learning convergence. The analysis further reveals that appropriately
tuned KL annealing can accelerate convergence.
Comment: 24 pages, 5 figures
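To make the annealed objective concrete, the following is a minimal sketch of a VAE loss with a linearly annealed KL weight β. The schedule, the function names, and the Gaussian ELBO terms are illustrative assumptions, not the paper's exact minimal-VAE setup.

```python
import numpy as np

def kl_weight(step, warmup_steps=1000, beta_max=1.0):
    # Linear KL annealing: ramp beta from 0 to beta_max over warmup_steps,
    # then hold it constant (illustrative schedule).
    return beta_max * min(1.0, step / warmup_steps)

def neg_elbo(x, x_recon, mu, log_var, beta):
    # Negative ELBO with an adjustable KL weight beta, averaged over the batch.
    recon = np.sum((x - x_recon) ** 2, axis=1)  # squared reconstruction error
    # KL divergence between N(mu, diag(exp(log_var))) and the N(0, I) prior;
    # posterior collapse corresponds to mu -> 0, log_var -> 0, i.e. kl -> 0.
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return np.mean(recon + beta * kl)
```

Annealing β upward from zero lets the reconstruction term dominate early training, which is one intuition for how a tuned schedule can keep the dynamics away from the collapsed fixed point and, per the analysis above, accelerate convergence.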
Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on GLMs and multi-index models
We analyze the dynamics of streaming stochastic gradient descent (SGD) in the
high-dimensional limit when applied to generalized linear models and
multi-index models (e.g. logistic regression, phase retrieval) with general
data covariance. In particular, we demonstrate a deterministic equivalent of
SGD in the form of a system of ordinary differential equations that describes a
wide class of statistics, such as the risk and other measures of
sub-optimality. This equivalence holds with overwhelming probability when the
model parameter count grows proportionally with the number of samples. This
framework allows us to obtain learning rate thresholds for stability of SGD as
well as convergence guarantees. In addition to the deterministic equivalent, we
introduce an SDE with a simplified diffusion coefficient (homogenized SGD)
which allows us to analyze the dynamics of general statistics of SGD iterates.
Finally, we illustrate this theory on some standard examples and show numerical
simulations that give an excellent match to the theory.
Comment: Preliminary version
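As a concrete illustration of the claimed concentration, the sketch below runs one-pass SGD on logistic regression with isotropic Gaussian data at two problem sizes d. The γ/d step-size scaling, the identity covariance, and the teacher-student overlap statistic are simplifying assumptions for illustration, not the paper's general-covariance setting; in rescaled time t = step/d, runs at larger d should fluctuate less around a single deterministic curve.

```python
import numpy as np

def streaming_sgd_logreg(d, gamma, horizon, seed=0):
    # One-pass (streaming) SGD on logistic regression: a fresh sample
    # x ~ N(0, I_d) at every step, labels from a fixed teacher w_star,
    # step size gamma / d.
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)
    w = np.zeros(d)
    overlaps = []
    for _ in range(int(horizon * d)):  # horizon = elapsed rescaled time t = step / d
        x = rng.standard_normal(d)
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x @ w_star)))
        grad = (1.0 / (1.0 + np.exp(-x @ w)) - y) * x  # per-sample logistic gradient
        w -= (gamma / d) * grad
        # Teacher-student overlap: the kind of low-dimensional statistic the
        # ODE system is claimed to track deterministically as d -> infinity.
        overlaps.append(w @ w_star / (np.linalg.norm(w) + 1e-12))
    return np.array(overlaps)

for d in (250, 1000):
    o = streaming_sgd_logreg(d, gamma=1.0, horizon=5.0, seed=1)
    print(f"d={d}: final overlap {o[-1]:.3f}")
```

Homogenized SGD, as described in the abstract, would replace the per-sample gradient noise in this loop with a simplified diffusion term, giving an SDE whose statistics are easier to analyze.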