Stochastic Weight Averaging Revisited
Averaging neural network weights sampled by a backbone stochastic gradient descent (SGD) run is a simple yet effective way to help the backbone SGD find optima that generalize better. From a statistical perspective, weight averaging (WA) contributes to variance reduction. Recently, the now well-established stochastic weight averaging (SWA) method was proposed; it is characterized by applying a cyclical or high constant (CHC) learning rate schedule (LRS) to generate the weight samples for WA. This work also introduced a new insight into WA: that it helps discover wider optima, which in turn leads to better generalization. We conduct extensive experimental studies of SWA, involving a dozen modern DNN model structures and a dozen benchmark open-source image, graph, and text datasets. We disentangle the contributions of the WA operation and the CHC LRS in SWA, showing that the WA operation still contributes to variance reduction but does not always lead to wide optima. The experimental results indicate that there are global-scale geometric structures in the DNN loss landscape. We then present an algorithm termed periodic SWA (PSWA), which exploits a series of WA operations to discover these global geometric structures. PSWA outperforms its backbone SGD remarkably, providing experimental evidence for the existence of global geometric structures. Code for reproducing the experimental results is available at
https://github.com/ZJLAB-AMMI/PSWA
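As a rough illustration of the WA operation the abstract refers to, the sketch below maintains a running average of weight samples taken at the end of each epoch of a backbone SGD run. It is a minimal sketch assuming PyTorch; the function names, hyperparameters, and the epoch-end sampling schedule are illustrative assumptions, not the SWA/PSWA implementation from the linked repository.

```python
# Minimal sketch of the weight-averaging (WA) operation used by SWA-style methods.
# Assumes PyTorch; model, loader, and hyperparameters are placeholders.
import copy
import torch

def average_into(swa_model, model, n_averaged):
    """Running average of parameters: w_swa <- (n * w_swa + w) / (n + 1)."""
    for p_swa, p in zip(swa_model.parameters(), model.parameters()):
        p_swa.data.mul_(n_averaged / (n_averaged + 1)).add_(p.data / (n_averaged + 1))

def swa_train(model, loader, loss_fn, epochs=10, swa_start=5, lr=0.05):
    """Run backbone SGD and average the weights sampled after `swa_start` epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    swa_model, n_averaged = copy.deepcopy(model), 0
    for epoch in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        if epoch >= swa_start:          # take one weight sample per epoch for WA
            average_into(swa_model, model, n_averaged)
            n_averaged += 1
    return swa_model  # batch-norm statistics should be recomputed before evaluation
```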