Scaling Matters in Deep Structured-Prediction Models
Deep structured-prediction energy-based models combine the expressive power
of learned representations with the ability to embed knowledge about the
task at hand into the system. A common way to learn the parameters of such
models is a multistage procedure in which different combinations of components
are trained at different stages. The joint end-to-end training of the whole
system is then done as the last fine-tuning stage. This multistage approach is
time-consuming and cumbersome as it requires multiple runs until convergence
and multiple rounds of hyperparameter tuning. It would therefore be
beneficial to train the whole system jointly from the beginning. However,
such approaches often unexpectedly fail and deliver results worse than the
multistage ones. In this paper, we hypothesize that one reason joint
training of deep energy-based models fails is the incorrect relative
normalization of different components in the energy function. We propose online
and offline scaling algorithms that fix the joint training and demonstrate
their efficacy on three different tasks.
Comment: 13 pages