248 research outputs found
Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods
We formulate the problem of neural network optimization as Bayesian
filtering, where the observations are the backpropagated gradients. While
neural network optimization has previously been studied using natural gradient
methods which are closely related to Bayesian inference, they were unable to
recover standard optimizers such as Adam and RMSprop with a root-mean-square
gradient normalizer, instead getting a mean-square normalizer. To recover the
root-mean-square normalizer, we find it necessary to account for the temporal
dynamics of all the other parameters as they are geing optimized. The resulting
optimizer, AdaBayes, adaptively transitions between SGD-like and Adam-like
behaviour, automatically recovers AdamW, a state of the art variant of Adam
with decoupled weight decay, and has generalisation performance competitive
with SGD
GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values
Massive transformer-based models face several challenges, including slow and
computationally intensive pre-training and over-parametrization. This paper
addresses these challenges by proposing a versatile method called GQKVA, which
generalizes query, key, and value grouping techniques. GQKVA is designed to
speed up transformer pre-training while reducing the model size. Our
experiments with various GQKVA variants highlight a clear trade-off between
performance and model size, allowing for customized choices based on resource
and time limitations. Our findings also indicate that the conventional
multi-head attention approach is not always the best choice, as there are
lighter and faster alternatives available. We tested our method on ViT, which
achieved an approximate 0.3% increase in accuracy while reducing the model size
by about 4% in the task of image classification. Additionally, our most
aggressive model reduction experiment resulted in a reduction of approximately
15% in model size, with only around a 1% drop in accuracy
- …