Distributed Learning for Stochastic Generalized Nash Equilibrium Problems
This work examines a stochastic formulation of the generalized Nash
equilibrium problem (GNEP) where agents are subject to randomness in the
environment whose statistical distribution is unknown. We focus on fully-distributed
online learning by agents and employ penalized individual cost functions to
deal with coupled constraints. Three stochastic gradient strategies are
developed with constant step-sizes. We allow the agents to use heterogeneous
step-sizes and show that the penalty solution is able to approach the Nash
equilibrium in a stable manner within $O(\mu_{\max})$, for a small maximum
step-size value $\mu_{\max}$ and sufficiently large penalty parameters. The
operation of the algorithm is illustrated by considering the network Cournot
competition problem.
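
A minimal sketch of the kind of update such penalized strategies build on is given below, assuming a quadratic penalty on a single coupled inequality constraint g(x) <= 0; the function names, the penalty form, and the Cournot parameters in the usage example are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch: penalty-based stochastic gradient play with heterogeneous
# constant step-sizes (illustrative, not the paper's exact strategies).
import numpy as np

def penalized_sg_play(J_grad, g, g_grad, x0, mu, rho, num_iters=20000):
    """J_grad[k](x, n): stochastic gradient of agent k's cost w.r.t. x[k];
    g(x) <= 0 is the coupled constraint; mu[k] are constant step-sizes;
    rho is the penalty parameter (larger -> closer to feasibility)."""
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        noise = np.random.randn(len(x))  # randomness of unknown distribution
        for k in range(len(x)):
            # gradient of the quadratic penalty (rho/2) * max(g(x), 0)^2
            pen = rho * max(g(x), 0.0) * g_grad[k](x)
            x[k] -= mu[k] * (J_grad[k](x, noise[k]) + pen)
    return x

# Usage: 2-firm Cournot game with a shared capacity constraint sum(x) <= 1.
a, b, c, cap = 2.0, 1.0, 0.2, 1.0
J_grad = [lambda x, n, k=k: -(a - b * (x.sum() + x[k])) + c + 0.1 * n
          for k in range(2)]  # noisy gradient of each firm's negative profit
x_star = penalized_sg_play(J_grad, g=lambda x: x.sum() - cap,
                           g_grad=[lambda x: 1.0, lambda x: 1.0],
                           x0=[0.5, 0.5], mu=[1e-3, 2e-3], rho=10.0)
```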
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Foundation models are first pre-trained on vast unsupervised datasets and
then fine-tuned on labeled data. Reinforcement learning, notably from human
feedback (RLHF), can further align the network with the intended usage. Yet the
imperfections in the proxy reward may hinder the training and lead to
suboptimal results; the diversity of objectives in real-world tasks and human
opinions exacerbate the issue. This paper proposes embracing the heterogeneity
of diverse rewards by following a multi-policy strategy. Rather than focusing
on a single a priori reward, we aim for Pareto-optimal generalization across
the entire space of preferences. To this end, we propose rewarded soup, first
specializing multiple networks independently (one for each proxy reward) and
then interpolating their weights linearly. This succeeds empirically because we
show that the weights remain linearly connected when fine-tuned on diverse
rewards from a shared pre-trained initialization. We demonstrate the
effectiveness of our approach for text-to-text (summarization, Q&A, helpful
assistant, review), text-image (image captioning, text-to-image generation,
visual grounding, VQA), and control (locomotion) tasks. We hope to enhance the
alignment of deep models and improve how they interact with the world in all
its diversity.
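
The interpolation step at the core of this recipe is simple to express; below is a minimal sketch, assuming every network was fine-tuned from the same pre-trained initialization so their state dicts align key-by-key (the function name and preference weights are illustrative, not the paper's code).

```python
# Sketch: linear interpolation of reward-specialized fine-tunes
# (illustrative; assumes identical architectures and parameter names).
import torch

def rewarded_soup(state_dicts, lambdas):
    """Return theta = sum_i lambdas[i] * theta_i over matching parameters."""
    assert abs(sum(lambdas) - 1.0) < 1e-6, "preference weights should sum to 1"
    return {name: sum(lam * sd[name].float()
                      for lam, sd in zip(lambdas, state_dicts))
            for name in state_dicts[0]}

# Usage: trade off two proxy rewards at deployment time, no retraining.
# model.load_state_dict(rewarded_soup([sd_summarize, sd_helpful], [0.7, 0.3]))
```

Sweeping the preference weights over the simplex then traces out an approximation of the Pareto front without any further training.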
Knowledge Distillation Performs Partial Variance Reduction
Knowledge distillation is a popular approach for enhancing the performance of
``student'' models, with lower representational capacity, by taking advantage
of more powerful ``teacher'' models. Despite its apparent simplicity and
widespread use, the underlying mechanics behind knowledge distillation (KD) are
still not fully understood. In this work, we shed new light on the inner
workings of this method, by examining it from an optimization perspective. We
show that, in the context of linear and deep linear models, KD can be
interpreted as a novel type of stochastic variance reduction mechanism. We
provide a detailed convergence analysis of the resulting dynamics, which holds
under standard assumptions for both strongly-convex and non-convex losses,
showing that KD acts as a form of \emph{partial variance reduction}, which can
reduce the stochastic gradient noise, but may not eliminate it completely,
depending on the properties of the ``teacher'' model. Our analysis puts further
emphasis on the need for careful parametrization of KD, in particular w.r.t.
the weighting of the distillation loss, and is validated empirically on both
linear models and deep neural networks.
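
For intuition, here is a minimal sketch of the kind of weighted objective such an analysis considers for (deep) linear models under squared loss; lambda_kd plays the role of the distillation weight whose parametrization matters, and the helper name and loss choice are assumptions rather than the paper's exact setup.

```python
# Sketch: distillation objective with explicit weighting (illustrative).
import torch
import torch.nn.functional as F

def kd_loss(student_out, teacher_out, target, lambda_kd=0.5):
    """Convex combination of supervised loss and distillation loss.
    lambda_kd = 0 recovers plain supervised training; larger values replace
    more of the noisy label signal with the teacher's predictions, which is
    the source of the (partial) variance reduction -- residual teacher error
    keeps the gradient noise from vanishing entirely."""
    return ((1 - lambda_kd) * F.mse_loss(student_out, target)
            + lambda_kd * F.mse_loss(student_out, teacher_out.detach()))
```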