Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
While large language models have proven effective in a wide range of
downstream applications, they often generate text that is problematic or lacks
a desired attribute. In this paper, we introduce Reward-Augmented Decoding
(RAD), a text generation procedure that uses a small unidirectional reward
model to encourage a language model to generate text that has certain
properties. Specifically, RAD uses the reward model to score generations as
they are produced and rescales sampling probabilities to favor high-reward
tokens. By using a unidirectional reward model, RAD can cache activations from
prior generation steps to decrease computational overhead. Through experiments
on generating non-toxic and sentiment-controlled text, we demonstrate that RAD
performs best among methods that change only the generation procedure and
matches the performance of state-of-the-art methods that involve re-training
the language model. We further validate that RAD is effective on very large
language models while incurring a minimal computational overhead.
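The rescaling step described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the per-candidate reward scores and the steering weight `beta` are assumptions standing in for the unidirectional reward model's outputs and RAD's actual rescaling rule.

```python
import numpy as np

def reward_augmented_sample(logits, candidate_rewards, beta=1.0, rng=None):
    """Steer next-token sampling toward high-reward tokens.

    `logits` are the language model's scores for each candidate token;
    `candidate_rewards` are hypothetical reward-model scores for the
    corresponding continuations. We shift the logits by `beta` times the
    rewards before softmax sampling.
    """
    rng = rng or np.random.default_rng(0)
    steered = logits + beta * candidate_rewards
    probs = np.exp(steered - steered.max())   # stable softmax
    probs /= probs.sum()
    token = int(rng.choice(len(probs), p=probs))
    return token, probs

# Three candidate tokens: the LM prefers token 0, but the reward model
# (e.g. a non-toxicity scorer) strongly prefers token 1.
logits = np.array([2.0, 1.0, 0.5])
rewards = np.array([0.0, 0.9, 0.1])
token, probs = reward_augmented_sample(logits, rewards, beta=5.0)
```

With a large enough `beta`, the reward term dominates and probability mass shifts to the reward model's preferred token; with `beta=0` sampling reduces to the base language model.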
Merging by Matching Models in Task Parameter Subspaces
Model merging aims to cheaply combine individual task-specific models into a
single multitask model. In this work, we view past merging methods as
leveraging different notions of a "task parameter subspace" in which models
are matched before being merged. We connect the task parameter subspace of a
given model to its loss landscape and formalize how this approach to model
merging can be seen as solving a linear system of equations. While past work
has generally been limited to linear systems that have a closed-form solution,
we consider using the conjugate gradient method to find a solution. We show
that using the conjugate gradient method can outperform closed-form solutions,
enables merging via linear systems that are otherwise intractable to solve, and
flexibly allows choosing from a wide variety of initializations and estimates
for the "task parameter subspace". We ultimately demonstrate that our merging
framework, called "Matching Models in their Task Parameter Subspace" (MaTS),
achieves state-of-the-art results in multitask and intermediate-task model
merging. We release all of the code and checkpoints used in our work at
https://github.com/r-three/mats.
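The linear-system view above can be made concrete with a small sketch. Here two "task models" theta1 and theta2 are merged by solving (A1 + A2) theta = A1 theta1 + A2 theta2 with a plain conjugate gradient solver; the per-task matrices A_i (e.g. Fisher-style approximations defining the matching subspace) and the tiny problem size are illustrative assumptions, not MaTS's actual configuration.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=100):
    """Textbook conjugate gradient for a symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)

def spd(n):
    """Random symmetric positive-definite matrix (a stand-in for a
    per-task subspace matrix such as a Fisher approximation)."""
    m = rng.normal(size=(n, n))
    return m @ m.T + n * np.eye(n)

A1, A2 = spd(4), spd(4)
theta1, theta2 = rng.normal(size=4), rng.normal(size=4)

# Merge by matching in the task parameter subspaces: solve the combined
# linear system instead of relying on a closed-form solution.
merged = conjugate_gradient(A1 + A2, A1 @ theta1 + A2 @ theta2)
```

Because CG only needs matrix-vector products, the same loop applies when A1 + A2 is too large or ill-structured to invert in closed form, which is the regime the abstract highlights.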
Soft Merging of Experts with Adaptive Routing
Sparsely activated neural networks with conditional computation learn to
route their inputs through different "expert" subnetworks, providing a form of
modularity that densely activated models lack. Despite their possible benefits,
models with learned routing often underperform their parameter-matched densely
activated counterparts as well as models that use non-learned heuristic routing
strategies. In this paper, we hypothesize that these shortcomings stem from the
gradient estimation techniques used to train sparsely activated models that use
non-differentiable discrete routing decisions. To address this issue, we
introduce Soft Merging of Experts with Adaptive Routing (SMEAR), which avoids
discrete routing by using a single "merged" expert constructed via a weighted
average of all of the experts' parameters. By routing activations through a
single merged expert, SMEAR does not incur a significant increase in
computational costs and enables standard gradient-based training. We
empirically validate that models using SMEAR outperform models that route based
on metadata or learn sparse routing through gradient estimation. Furthermore,
we provide qualitative analysis demonstrating that the experts learned via
SMEAR exhibit a significant amount of specialization. All of the code used in
our experiments is publicly available.
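The merged-expert idea can be sketched as follows. This is a minimal illustration with single-matrix "experts" and a toy router; real SMEAR merges the parameters of full expert modules, and the names here are assumptions.

```python
import numpy as np

def smear_forward(x, expert_weights, router_logits):
    """Soft merging of experts: average the experts' parameter matrices
    with the router's probabilities, then apply the single merged expert.

    Because the router probabilities enter through an ordinary weighted
    average (no discrete selection), gradients flow to the router with
    standard backpropagation, and only one expert-sized matmul is needed.
    """
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()
    merged = sum(p * W for p, W in zip(probs, expert_weights))
    return x @ merged

# Two toy experts; a uniform router should yield their average.
experts = [np.eye(3), 2.0 * np.eye(3)]
x = np.array([1.0, 1.0, 1.0])
out = smear_forward(x, experts, router_logits=np.array([0.0, 0.0]))
```

With equal router logits the merged expert is 1.5 times the identity, so the cost of the forward pass matches a single dense expert regardless of how many experts exist.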
Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching
Sequences of feature vectors are a natural way of representing temporal data. Given a database of sequences, a fundamental task is to find the database entry which is the most similar to a query. In this thesis, we present learning-based methods for efficiently and accurately comparing sequences in order to facilitate large-scale sequence search. Throughout, we will focus on the problem of matching MIDI files (a digital score format) to a large collection of audio recordings of music. The combination of our proposed approaches enables us to create the largest corpus of paired MIDI files and audio recordings ever assembled.
Dynamic time warping (DTW) has proven to be an extremely effective method for both aligning and matching sequences. However, its performance is heavily affected by factors such as the feature representation used and its adjustable parameters. We therefore investigate automatically optimizing DTW-based alignment and matching of MIDI and audio data. Our approach uses Bayesian optimization to tune system design and parameters over a synthetically-created dataset of audio and MIDI pairs. We then perform an exhaustive search over DTW score normalization techniques to find the optimal method for reporting a reliable alignment confidence score, as required in matching tasks. This results in a DTW-based system which is conceptually simple and highly accurate at both alignment and matching. We also verify that this system achieves high performance in a large-scale qualitative evaluation of real-world alignments.
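The DTW recurrence underlying this paragraph can be sketched in a few lines. This is the textbook version with a Euclidean local cost; the thesis's tuned system adds additive/multiplicative penalties, path constraints, and score normalization that are not shown here.

```python
import numpy as np

def dtw_cost(X, Y):
    """Minimum cumulative alignment cost between two sequences of
    feature vectors under dynamic time warping.

    D[i, j] holds the cheapest cost of aligning X[:i] with Y[:j];
    each cell extends the best of a match, an insertion, or a deletion.
    """
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A sequence and a time-stretched copy: the repeated frame adds no cost.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])
cost = dtw_cost(a, b)
```

The O(nm) table filled by this loop is exactly the quadratic cost that the later paragraphs work to avoid for large-scale search.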
Unfortunately, DTW can be far too inefficient for large-scale search when sequences are very long and consist of high-dimensional feature vectors. We therefore propose a method for mapping sequences of continuously-valued feature vectors to downsampled sequences of binary vectors. Our approach involves training a pair of convolutional networks to map paired groups of subsequent feature vectors to a Hamming space where similarity is preserved. Evaluated on the task of matching MIDI files to a large database of audio recordings, we show that this technique enables 99.99% of the database to be discarded with a modest false reject rate while requiring only 0.2% of the computation time.
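The binarization step can be sketched as follows. As a hedge: the projection here is random, purely a stand-in for the trained convolutional networks, and the code sizes are illustrative; the point is only that once vectors live in a Hamming space, comparison reduces to cheap bit counting.

```python
import numpy as np

def binarize(features, projection):
    """Map real-valued feature vectors to binary codes by thresholding a
    linear projection at zero (the thesis learns this mapping with a pair
    of convolutional networks; a random projection stands in here)."""
    return (features @ projection > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance: the number of mismatched bits."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
P = rng.normal(size=(8, 16))           # 8-dim features -> 16-bit codes
query = rng.normal(size=8)
database = rng.normal(size=(3, 8))
codes = binarize(np.vstack([query, database]), P)

# Distances from the query code to each database code can now be computed
# with bitwise comparisons instead of continuous-valued DTW.
dists = [hamming(codes[0], c) for c in codes[1:]]
```

Pruning then amounts to discarding every database entry whose Hamming distance to the query code exceeds a threshold, which is what allows 99.99% of candidates to be rejected before any DTW is run.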
Even when sped up with a more efficient representation, the quadratic complexity of DTW greatly hinders its feasibility for very large-scale search. This cost can be avoided by mapping entire sequences to fixed-length vectors in an embedded space where sequence similarity is approximated by Euclidean distance. To achieve this embedding, we propose a feed-forward attention-based neural network model which can integrate arbitrarily long sequences. We show that this approach can prune 90% of our audio recording database extremely efficiently and with high confidence.
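The feed-forward attention pooling that produces a fixed-length embedding can be sketched as below. The scoring vector `w` is a stand-in for trained parameters, and real models would apply further learned layers before and after pooling; only the length-independence of the output is the point here.

```python
import numpy as np

def attention_embed(sequence, w):
    """Feed-forward attention pooling: score every timestep with a
    learned vector `w`, softmax the scores into attention weights, and
    return the weighted mean of the timesteps. The result has a fixed
    dimensionality no matter how long the input sequence is."""
    scores = sequence @ w
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    return alphas @ sequence

rng = np.random.default_rng(0)
w = rng.normal(size=5)
short = rng.normal(size=(10, 5))      # 10 timesteps of 5-dim features
longer = rng.normal(size=(1000, 5))   # 1000 timesteps, same feature dim
e1 = attention_embed(short, w)
e2 = attention_embed(longer, w)
```

Because both embeddings land in the same space, whole-sequence similarity becomes a single Euclidean distance between fixed-length vectors, which is what makes pruning most of the database nearly free.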
After developing these approaches, we applied them together to the practical task of matching 178,561 unique MIDI files to the Million Song Dataset. The resulting "Lakh MIDI Dataset" provides a potential bounty of ground truth information for audio content-based music information retrieval, including transcription, meter, lyrics, and high-level musicological features. The reliability of the resulting annotations depends both on the quality of the transcription and the accuracy of the score-to-audio alignment. We therefore establish a baseline of reliability for score-derived information for different content-based MIR tasks. Finally, we discuss potential future uses of our dataset and the learning-based sequence comparison methods we developed.