17 research outputs found
Invariances and Data Augmentation for Supervised Music Transcription
This paper explores a variety of models for frame-based music transcription,
with an emphasis on the methods needed to reach state-of-the-art on human
recordings. The translation-invariant network discussed in this paper, which
combines a traditional filterbank with a convolutional neural network, was the
top-performing model in the 2017 MIREX Multiple Fundamental Frequency
Estimation evaluation. This class of models shares parameters in the
log-frequency domain, which exploits the frequency invariance of music to
reduce the number of model parameters and avoid overfitting to the training
data. All models in this paper were trained with supervision by labeled data
from the MusicNet dataset, augmented by random label-preserving pitch-shift
transformations.
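The label-preserving pitch-shift augmentation described above can be sketched as a roll of the log-frequency spectrogram together with a matching shift of the pitch labels. This is a minimal NumPy illustration, not the paper's implementation: the function names, the bins-per-semitone resolution, and the zero-padding at the rolled edge are all assumptions.

```python
import numpy as np

def pitch_shift(log_spec, labels, semitones, bins_per_semitone=4):
    """Label-preserving pitch shift: roll a (freq_bins x frames)
    log-frequency spectrogram by a whole number of semitones and
    shift the pitch labels by the same amount. Illustrative sketch."""
    shift = semitones * bins_per_semitone
    shifted = np.roll(log_spec, shift, axis=0)
    # Zero out the bins that wrapped around rather than letting them alias.
    if shift > 0:
        shifted[:shift, :] = 0.0
    elif shift < 0:
        shifted[shift:, :] = 0.0
    return shifted, labels + semitones

# Usage: shift a spectrogram excerpt up by one semitone.
spec = np.random.rand(264, 100)          # hypothetical log-frequency input
labels = np.array([60, 64, 67])          # MIDI pitches active in the excerpt
aug_spec, aug_labels = pitch_shift(spec, labels, semitones=1)
```

Because a pitch shift moves energy by a fixed offset in the log-frequency domain, a network that shares parameters across that axis (a convolution) sees shifted and unshifted excerpts as near-identical inputs, which is the frequency invariance the abstract exploits.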
Robust Distortion-free Watermarks for Language Models
We propose a methodology for planting watermarks in text from an
autoregressive language model that are robust to perturbations without changing
the distribution over text up to a certain maximum generation budget. We
generate watermarked text by mapping a sequence of random numbers -- which we
compute using a randomized watermark key -- to a sample from the language
model. To detect watermarked text, any party who knows the key can align the
text to the random number sequence. We instantiate our watermark methodology
with two sampling schemes: inverse transform sampling and exponential minimum
sampling. We apply these watermarks to three language models -- OPT-1.3B,
LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power
and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B
and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from 35 tokens even after corrupting between 40-50\% of the tokens
via random edits (i.e., substitutions, insertions or deletions). For the
Alpaca-7B model, we conduct a case study on the feasibility of watermarking
responses to typical user instructions. Due to the lower entropy of the
responses, detection is more difficult: around 25\% of the responses -- whose
median length is around 100 tokens -- are detectable with $p \leq 0.01$, and
the watermark is also less robust to certain automated paraphrasing attacks we
implement.
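The generation and detection mechanics can be sketched with the exponential minimum sampling scheme: keyed uniform random numbers deterministically select each token, yet the selected token is a faithful sample from the model's distribution. Everything below is a toy illustration under stated assumptions: the hash-based key schedule stands in for a proper pseudorandom function, the "language model" is a fixed categorical distribution, and the alignment step that gives the real watermark its robustness to insertions and deletions is omitted.

```python
import numpy as np

def keyed_uniforms(key, step, vocab_size):
    """Derive the step-t block of the watermark key sequence.
    Hypothetical key schedule: hash() is a stand-in for a real PRF
    (it is only stable within a single process)."""
    rng = np.random.default_rng(hash((key, step)) % (2**32))
    return rng.random(vocab_size)

def watermark_sample(probs, key, step):
    """Exponential minimum sampling: with r_i uniform, the index
    minimizing -log(r_i)/p_i is distributed exactly as probs, yet
    is fully determined by the shared key."""
    r = keyed_uniforms(key, step, len(probs))
    return int(np.argmin(-np.log(r) / np.maximum(probs, 1e-12)))

def detect_score(tokens, key, vocab_size):
    """A key holder scores a text: watermarked tokens land on
    unusually large r values, so the average score rises well
    above the ~1.0 expected for unrelated text."""
    return sum(-np.log(1.0 - keyed_uniforms(key, t, vocab_size)[tok])
               for t, tok in enumerate(tokens)) / len(tokens)

# Usage: watermarked text scores far above independently chosen tokens.
vocab, key = 100, "secret"
probs = np.full(vocab, 1.0 / vocab)       # toy uniform "language model"
marked = [watermark_sample(probs, key, t) for t in range(200)]
unmarked = list(np.random.default_rng(0).integers(0, vocab, 200))
```

The distortion-free property shows up in `watermark_sample`: marginalizing over the key, each token is an exact sample from `probs`, so the watermark does not change the distribution over text.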
Evaluating Human-Language Model Interaction
Many real-world applications of language models (LMs), such as writing
assistance and code autocomplete, involve human-LM interaction. However, most
benchmarks are non-interactive in that a model produces output without human
involvement. To evaluate human-LM interaction, we develop a new framework,
Human-AI Language-based Interaction Evaluation (HALIE), that defines the
components of interactive systems and dimensions to consider when designing
evaluation metrics. Compared to standard, non-interactive evaluation, HALIE
captures (i) the interactive process, not only the final output; (ii) the
first-person subjective experience, not just a third-party assessment; and
(iii) notions of preference beyond quality (e.g., enjoyment and ownership). We
then design five tasks to cover different forms of interaction: social
dialogue, question answering, crossword puzzles, summarization, and metaphor
generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3
and AI21 Labs' Jurassic-1), we find that better non-interactive performance
does not always translate to better human-LM interaction. In particular, we
highlight three cases where the results from non-interactive and interactive
metrics diverge and underscore the importance of human-LM interaction for LM
evaluation. Comment: Authored by the Center for Research on Foundation Models (CRFM) at
the Stanford Institute for Human-Centered Artificial Intelligence (HAI).
Leveraging Generative Models for Music and Signal Processing
Thesis (Ph.D.)--University of Washington, 2021

Generative models can serve as a powerful primitive for creative interaction with data. They give us the ability to synthesize or re-synthesize multimedia, and conditional generative modeling empowers us to control the outputs of these models. By steering a generative model with conditioning information, we can sketch the essential aspects of our creative vision, and the generative model will fill in the details. This dissertation explores the possibilities of generative modeling as a creative tool, with a focus on applications to music and audio. The dissertation proceeds in three parts:

1. We develop algorithms and evaluation metrics for aligning musical scores to audio. Alignments provide us with a dense set of labels on musical audio that we can use to supervise conditional generation tasks such as transcription: synthesis of a musical score conditioned on an audio performance. This work on alignments leads to the construction of MusicNet: a collection of 330 freely licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording and the instrument that plays it. We use this dataset to train state-of-the-art music transcription models for the MIREX Multiple Fundamental Frequency Estimation task.

2. We construct autoregressive generative models of musical scores, which exploit invariances in the structure of music. Whereas most recent work on music modeling has represented music as an ordered sequence of notes, we explore an alternative representation of music as a multi-dimensional tensor, and we consider a variety of factorizations of the joint distribution over these tensors. We then turn our attention to discriminative modeling of scores using this tensor representation: we construct a classifier that can reliably identify the composer of a classical musical score. Our methods, which operate on the generic tensor score representation, outperform previously reported results using SVM and kNN classifiers with handcrafted features specialized for the composer classification task.

3. We develop a sampling algorithm for likelihood-based models that allows us to steer an unconditional generative model using conditioning information. We work within a Bayesian posterior sampling framework, using a pre-trained unconditional generative model as a prior to sample from the posterior distribution of a conditional likelihood. Samples are obtained using noise-annealed Langevin dynamics, which constructs a Markov chain that approximates a sample from the posterior distribution. We develop these ideas for a variety of models and applications, including source separation, in both the visual and audio domains.
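The noise-annealed Langevin posterior sampling of part 3 can be illustrated on a toy Gaussian problem where the exact posterior is known in closed form. This is a sketch under stated assumptions: the prior and likelihood scores, the noise schedule, and the step sizes are illustrative choices, not the dissertation's settings, and a real application would replace the analytic prior score with one from a pre-trained generative model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: prior x ~ N(0, I); observation y = x + noise, noise ~ N(0, s2*I).
# The posterior is Gaussian with mean y/(1+s2), so we can check the sampler.
s2 = 0.5
y = np.array([2.0, -1.0])

def prior_score(x):             # grad_x log N(x; 0, I)
    return -x

def lik_score(x):               # grad_x log N(y; x, s2*I)
    return (y - x) / s2

def langevin_posterior(n_chains=2000, sigmas=(1.0, 0.5, 0.2, 0.1), steps=100):
    """Noise-annealed Langevin dynamics: at each noise level, take Langevin
    steps whose drift is the sum of the prior and likelihood scores, with a
    step size that shrinks as the noise level decreases."""
    x = rng.standard_normal((n_chains, 2))
    for sigma in sigmas:
        eps = 0.05 * sigma**2                   # illustrative schedule
        for _ in range(steps):
            grad = prior_score(x) + lik_score(x)
            x = x + 0.5 * eps * grad + np.sqrt(eps) * rng.standard_normal(x.shape)
    return x

samples = langevin_posterior()
exact_mean = y / (1.0 + s2)     # closed-form posterior mean for this toy pair
```

The key design point mirrored here is that the unconditional model only contributes through its score, `prior_score`, so swapping in a different conditioning task (e.g., source separation) only changes `lik_score`.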