17 research outputs found

    Invariances and Data Augmentation for Supervised Music Transcription

    Full text link
    This paper explores a variety of models for frame-based music transcription, with an emphasis on the methods needed to reach state-of-the-art performance on human recordings. The translation-invariant network discussed in this paper, which combines a traditional filterbank with a convolutional neural network, was the top-performing model in the 2017 MIREX Multiple Fundamental Frequency Estimation evaluation. This class of models shares parameters in the log-frequency domain, which exploits the frequency invariance of music to reduce the number of model parameters and avoid overfitting to the training data. All models in this paper were trained with supervision by labeled data from the MusicNet dataset, augmented by random label-preserving pitch-shift transformations. Comment: 6 pages
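
    On a log-frequency axis a pitch shift is just a translation of spectrogram bins, so the augmentation can roll the spectrogram and its piano-roll labels together. A minimal sketch of this idea, not the paper's code -- the function name, the piano-roll label format, and the bins-per-semitone parameter are all assumptions:

        import numpy as np

        def pitch_shift_augment(logspec, labels, max_shift=5, bins_per_semitone=1, rng=None):
            # Label-preserving pitch shift: roll a log-frequency spectrogram
            # (freq_bins, frames) by a random number of semitones and shift the
            # (frames, 128) binary piano-roll labels by the same amount.
            rng = rng or np.random.default_rng()
            shift = int(rng.integers(-max_shift, max_shift + 1))  # semitones
            if shift == 0:
                return logspec, labels
            spec = np.roll(logspec, shift * bins_per_semitone, axis=0)
            rolled = np.roll(labels, shift, axis=1)
            if shift > 0:  # zero the bins/pitches that wrapped around
                spec[:shift * bins_per_semitone] = 0.0
                rolled[:, :shift] = 0
            else:
                spec[shift * bins_per_semitone:] = 0.0
                rolled[:, shift:] = 0
            return spec, rolled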

    Robust Distortion-free Watermarks for Language Models

    Full text link
    We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text, up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers -- which we compute using a randomized watermark key -- to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models -- OPT-1.3B, LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text (p ≤ 0.01) from 35 tokens even after corrupting between 40-50% of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around 25% of the responses -- whose median length is around 100 tokens -- are detectable with p ≤ 0.01, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
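
    The exponential minimum sampling scheme named above has a compact core: keyed per-position random numbers decide which token wins a Gumbel-style race, which is marginally an exact sample from the model's distribution, and anyone holding the key can re-derive the same numbers to score a text. A simplified sketch under stated assumptions -- the arithmetic PRNG seed stands in for a keyed cryptographic hash, and the edit-robust alignment step of the detector is omitted:

        import numpy as np

        def _key_randoms(key, t, vocab_size):
            # Stand-in for a keyed hash: derive per-position uniforms from an
            # integer key so generation and detection see the same numbers.
            rng = np.random.default_rng((key + 7919 * t) % 2**32)
            return rng.random(vocab_size)

        def watermarked_sample(probs, key, t):
            # Exponential-minimum (Gumbel-trick) step: argmax of u_i^(1/p_i)
            # is an exact sample from probs, so the text distribution is unchanged.
            u = _key_randoms(key, t, len(probs))
            return int(np.argmax(u ** (1.0 / np.maximum(probs, 1e-12))))

        def detect_score(tokens, key, vocab_size):
            # Watermarked tokens tend to land where u is large, inflating the score.
            return sum(-np.log(1.0 - _key_randoms(key, t, vocab_size)[tok])
                       for t, tok in enumerate(tokens))

    On unwatermarked text each term of the score is Exp(1)-distributed, so the sum over n tokens is Gamma(n, 1) and yields a p-value.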

    Evaluating Human-Language Model Interaction

    Full text link
    Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and the dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation. Comment: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI).
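
    To make the contrast with non-interactive evaluation concrete, an interaction record in this style might keep the whole session alongside the single final output a benchmark would score. All field names below are illustrative, not HALIE's actual schema:

        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class InteractionTrace:
            # One user-LM session; interactive evaluation keeps the process,
            # the first-person experience, and preference, not just the output.
            task: str                                             # e.g. "crossword"
            user_inputs: List[str] = field(default_factory=list)  # the process
            model_outputs: List[str] = field(default_factory=list)
            final_output: str = ""   # all a non-interactive benchmark sees
            quality: int = 0         # third-party assessment, 1-5
            enjoyment: int = 0       # first-person preference, 1-5
            ownership: int = 0       # first-person preference, 1-5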

    Leveraging Generative Models for Music and Signal Processing

    No full text
    Thesis (Ph.D.)--University of Washington, 2021.
    Generative models can serve as a powerful primitive for creative interaction with data. Generative models give us the ability to synthesize or re-synthesize multimedia; conditional generative modeling empowers us to control the outputs of these models. By steering a generative model with conditioning information, we can sketch the essential aspects of our creative vision, and the generative model will fill in the details. This dissertation explores the possibilities of generative modeling as a creative tool, with a focus on applications to music and audio. The dissertation proceeds in three parts:
    1. We develop algorithms and evaluation metrics for aligning musical scores to audio. Alignments provide us with a dense set of labels on musical audio that we can use to supervise conditional generation tasks such as transcription: synthesis of a musical score conditioned on an audio performance. This work on alignments leads to the construction of MusicNet: a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording and the instrument that plays it. We use this dataset to train state-of-the-art music transcription models for the MIREX Multiple Fundamental Frequency Estimation task.
    2. We construct autoregressive generative models of musical scores, which exploit invariances in the structure of music. Whereas most recent work on music modeling has represented music as an ordered sequence of notes, we explore an alternative representation of music as a multi-dimensional tensor. We consider a variety of factorizations of the joint distribution over these tensors. We then turn our attention to discriminative modeling of scores using this tensor representation. We construct a classifier that can reliably identify the composer of a classical musical score. Our methods, which operate on the generic tensor score representation, outperform previously reported results using SVM and kNN classifiers with handcrafted features specialized for the composer classification task.
    3. We develop a sampling algorithm for likelihood-based models that allows us to steer an unconditional generative model using conditioning information. We work within a Bayesian posterior sampling framework, using a pre-trained unconditional generative model as a prior to sample from the posterior distribution of a conditional likelihood. Samples are obtained using noise-annealed Langevin dynamics to construct a Markov chain that approximates a sample from the posterior distribution. We develop these ideas for a variety of models and applications, including source separation, in both the visual and audio domains.
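
    Part 3's sampler is concrete enough to sketch: anneal the noise level from high to low and, at each level, take Langevin steps whose drift is the sum of the prior's score and the gradient of the conditional log-likelihood. A minimal sketch assuming PyTorch and caller-supplied gradient functions; the names and step-size schedule are illustrative:

        import torch

        def annealed_langevin_posterior(score_prior, grad_log_lik, x0, sigmas,
                                        steps=100, eps=2e-5):
            # Posterior score: grad log p(x|y) = grad log p(x) + grad log p(y|x).
            # score_prior(x, sigma) is the pre-trained unconditional model's score;
            # grad_log_lik(x, sigma) is the conditional likelihood's gradient.
            x = x0.clone()
            for sigma in sigmas:  # anneal noise from high to low
                eta = eps * (sigma / sigmas[-1]) ** 2  # smaller steps at low noise
                for _ in range(steps):
                    noise = (2.0 * eta) ** 0.5 * torch.randn_like(x)
                    x = x + eta * (score_prior(x, sigma) + grad_log_lik(x, sigma)) + noise
            return x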