Search CORE

18 research outputs found

Invariances and Data Augmentation for Supervised Music Transcription

Author: Foster Dean
Harchaoui Zaid
Kakade Sham M.
Thickstun John
Publication date: 13/11/2017

This paper explores a variety of models for frame-based music transcription, with an emphasis on the methods needed to reach state-of-the-art on human recordings. The translation-invariant network discussed in this paper, which combines a traditional filterbank with a convolutional neural network, was the top-performing model in the 2017 MIREX Multiple Fundamental Frequency Estimation evaluation. This class of models shares parameters in the log-frequency domain, which exploits the frequency invariance of music to reduce the number of model parameters and avoid overfitting to the training data. All models in this paper were trained with supervision by labeled data from the MusicNet dataset, augmented by random label-preserving pitch-shift transformations.

Comment: 6 pages

arXiv.org e-Print Archive

Crossref

Robust Distortion-free Watermarks for Language Models

Author: Hashimoto Tatsunori
Kuditipudi Rohith
Liang Percy
Thickstun John
Publication date: 28/07/2023

We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers -- which we compute using a randomized watermark key -- to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models -- OPT-1.3B, LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from 35 tokens even after corrupting between 40-50% of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around 25% of the responses -- whose median length is around 100 tokens -- are detectable with $p \leq 0.01$, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
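
The idea of mapping key-derived random numbers to a sample without changing the model's distribution can be illustrated with a minimal Python sketch of an exponential-minimum-style sampling rule. This is an illustration under stated assumptions, not the authors' implementation: the function names `key_stream` and `exp_min_sample` are hypothetical, the "language model" is a fixed toy probability vector, and the alignment-based detection step described in the abstract is not shown.

```python
import math
import random


def key_stream(key, vocab_size, length):
    """Hypothetical helper: derive `length` vectors of uniform numbers from a key.

    Each vector has one entry in (0, 1] per vocabulary item. Anyone who knows
    the key can regenerate exactly the same sequence.
    """
    rng = random.Random(key)
    # 1.0 - rng.random() lies in (0, 1], which avoids log(0) below.
    return [[1.0 - rng.random() for _ in range(vocab_size)] for _ in range(length)]


def exp_min_sample(probs, xi):
    """Pick the index i that minimises -log(xi[i]) / probs[i].

    If xi[i] is uniform, -log(xi[i]) / probs[i] is exponential with rate
    probs[i], so the minimiser is index i with probability probs[i]: the
    choice is a deterministic function of xi, yet distributed like probs.
    """
    best, best_cost = None, math.inf
    for i, (p, u) in enumerate(zip(probs, xi)):
        if p <= 0.0:
            continue  # zero-probability tokens can never be selected
        cost = -math.log(u) / p
        if cost < best_cost:
            best, best_cost = i, cost
    return best


# Toy next-token distribution (an assumption for illustration only).
probs = [0.7, 0.2, 0.1]
stream = key_stream(key=42, vocab_size=len(probs), length=20000)
samples = [exp_min_sample(probs, xi) for xi in stream]
frequency_of_token_0 = samples.count(0) / len(samples)
```

Averaged over keys, each token is drawn with its model probability, which is the sense in which such a scheme leaves the text distribution unchanged; a party holding the key can regenerate `stream` and test whether observed tokens correlate with it.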

arXiv.org e-Print Archive

MAUVE Scores for Generative Models: Theory and Practice

Author: Choi Yejin
Harchaoui Zaid
Liu Lang
Oh Sewoong
Pillutla Krishna
Swayamdipta Swabha
Thickstun John
Welleck Sean
Zellers Rowan
Publication date: 07/12/2023

Generative artificial intelligence has made significant strides, producing text indistinguishable from human prose and remarkably photorealistic images. Automatically measuring how close the generated data distribution is to the target distribution is central to diagnosing existing models and developing better ones. We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images. These scores are statistical summaries of divergence frontiers capturing two types of errors in generative modeling. We explore three approaches to statistically estimate these scores: vector quantization, non-parametric estimation, and classifier-based estimation. We provide statistical bounds for the vector quantization approach. Empirically, we find that the proposed scores paired with a range of $f$-divergences and statistical estimation methods can quantify the gaps between the distributions of human-written text and those of modern neural language models by correlating with human judgments and identifying known properties of the generated texts. We demonstrate in the vision domain that MAUVE can identify known properties of generated images on par with or better than existing metrics. In conclusion, we present practical recommendations for using MAUVE effectively with language and image modalities.

Comment: Published in Journal of Machine Learning Research
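
To make the notion of a divergence frontier concrete, here is a small Python sketch for two discrete distributions. It is a sketch under assumptions, not the MAUVE library's API: the names `kl` and `divergence_frontier` are hypothetical, the KL divergence stands in for the general family of $f$-divergences, and the estimation steps mentioned in the abstract (vector quantization, non-parametric or classifier-based estimation) are omitted, so the inputs are assumed to be already-known probability vectors.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p[i] == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def divergence_frontier(p, q, lambdas):
    """Return one (KL(q || r), KL(p || r)) pair per mixture weight.

    For each lam strictly between 0 and 1, r = lam * p + (1 - lam) * q is a
    mixture of the two distributions. The two coordinates measure how far
    each distribution is from the mixture, giving one error of each type.
    """
    points = []
    for lam in lambdas:
        if not 0.0 < lam < 1.0:
            raise ValueError("mixture weights must lie strictly between 0 and 1")
        r = [lam * pi + (1.0 - lam) * qi for pi, qi in zip(p, q)]
        points.append((kl(q, r), kl(p, r)))
    return points


# Toy distributions (assumed for illustration only).
p = [0.9, 0.1]
q = [0.1, 0.9]
frontier = divergence_frontier(p, q, [0.1, 0.5, 0.9])
identical = divergence_frontier([0.5, 0.5], [0.5, 0.5], [0.25, 0.5, 0.75])
```

When the two distributions coincide, every frontier point is the origin; as they move apart, the curve moves away from it, and a scalar score can then summarise the curve.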

arXiv.org e-Print Archive

Evaluating Human-Language Model Interaction

Author: Bernstein Michael
Bommasani Rishi
Cao Hancheng
Durmus Esin
Gerard-Ursin Ines
Hardy Amelia
Kwon Minae
Ladhak Faisal
Lee Mina
Lee Tony
Li Xiang Lisa
Liang Percy
Paranjape Ashwin
Park Joon Sung
Rong Frieda
Srivastava Megha
Thickstun John
Wang Rose E.
Publication date: 10/09/2023

Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.

Comment: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI)

arXiv.org e-Print Archive