Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution
Despite their groundbreaking performance for many generative modeling tasks,
diffusion models have fallen short on discrete data domains such as natural
language. Crucially, standard diffusion models rely on the well-established
theory of score matching, but efforts to generalize this to discrete structures
have not yielded the same empirical gains. In this work, we bridge this gap by
proposing score entropy, a novel discrete score matching loss that is more
stable than existing methods, forms an ELBO for maximum likelihood training,
and can be efficiently optimized with a denoising variant. We scale our Score
Entropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2,
achieving highly competitive likelihoods while also introducing distinct
algorithmic advantages. In particular, when comparing similarly sized SEDD and
GPT-2 models, SEDD attains comparable perplexities, sometimes outperforming
the baseline. Furthermore, SEDD models learn a more faithful sequence
distribution (better than GPT-2 models with ancestral sampling, as measured by
large models), can trade off compute for generation quality (matching GPT-2
with fewer network evaluations), and enable arbitrary infilling beyond
standard left-to-right prompting.
Comment: 30 pages.
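
As a reading aid (our illustration, not code from the paper): a minimal Python sketch of the pointwise score entropy term, assuming the model outputs an estimate s of a true probability ratio r = p(y)/p(x); the diffusion-specific weighting across positions and the denoising variant described in the paper are omitted.

import numpy as np

def score_entropy(s, r):
    """Pointwise score entropy between a model estimate s and the
    true probability ratio r = p(y)/p(x). Convex in s, nonnegative,
    and minimized exactly at s = r."""
    return s - r * np.log(s) + r * (np.log(r) - 1.0)

# Toy check: for a true ratio r = 0.25, the loss is smallest when
# the model's estimate equals r.
r = 0.25
grid = np.linspace(0.01, 1.0, 200)
print(grid[np.argmin(score_entropy(grid, r))])  # ~0.25

The constant term r * (log r - 1) does not affect the minimizer; it only shifts the loss so that its minimum value is zero.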
Concrete Score Matching: Generalized Score Matching for Discrete Data
Representing probability distributions by the gradient of their density
functions has proven effective in modeling a wide range of continuous data
modalities. However, this representation is not applicable in discrete domains
where the gradient is undefined. To this end, we propose an analogous score
function called the "Concrete score", a generalization of the (Stein) score for
discrete settings. Given a predefined neighborhood structure, the Concrete
score of any input is defined by the rate of change of the probabilities with
respect to local directional changes of the input. This formulation allows us
to recover the (Stein) score in continuous domains when measuring such changes
by the Euclidean distance, while using the Manhattan distance leads to our
novel score function in discrete domains. Finally, we introduce a new framework
to learn such scores from samples called Concrete Score Matching (CSM), and
propose an efficient training objective to scale our approach to high
dimensions. Empirically, we demonstrate the efficacy of CSM on density
estimation tasks on a mixture of synthetic, tabular, and high-dimensional image
datasets, showing that it performs favorably relative to existing baselines
for modeling discrete data.
Comment: First two authors contributed equally.
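
For intuition (a hedged sketch, not the paper's code): a small Python example of the Concrete score on a toy joint distribution over three binary variables, where we assume the neighborhood of an input x is the set of single-bit flips, so each score entry is the relative rate of change (p(n_i(x)) - p(x)) / p(x).

import numpy as np

# Hypothetical joint pmf over three binary variables.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 2, 2))
pmf = np.exp(logits) / np.exp(logits).sum()

def concrete_score(x, pmf):
    """Concrete score of x under the bit-flip neighborhood: one
    entry per neighbor, each the rate of change of the probability
    relative to p(x)."""
    scores = []
    for i in range(len(x)):
        y = list(x)
        y[i] = 1 - y[i]  # local directional change: flip coordinate i
        scores.append((pmf[tuple(y)] - pmf[tuple(x)]) / pmf[tuple(x)])
    return np.array(scores)

print(concrete_score((0, 1, 0), pmf))

In continuous domains, the analogous limit of such difference quotients recovers the (Stein) score, which is the correspondence the paper formalizes.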
Generalizing Bayesian Optimization with Decision-theoretic Entropies
Bayesian optimization (BO) is a popular method for efficiently inferring
optima of an expensive black-box function via a sequence of queries. Existing
information-theoretic BO procedures aim to make queries that most reduce the
uncertainty about optima, where the uncertainty is captured by Shannon entropy.
However, an optimal measure of uncertainty would, ideally, factor in how we
intend to use the inferred quantity in some downstream procedure. In this
paper, we instead consider a generalization of Shannon entropy from work in
statistical decision theory (DeGroot 1962, Rao 1984), which contains a broad
class of uncertainty measures parameterized by a problem-specific loss function
corresponding to a downstream task. We first show that special cases of this
entropy lead to popular acquisition functions used in BO procedures such as
knowledge gradient, expected improvement, and entropy search. We then show how
alternative choices for the loss yield a flexible family of acquisition
functions that can be customized for use in novel optimization settings.
Additionally, we develop gradient-based methods to efficiently optimize our
proposed family of acquisition functions, and demonstrate strong empirical
performance on a diverse set of sequential decision-making tasks, including
variants of top-k optimization, multi-level set estimation, and sequence
search.
Comment: Appears in Proceedings of the 36th Conference on Neural Information
Processing Systems (NeurIPS 2022).
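
To make the generalized entropy concrete (a toy sketch under our own assumptions, not the authors' implementation): the decision-theoretic entropy H_ell[p] = inf_a E_{theta ~ p}[ loss(theta, a) ] can be estimated by Monte Carlo over posterior samples plus a grid search over candidate actions. With squared-error loss it reduces to the posterior variance; swapping in other losses yields acquisition functions like those named above.

import numpy as np

def h_ell_entropy(samples, actions, loss):
    """Monte Carlo estimate of H_ell[p] = inf_a E_p[loss(theta, a)],
    minimizing the expected loss over a grid of candidate actions."""
    return min(np.mean(loss(samples, a)) for a in actions)

# Toy posterior over the location of an optimum (hypothetical).
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=10_000)
actions = np.linspace(0.0, 4.0, 401)

# Squared-error loss: the optimal action is the posterior mean and
# H_ell is the posterior variance (about 0.25 here).
squared_error = lambda theta, a: (theta - a) ** 2
print(h_ell_entropy(samples, actions, squared_error))

An acquisition function in this family would then score a candidate query by the expected reduction in H_ell after conditioning on the query's outcome.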