Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution
Despite their groundbreaking performance for many generative modeling tasks,
diffusion models have fallen short on discrete data domains such as natural
language. Crucially, standard diffusion models rely on the well-established
theory of score matching, but efforts to generalize this to discrete structures
have not yielded the same empirical gains. In this work, we bridge this gap by
proposing score entropy, a novel discrete score matching loss that is more
stable than existing methods, forms an ELBO for maximum likelihood training,
and can be efficiently optimized with a denoising variant. We scale our Score
Entropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2,
achieving highly competitive likelihoods while also introducing distinct
algorithmic advantages. In particular, when comparing similarly sized SEDD and
GPT-2 models, SEDD attains comparable perplexities, sometimes outperforming
the baseline. Furthermore, SEDD models learn a more faithful sequence
distribution (better than GPT-2 models with ancestral sampling, as measured by
large models), can trade off compute for generation quality (matching GPT-2
with fewer network evaluations), and enable arbitrary infilling beyond
standard left-to-right prompting.
Comment: 30 pages.
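
As a reading aid (our illustration, not code from the paper): a minimal Python sketch of the pointwise score entropy term, assuming the model outputs an estimate s of a true probability ratio r = p(y)/p(x); the diffusion-specific weighting across positions and the denoising variant described in the paper are omitted.

import numpy as np

def score_entropy(s, r):
    """Pointwise score entropy between a model estimate s and the
    true probability ratio r = p(y)/p(x). Convex in s, nonnegative,
    and minimized exactly at s = r."""
    return s - r * np.log(s) + r * (np.log(r) - 1.0)

# Toy check: for a true ratio r = 0.25, the loss is smallest when
# the model's estimate equals r.
r = 0.25
grid = np.linspace(0.01, 1.0, 200)
print(grid[np.argmin(score_entropy(grid, r))])  # ~0.25

The constant term r * (log r - 1) does not affect the minimizer; it only shifts the loss so that its minimum value is zero.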
Concrete Score Matching: Generalized Score Matching for Discrete Data
Representing probability distributions by the gradient of their density
functions has proven effective in modeling a wide range of continuous data
modalities. However, this representation is not applicable in discrete domains
where the gradient is undefined. To this end, we propose an analogous score
function called the "Concrete score", a generalization of the (Stein) score for
discrete settings. Given a predefined neighborhood structure, the Concrete
score of any input is defined by the rate of change of the probabilities with
respect to local directional changes of the input. This formulation allows us
to recover the (Stein) score in continuous domains when measuring such changes
by the Euclidean distance, while using the Manhattan distance leads to our
novel score function in discrete domains. Finally, we introduce a new framework
to learn such scores from samples called Concrete Score Matching (CSM), and
propose an efficient training objective to scale our approach to high
dimensions. Empirically, we demonstrate the efficacy of CSM on density
estimation tasks on a mixture of synthetic, tabular, and high-dimensional image
datasets, showing that it performs favorably relative to existing baselines
for modeling discrete data.
Comment: First two authors contributed equally.
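
For intuition (a hedged sketch, not the paper's code): a small Python example of the Concrete score on a toy joint distribution over three binary variables, where we assume the neighborhood of an input x is the set of single-bit flips, so each score entry is the relative rate of change (p(n_i(x)) - p(x)) / p(x).

import numpy as np

# Hypothetical joint pmf over three binary variables.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 2, 2))
pmf = np.exp(logits) / np.exp(logits).sum()

def concrete_score(x, pmf):
    """Concrete score of x under the bit-flip neighborhood: one
    entry per neighbor, each the rate of change of the probability
    relative to p(x)."""
    scores = []
    for i in range(len(x)):
        y = list(x)
        y[i] = 1 - y[i]  # local directional change: flip coordinate i
        scores.append((pmf[tuple(y)] - pmf[tuple(x)]) / pmf[tuple(x)])
    return np.array(scores)

print(concrete_score((0, 1, 0), pmf))

In continuous domains, the analogous limit of such difference quotients recovers the (Stein) score, which is the correspondence the paper formalizes.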
Generalizing Bayesian Optimization with Decision-theoretic Entropies
Bayesian optimization (BO) is a popular method for efficiently inferring
optima of an expensive black-box function via a sequence of queries. Existing
information-theoretic BO procedures aim to make queries that most reduce the
uncertainty about optima, where the uncertainty is captured by Shannon entropy.
However, an optimal measure of uncertainty would, ideally, factor in how we
intend to use the inferred quantity in some downstream procedure. In this
paper, we instead consider a generalization of Shannon entropy from work in
statistical decision theory (DeGroot 1962, Rao 1984), which contains a broad
class of uncertainty measures parameterized by a problem-specific loss function
corresponding to a downstream task. We first show that special cases of this
entropy lead to popular acquisition functions used in BO procedures such as
knowledge gradient, expected improvement, and entropy search. We then show how
alternative choices for the loss yield a flexible family of acquisition
functions that can be customized for use in novel optimization settings.
Additionally, we develop gradient-based methods to efficiently optimize our
proposed family of acquisition functions, and demonstrate strong empirical
performance on a diverse set of sequential decision-making tasks, including
variants of top-k optimization, multi-level set estimation, and sequence
search.
Comment: Appears in Proceedings of the 36th Conference on Neural Information
Processing Systems (NeurIPS 2022).
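
To make the generalized entropy concrete (a toy sketch under our own assumptions, not the authors' implementation): the decision-theoretic entropy H_ell[p] = inf_a E_{theta ~ p}[ loss(theta, a) ] can be estimated by Monte Carlo over posterior samples plus a grid search over candidate actions. With squared-error loss it reduces to the posterior variance; swapping in other losses yields acquisition functions like those named above.

import numpy as np

def h_ell_entropy(samples, actions, loss):
    """Monte Carlo estimate of H_ell[p] = inf_a E_p[loss(theta, a)],
    minimizing the expected loss over a grid of candidate actions."""
    return min(np.mean(loss(samples, a)) for a in actions)

# Toy posterior over the location of an optimum (hypothetical).
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=10_000)
actions = np.linspace(0.0, 4.0, 401)

# Squared-error loss: the optimal action is the posterior mean and
# H_ell is the posterior variance (about 0.25 here).
squared_error = lambda theta, a: (theta - a) ** 2
print(h_ell_entropy(samples, actions, squared_error))

An acquisition function in this family would then score a candidate query by the expected reduction in H_ell after conditioning on the query's outcome.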