Understanding the Mechanics of SPIGOT: Surrogate Gradients for Latent Structure Learning
Latent structure models are a powerful tool for modeling language data: they
can mitigate the error propagation and annotation bottleneck in pipeline
systems, while simultaneously uncovering linguistic insights about the data.
One challenge with end-to-end training of these models is the argmax operation,
which has null gradient. In this paper, we focus on surrogate gradients, a
popular strategy to deal with this problem. We explore latent structure
learning through the angle of pulling back the downstream learning objective.
In this paradigm, we discover a principled motivation for both the
straight-through estimator (STE) as well as the recently-proposed SPIGOT - a
variant of STE for structured models. Our perspective leads to new algorithms
in the same family. We empirically compare the known and the novel pulled-back
estimators against the popular alternatives, yielding new insight for
practitioners and revealing intriguing failure cases. Comment: EMNLP 202
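To make the null-gradient problem concrete, the following is a minimal sketch of a straight-through estimator on a three-way categorical choice. It is not the paper's SPIGOT or pulled-back algorithm; the squared-error downstream loss, the function names, and the two backward-pass variants (identity and softmax Jacobian) are assumptions chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_argmax(scores):
    """Forward pass: a discrete one-hot choice of the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


def downstream_loss(z, target):
    """Hypothetical downstream objective: squared error to a target vector."""
    return 0.5 * sum((zi - ti) ** 2 for zi, ti in zip(z, target))


def loss_grad_wrt_z(z, target):
    """Gradient of the squared-error loss with respect to z."""
    return [zi - ti for zi, ti in zip(z, target)]


def finite_difference_grad(scores, target, eps=1e-4):
    """Numerical gradient through argmax: zero wherever argmax is stable."""
    grads = []
    for i in range(len(scores)):
        up = list(scores)
        up[i] += eps
        down = list(scores)
        down[i] -= eps
        diff = (downstream_loss(one_hot_argmax(up), target)
                - downstream_loss(one_hot_argmax(down), target))
        grads.append(diff / (2 * eps))
    return grads


def ste_identity_grad(scores, target):
    """STE variant 1: backward pass pretends argmax was the identity map."""
    z = one_hot_argmax(scores)
    return loss_grad_wrt_z(z, target)


def ste_softmax_grad(scores, target):
    """STE variant 2: backward pass pretends argmax was softmax."""
    z = one_hot_argmax(scores)
    g = loss_grad_wrt_z(z, target)
    p = softmax(scores)
    dot = sum(pi * gi for pi, gi in zip(p, g))
    # Softmax Jacobian-vector product: p_i * (g_i - <p, g>)
    return [pi * (gi - dot) for pi, gi in zip(p, g)]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.5]   # model currently picks index 0
    target = [0.0, 1.0, 0.0]   # downstream loss prefers index 1
    print("true gradient:", finite_difference_grad(scores, target))
    print("STE (identity):", ste_identity_grad(scores, target))
    print("STE (softmax): ", ste_softmax_grad(scores, target))
```

In this toy setting the numerical gradient through argmax is zero, so the scores would never move, whereas both surrogate gradients push the score of the wrongly chosen index down and the preferred index up under gradient descent.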
Adversarial Generation of Natural Language
Generative Adversarial Networks (GANs) have gathered a lot of attention from
the computer vision community, yielding impressive results for image
generation. Advances in the adversarial generation of natural language from
noise, however, are not commensurate with the progress made in generating images,
and still lag far behind likelihood-based methods. In this paper, we take a
step towards generating natural language with a GAN objective alone. We
introduce a simple baseline that addresses the discrete output space problem
without relying on gradient estimators and show that it is able to achieve
state-of-the-art results on a Chinese poem generation dataset. We present
quantitative results on generating sentences from context-free and
probabilistic context-free grammars, and qualitative language modeling results.
A conditional version is also described that can generate sequences conditioned
on sentence characteristics. Comment: 11 pages, 3 figures, 5 table
On the Interpretability of Attention Networks
Attention mechanisms form a core component of several successful deep
learning architectures, and are based on one key idea: ''The output depends
only on a small (but unknown) segment of the input.'' In several practical
applications like image captioning and language translation, this is mostly
true. In trained models with an attention mechanism, the outputs of an
intermediate module that encodes the segment of input responsible for the
output are often used as a way to peek into the `reasoning` of the network. We
make such a notion more precise for a variant of the classification problem
that we term selective dependence classification (SDC) when used with attention
model architectures. Under such a setting, we demonstrate various error modes
where an attention model can be accurate but fail to be interpretable, and show
that such models do occur as a result of training. We illustrate various
situations that can accentuate and mitigate this behaviour. Finally, we use our
objective definition of interpretability for SDC tasks to evaluate a few
attention model learning algorithms designed to encourage sparsity and
demonstrate that these algorithms help improve interpretability. Comment: ACML 2022, proceedings to appear in PMLR, Volume 18
Modeling the Multi-mode Distribution in Self-Supervised Language Models
Self-supervised large language models (LMs) have become a highly influential and foundational tool for many NLP models. For this reason, their expressivity is an important topic of study. In near-universal practice, given the language context, the model predicts a word from the vocabulary using a single embedded vector representation of both context and dictionary entries. Note that the context sometimes implies that the distribution over predicted words should be multi-modal in embedded space. However, the context’s single-vector representation provably fails to capture such a distribution. To address this limitation, we propose to represent context with multiple vector embeddings, which we term facets. This is distinct from previous work on multi-sense vocabulary embeddings, which employs multiple vectors for the dictionary entries, not the context.
In this dissertation, we first present the theoretical limitations of the single context embedding in LMs and how the theoretical analyses suggest new alternative softmax layers that encode a context as multiple embeddings. The proposed alternatives achieve better perplexity than the mixture of softmax (MoS), especially given an ambiguous context, without adding significant computational cost to LMs. Our approaches also let GPT-2 learn to properly copy the entities from the context, which increases the coherence of the generated text without requiring any labels.
In addition to predicting the next word, we also use multiple CLS embeddings to improve state-of-the-art pretraining methods for BERT on natural language understanding (NLU) benchmarks without introducing significant extra parameters or computations, especially when the training datasets are small. Furthermore, we show that our multi-facet embeddings improve sequential recommendation, scientific paper embeddings, measurement of sentence similarity, distantly supervised relation extraction, unsupervised text pattern entailment detection, and cold-start citation recommendation. Finally, we use the multiple vector embeddings to predict the future topics of a context and, building on this basis, we propose a novel interactive language generation framework.
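The single-embedding limitation described in this abstract can be illustrated with a toy case. The sketch below is not the dissertation's multi-facet softmax; it assumes a hypothetical three-word vocabulary with one-dimensional embeddings and uses a plain mixture of softmaxes (in the spirit of MoS) as a stand-in for a multi-embedding context.

```python
import math

# Hypothetical 1-D word embeddings: "b" sits between "a" and "c".
WORD_EMBEDDINGS = {"a": -1.0, "b": 0.0, "c": 1.0}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def single_facet_probs(h):
    """Word distribution from one context embedding h (dot-product logits)."""
    words = list(WORD_EMBEDDINGS)
    probs = softmax([h * WORD_EMBEDDINGS[w] for w in words])
    return dict(zip(words, probs))


def multi_facet_probs(facets, weights):
    """Mixture of per-facet softmaxes; weights are assumed to sum to 1."""
    mixed = {w: 0.0 for w in WORD_EMBEDDINGS}
    for h, pi in zip(facets, weights):
        for w, p in single_facet_probs(h).items():
            mixed[w] += pi * p
    return mixed


def both_ends_beat_middle(probs):
    """True if 'a' and 'c' are both more likely than 'b' (a bimodal target)."""
    return probs["a"] > probs["b"] and probs["c"] > probs["b"]


if __name__ == "__main__":
    # No single context scalar can rank both "a" and "c" above "b":
    # that would need -h > 0 and h > 0 at the same time.
    grid = [h / 10 for h in range(-50, 51)]
    print(any(both_ends_beat_middle(single_facet_probs(h)) for h in grid))

    # Two facets pointing in opposite directions recover the bimodal shape.
    print(both_ends_beat_middle(multi_facet_probs([3.0, -3.0], [0.5, 0.5])))
```

With a single context scalar, ranking "a" above "b" requires a negative value and ranking "c" above "b" requires a positive one, so the bimodal target is unreachable; two opposing facets reach it in this toy setup.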