Contrastive Learning of Sentence Embeddings from Scratch
Contrastive learning has been the dominant approach to train state-of-the-art
sentence embeddings. Previous studies have typically learned sentence
embeddings either through the use of human-annotated natural language inference
(NLI) data or via large-scale unlabeled sentences in an unsupervised manner.
However, even unlabeled sentences can be difficult to acquire in certain
domains. To address this issue,
we present SynCSE, a contrastive learning framework that trains sentence
embeddings with synthesized data. Specifically, we explore utilizing large
language models to synthesize the required data samples for contrastive
learning, including (1) producing positive and negative annotations given
unlabeled sentences (SynCSE-partial), and (2) generating sentences along with
their corresponding annotations from scratch (SynCSE-scratch). Experimental
results on sentence similarity and reranking tasks indicate that both
SynCSE-partial and SynCSE-scratch greatly outperform unsupervised baselines,
and SynCSE-partial even achieves comparable performance to the supervised
models in most settings. Comment: Preprint.
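As a rough illustration of the contrastive objective that such synthesized triplets would feed, the sketch below implements a SimCSE-style InfoNCE loss with in-batch negatives plus one synthesized hard negative per anchor. The function name, the triplet source, and the temperature value are assumptions for illustration, not SynCSE's exact training recipe.

# Minimal sketch of a SimCSE-style contrastive objective over synthesized
# (anchor, positive, hard-negative) triplets. How the triplets are produced
# (e.g. an LLM prompted to paraphrase or contradict each sentence) is assumed
# here, not taken from the paper.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, pos_emb, neg_emb, temperature=0.05):
    # Normalize so dot products become cosine similarities.
    anchor = F.normalize(anchor_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)
    # Compare each anchor against every positive (in-batch negatives)
    # and every synthesized hard negative in the batch.
    sim_pos = anchor @ pos.t() / temperature      # (batch, batch)
    sim_neg = anchor @ neg.t() / temperature      # (batch, batch)
    logits = torch.cat([sim_pos, sim_neg], dim=1)
    # The i-th anchor's own positive sits on the diagonal of sim_pos.
    labels = torch.arange(anchor_emb.size(0), device=anchor_emb.device)
    return F.cross_entropy(logits, labels)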
Efficient Correlated Topic Modeling with Topic Embedding
Correlated topic modeling has been limited to small model and problem sizes
due to its high computational cost and poor scaling. In this paper, we
propose a new model which learns compact topic embeddings and captures topic
correlations through the closeness between the topic vectors. Our method
enables efficient inference in the low-dimensional embedding space, reducing
previous cubic or quadratic time complexity to linear w.r.t. the topic size. We
further speed up variational inference with a fast sampler that exploits the sparsity
of topic occurrence. Extensive experiments show that our approach is capable of
handling model and data scales which are several orders of magnitude larger
than existing correlation results, without sacrificing modeling quality,
providing competitive or superior performance in document classification and
retrieval. Comment: KDD 2017 oral. The first two authors contributed equally.
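The abstract's key idea, reading topic correlation off the closeness of compact topic vectors instead of a full K x K covariance, can be sketched roughly as follows. The softmax link and all dimensions here are illustrative assumptions, not the paper's generative model or its variational sampler.

# Rough sketch: each topic is a low-dimensional vector, and topic correlation
# comes from the closeness of those vectors, so correlation structure costs
# O(K * d) parameters rather than a K x K covariance matrix.
import numpy as np

K, d, V = 1000, 50, 20000                        # topics, embedding dim, vocabulary
rng = np.random.default_rng(0)
topic_vecs = rng.normal(scale=0.1, size=(K, d))  # compact topic embeddings
word_vecs = rng.normal(scale=0.1, size=(V, d))   # word embeddings (unused below, for context)

def doc_topic_distribution(doc_vec):
    # Document-topic weights from closeness in the shared embedding space;
    # this costs O(K * d), i.e. linear in the number of topics.
    scores = topic_vecs @ doc_vec
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()

def topic_correlation(i, j):
    # Correlation between two topics read off their embedding closeness.
    vi, vj = topic_vecs[i], topic_vecs[j]
    return float(vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj) + 1e-12))

# Example: topic-topic correlation is just a cosine in the embedding space.
print(round(topic_correlation(0, 1), 4))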
A Surprisingly Effective Fix for Deep Latent Variable Modeling of Text
When trained effectively, the Variational Autoencoder (VAE) is both a
powerful language model and an effective representation learning framework. In
practice, however, VAEs are trained with the evidence lower bound (ELBO) as a
surrogate objective to the intractable marginal data likelihood. This approach
to training yields unstable results, frequently leading to a disastrous local
optimum known as posterior collapse. In this paper, we investigate a simple fix
for posterior collapse which yields surprisingly effective results. The
combination of two known heuristics, previously considered only in isolation,
substantially improves held-out likelihood, reconstruction, and latent
representation learning when compared with previous state-of-the-art methods.
More interestingly, while our experiments demonstrate superiority on these
principal evaluations, our method obtains a worse ELBO. We use these results to
argue that the typical surrogate objective for VAEs may not be sufficient or
necessarily appropriate for balancing the goals of representation learning and
data distribution modeling. Comment: EMNLP 2019 short paper. The first two authors contributed equally.
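The abstract does not name the two heuristics it combines, so the sketch below should be read only as an illustration of the kind of surrogate-objective modification involved: it pairs two widely used heuristics, KL-weight annealing and a free-bits floor on the per-dimension KL, which are assumptions here rather than the paper's confirmed recipe.

# Hedged sketch of a modified text-VAE training loss combining two common
# posterior-collapse heuristics (assumed for illustration, not the paper's).
import torch

def vae_step_loss(recon_nll, kl_per_dim, step, warmup_steps=10000, free_bits=0.5):
    # recon_nll:  (batch,) reconstruction negative log-likelihood.
    # kl_per_dim: (batch, latent_dim) KL between posterior and prior per dimension.

    # Heuristic 1: anneal the KL weight from 0 to 1 over a warm-up phase.
    beta = min(1.0, step / warmup_steps)

    # Heuristic 2: "free bits" -- dimensions whose KL is already below a small
    # floor incur no extra penalty, discouraging a fully collapsed latent code.
    kl = torch.clamp(kl_per_dim, min=free_bits).sum(dim=-1)

    return (recon_nll + beta * kl).mean()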
- …