DebCSE: Rethinking Unsupervised Contrastive Sentence Embedding Learning in the Debiasing Perspective
Several prior studies have suggested that word frequency biases can cause the
BERT model to learn indistinguishable sentence embeddings. Contrastive learning
schemes such as SimCSE and ConSERT have already been adopted successfully in
unsupervised sentence embedding to improve the quality of embeddings by
reducing this bias. However, these methods still introduce new biases, such as
sentence length bias and false negative sample bias, which hinder the model's
ability to learn more fine-grained semantics. In this paper, we reexamine the
challenges of contrastive sentence embedding learning from a debiasing
perspective and argue that effectively eliminating the influence of various
biases is crucial for learning high-quality sentence embeddings. We argue that
all of these biases are introduced by the simple rules used to construct
training data in contrastive learning, and that the key to contrastive
sentence embedding learning is to mimic the distribution of training data used
in supervised machine learning in an unsupervised way. We propose a novel
contrastive framework for sentence embedding, termed DebCSE, which eliminates
the impact of these biases through an inverse propensity weighted sampling
method that selects high-quality positive and negative pairs according to both
the surface and semantic similarity between
sentences. Extensive experiments on semantic textual similarity (STS)
benchmarks reveal that DebCSE significantly outperforms the latest
state-of-the-art models with an average Spearman's correlation coefficient of
80.33% on BERT-base
- …
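The abstract only names the sampling scheme, so the following is a hedged, illustrative sketch (not the authors' code) of what inverse propensity weighted pair selection could look like: candidate pairs that the naive construction rules would over-sample, approximated here by placeholder surface and semantic similarity scores, are down-weighted before sampling. All function names, the similarity measures, and the propensity formula are assumptions for illustration only.

```python
# Illustrative sketch of inverse-propensity-weighted pair sampling (assumed,
# not taken from DebCSE): pairs that a naive construction rule would pick too
# often receive lower sampling weight, so the selected training pairs better
# approximate a supervised-style data distribution.
import random
from difflib import SequenceMatcher

import numpy as np


def surface_similarity(a: str, b: str) -> float:
    """Character-level overlap as a cheap stand-in for surface similarity."""
    return SequenceMatcher(None, a, b).ratio()


def semantic_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two (assumed pre-computed) sentence embeddings."""
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-8))


def sample_pairs(sentences, embeddings, n_pairs, rng=random.Random(0)):
    """Sample sentence-index pairs with probability inversely proportional to a
    simple propensity score (a mix of surface and semantic similarity)."""
    candidates, weights = [], []
    for _ in range(10 * n_pairs):                       # build a candidate pool
        i, j = rng.sample(range(len(sentences)), 2)
        propensity = 0.5 * surface_similarity(sentences[i], sentences[j]) \
                   + 0.5 * max(semantic_similarity(embeddings[i], embeddings[j]), 0.0)
        candidates.append((i, j))
        weights.append(1.0 / (propensity + 1e-3))       # inverse propensity weight
    total = sum(weights)
    probs = [w / total for w in weights]
    idx = np.random.default_rng(0).choice(len(candidates), size=n_pairs,
                                          replace=False, p=probs)
    return [candidates[k] for k in idx]
```

In the abstract's framing, the point of such weighting is to correct for the biases (word frequency, sentence length, false negatives) that simple pair-construction rules introduce, so that the unsupervised training pairs mimic a supervised training distribution.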