Sparse annotation poses persistent challenges to training dense retrieval
models; for example, it distorts the training signal when unlabeled relevant
documents are used spuriously as negatives in contrastive learning. To
alleviate this problem, we introduce evidence-based label smoothing, a novel,
computationally efficient method that prevents penalizing the model for
assigning high relevance to false negatives. To compute the target relevance
distribution over candidate documents within the ranking context of a given
query, we assign a non-zero relevance probability to the candidates most
similar to the ground-truth document(s), weighted by their degree of
similarity.
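A minimal sketch of such a smoothed target distribution, assuming a simple scheme in which a fixed smoothing mass is spread over the top-k candidates most similar to the ground truth, in proportion to that similarity (the function name, parameters, and exact weighting are illustrative, not the authors' exact formulation):

```python
import numpy as np

def smoothed_targets(sim_to_positive, positive_idx, smoothing=0.3, top_k=3):
    """Build a target relevance distribution over candidates.

    sim_to_positive: similarity of each candidate to the ground-truth doc.
    positive_idx: index of the ground-truth document among the candidates.
    Mass (1 - smoothing) goes to the positive; the remaining mass is
    spread over the top_k most similar candidates, proportional to
    their (clipped, non-negative) similarity.
    """
    sims = np.asarray(sim_to_positive, dtype=float).copy()
    targets = np.zeros_like(sims)
    targets[positive_idx] = 1.0 - smoothing

    sims[positive_idx] = -np.inf  # exclude the positive itself
    neighbors = np.argsort(sims)[::-1][:top_k]
    weights = np.clip(sims[neighbors], 0.0, None)
    if weights.sum() > 0:
        targets[neighbors] += smoothing * weights / weights.sum()
    else:
        # no plausible neighbors: fall back to a one-hot target
        targets[positive_idx] += smoothing
    return targets
```

Training against this distribution (e.g. with a KL-divergence or cross-entropy loss) then no longer pushes the scores of likely false negatives toward zero.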
To estimate relevance we leverage an improved similarity metric based on
reciprocal nearest neighbors, which can also be used independently to rerank
candidates in post-processing. Through extensive experiments on two large-scale
ad hoc text retrieval datasets, we demonstrate that reciprocal nearest
neighbors can improve the ranking effectiveness of dense retrieval models, both
when used for label smoothing and when used for reranking. This indicates that
by considering relationships between documents and queries beyond simple
geometric distance, we can effectively enhance the ranking context.

Comment: EMNLP 202
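One common instantiation of a reciprocal-nearest-neighbors similarity, sketched here as the Jaccard overlap of k-reciprocal neighbor sets (an assumption for illustration, not necessarily the exact metric used in the paper):

```python
import numpy as np

def knn_sets(sim, k):
    # For each item, the set of indices of its k most similar items
    # (excluding itself), given a full pairwise similarity matrix.
    order = np.argsort(-sim, axis=1)
    return [set(row[row != i][:k]) for i, row in enumerate(order)]

def reciprocal_nn_score(sim, k=5):
    """Pairwise score: Jaccard overlap of reciprocal-neighbor sets.

    j is a reciprocal neighbor of i iff each appears in the other's
    k-NN list; items sharing many reciprocal neighbors score high.
    """
    n = sim.shape[0]
    nn = knn_sets(sim, k)
    recip = [{j for j in nn[i] if i in nn[j]} for i in range(n)]
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = recip[i] | recip[j]
            inter = recip[i] & recip[j]
            scores[i, j] = len(inter) / len(union) if union else 0.0
    return scores
```

Because reciprocity must hold in both directions, this score is more robust to asymmetric "hub" items than raw geometric similarity, which is what makes it useful both for reranking candidates and for estimating the relevance of unlabeled documents.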