A Weighted Correlation Index for Rankings with Ties
Understanding the correlation between two different scores for the same set
of items is a common problem in information retrieval, and the most commonly
used statistic that quantifies this correlation is Kendall's τ. However,
the standard definition fails to capture that discordances between items with
high rank are more important than those between items with low rank. Recently,
a new measure of correlation based on average precision has been proposed to
solve this problem, but like many alternative proposals in the literature it
assumes that there are no ties in the scores. This is a major deficiency in a
number of contexts, and in particular while comparing centrality scores on
large graphs, as the obvious baseline, indegree, has a very large number of
ties in web and social graphs. We propose to extend Kendall's definition in a
natural way to take into account weights in the presence of ties. We prove a
number of interesting mathematical properties of our generalization and
describe an algorithm for its computation. We also validate the
usefulness of our weighted measure of correlation using experimental data.
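As a rough illustration of this kind of top-weighted, tie-aware correlation, the sketch below contrasts the classical statistic with SciPy's weightedtau, which computes a hyperbolically weighted version of Kendall's τ and accepts ties in the input; the scores themselves are invented for the example.

```python
# Sketch: unweighted vs. weighted Kendall's tau on two scores with ties.
# The data are invented for illustration; scipy.stats.weightedtau uses a
# hyperbolic weighting by default, so swaps near the top of the ranking
# count more than swaps near the bottom.
from scipy.stats import kendalltau, weightedtau

# Two scores for the same ten items, e.g. a centrality score vs. indegree,
# where the second score has many ties (common on web and social graphs).
score_a = [9.1, 8.7, 7.7, 6.5, 5.0, 4.2, 3.3, 2.1, 1.5, 0.4]
score_b = [12,  12,  9,   9,   9,   4,   4,   4,   1,   1]

tau, _ = kendalltau(score_a, score_b)      # tie-corrected, unweighted
wtau, _ = weightedtau(score_a, score_b)    # tie-aware, top-weighted

print(f"Kendall's tau (unweighted):       {tau:.3f}")
print(f"Weighted tau (top-weighted, ties): {wtau:.3f}")
```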
Query Expansion with Locally-Trained Word Embeddings
Continuous space word embeddings have received a great deal of attention in
the natural language processing and machine learning communities for their
ability to model term similarity and other relationships. We study the use of
term relatedness in the context of query expansion for ad hoc information
retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when
trained globally, underperform corpus and query specific embeddings for
retrieval tasks. These results suggest that other tasks benefiting from global
embeddings may also benefit from local embeddings.
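A minimal sketch of the local-embedding idea follows, assuming a first-pass retriever is available; `retrieve_top_docs` and `tokenize` are hypothetical helpers, and the training settings are illustrative rather than those used in the paper.

```python
# Sketch of query expansion with locally-trained embeddings.
# `retrieve_top_docs` and `tokenize` are hypothetical stand-ins for an initial
# retrieval pass and simple preprocessing; the point is that the embedding is
# trained on a query-specific subcorpus rather than on a global corpus.
from gensim.models import Word2Vec

def expand_query(query_terms, retrieve_top_docs, tokenize, k_docs=1000, k_terms=5):
    # 1. Build a query-specific subcorpus from a first-pass retrieval.
    local_docs = retrieve_top_docs(" ".join(query_terms), k=k_docs)
    sentences = [tokenize(doc) for doc in local_docs]

    # 2. Train a small word2vec model only on that local subcorpus.
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, workers=4)

    # 3. Expand the query with the nearest neighbours of each query term.
    expansion = []
    for term in query_terms:
        if term in model.wv:
            expansion += [w for w, _ in model.wv.most_similar(term, topn=k_terms)]
    return query_terms + expansion
```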
Training Curricula for Open Domain Answer Re-Ranking
In precision-oriented tasks like answer ranking, it is more important to rank
many relevant answers highly than to retrieve all relevant answers. It follows
that a good ranking strategy would be to learn how to identify the easiest
correct answers first (i.e., assign a high ranking score to answers that have
characteristics that usually indicate relevance, and a low ranking score to
those with characteristics that do not), before incorporating more complex
logic to handle difficult cases (e.g., semantic matching or reasoning). In this
work, we apply this idea to the training of neural answer rankers using
curriculum learning. We propose several heuristics to estimate the difficulty
of a given training sample. We show that the proposed heuristics can be used to
build a training curriculum that down-weights difficult samples early in the
training process. As the training process progresses, our approach gradually
shifts to weighting all samples equally, regardless of difficulty. We present a
comprehensive evaluation of our proposed idea on three answer ranking datasets.
Results show that our approach leads to superior performance for two leading
neural ranking architectures, namely BERT and ConvKNRM, using both pointwise
and pairwise losses. When applied to a BERT-based ranker, our method yields up
to a 4% improvement in MRR and a 9% improvement in P@1 (compared to the model
trained without a curriculum). This results in models that can achieve
comparable performance to more expensive state-of-the-art techniques.
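A minimal sketch of the weighting idea, assuming a linear ramp and a difficulty heuristic scaled to [0, 1]; this illustrates the general scheme of down-weighting hard samples early and annealing toward uniform weights, not the exact weighting functions from the paper.

```python
# Minimal sketch of difficulty-aware curriculum weighting (not the paper's exact
# formulation). Each sample's weight starts near its "easiness" score and is
# linearly annealed toward 1.0, so all samples count equally later in training.
import numpy as np

def curriculum_weights(difficulty, step, ramp_steps):
    """difficulty: array in [0, 1], 0 = easy, 1 = hard, from some heuristic
    (e.g. derived from an unsupervised ranker's scores). step / ramp_steps
    controls how far the curriculum has progressed."""
    progress = min(step / ramp_steps, 1.0)
    easy_weight = 1.0 - np.asarray(difficulty)   # down-weight hard samples early
    return (1.0 - progress) * easy_weight + progress * 1.0

# Example: apply the weights to per-sample losses early vs. late in training.
difficulty = np.array([0.1, 0.8, 0.5])
losses = np.array([0.3, 1.2, 0.7])
print(curriculum_weights(difficulty, step=0,    ramp_steps=1000) * losses)  # hard samples damped
print(curriculum_weights(difficulty, step=1000, ramp_steps=1000) * losses)  # uniform weighting
```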
Large language models can accurately predict searcher preferences
Relevance labels, which indicate whether a search result is valuable to a
searcher, are key to evaluating and optimising search systems. The best way to
capture the true preferences of users is to ask them for their careful feedback
on which results would be useful, but this approach does not scale to produce a
large number of labels. Getting relevance labels at scale is usually done with
third-party labellers, who judge on behalf of the user, but there is a risk of
low-quality data if the labeller doesn't understand user needs. To improve
quality, one standard approach is to study real users through interviews, user
studies and direct feedback, find areas where labels are systematically
disagreeing with users, then educate labellers about user needs through judging
guidelines, training and monitoring. This paper introduces an alternate
approach for improving label quality. It takes careful feedback from real
users, which by definition is the highest-quality first-party gold data that
can be derived, and develops a large language model prompt that agrees with
that data.
We present ideas and observations from deploying language models for
large-scale relevance labelling at Bing, and illustrate with data from TREC. We
have found large language models can be effective, with accuracy as good as
human labellers and similar capability to pick the hardest queries, best runs,
and best groups. Systematic changes to the prompts make a difference in
accuracy, but so too do simple paraphrases. Measuring agreement with real
searchers requires high-quality "gold" labels; given these, we find that models
produce better labels than third-party workers, at a fraction of the cost, and
that these labels let us train notably better rankers.
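As a rough sketch of what such a labelling pipeline might look like, the code below uses a hypothetical `call_llm` callable together with an invented prompt and label scale; it is not the prompt deployed at Bing, and the agreement measure is a simple exact-match rate rather than the paper's evaluation.

```python
# Hypothetical sketch of LLM-based relevance labelling. The prompt wording,
# 0-3 label scale, and `call_llm` are placeholders for illustration only.

PROMPT = """You are a search quality rater. Given a query and a result,
rate how useful the result is to the searcher on a scale of 0-3
(0 = not relevant, 3 = perfectly relevant). Answer with a single digit.

Query: {query}
Result: {passage}
Label:"""

def label_relevance(query, passage, call_llm):
    """call_llm: any function mapping a prompt string to the model's text output."""
    reply = call_llm(PROMPT.format(query=query, passage=passage)).strip()
    return int(reply[0]) if reply and reply[0].isdigit() else None

def agreement(predicted, gold):
    """Fraction of exact matches against first-party 'gold' labels."""
    pairs = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    return sum(p == g for p, g in pairs) / len(pairs) if pairs else 0.0
```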
- …