Learning More From Less: Towards Strengthening Weak Supervision for Ad-Hoc Retrieval
The limited availability of ground truth relevance labels has been a major
impediment to the application of supervised methods to ad-hoc retrieval. As a
result, unsupervised scoring methods, such as BM25, remain strong competitors
to deep learning techniques which have brought on dramatic improvements in
other domains, such as computer vision and natural language processing. Recent
works have shown that it is possible to take advantage of the performance of
these unsupervised methods to generate training data for learning-to-rank
models. The key limitation to this line of work is the size of the training set
required to surpass the performance of the original unsupervised method, which
can be as large as training examples. Building on these insights, we
propose two methods to reduce the amount of training data required. The first
method takes inspiration from crowdsourcing, and leverages multiple
unsupervised rankers to generate soft, or noise-aware, training labels. The
second identifies harmful, or mislabeled, training examples and removes them
from the training set. We show that our methods allow us to surpass the
performance of the unsupervised baseline with far fewer training examples than
previous works.
Comment: SIGIR 201
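As a rough illustration of the first method, the sketch below turns the scores of several unsupervised rankers into a soft pairwise label. The abstract does not specify the aggregation rule, so the vote-fraction scheme, the function name `soft_pairwise_label`, and the tie handling are all assumptions made for this example, not the authors' method.

```python
from typing import Sequence


def soft_pairwise_label(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Return the fraction of rankers that score document A above document B.

    scores_a[i] and scores_b[i] are the scores that ranker i assigns to
    documents A and B for the same query. Ties count as half a vote.
    This voting rule is an illustrative assumption, not taken from the paper.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("need one score per ranker for both documents")

    votes = 0.0
    for a, b in zip(scores_a, scores_b):
        if a > b:
            votes += 1.0
        elif a == b:
            votes += 0.5
    return votes / len(scores_a)


# Four hypothetical rankers: two prefer A, one prefers B, one is tied.
label = soft_pairwise_label([2.0, 1.0, 3.0, 0.5], [1.0, 1.5, 2.0, 0.5])
print(label)
```

A label near 0.5 signals disagreement among the rankers, so a pairwise loss trained against it would weight that pair less confidently than a hard 0/1 label would.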
Neural check-worthiness ranking with weak supervision: Finding sentences for fact-checking
Automatic fact-checking systems detect misinformation, such as fake news, by
(i) selecting check-worthy sentences for fact-checking, (ii) gathering related
information to the sentences, and (iii) inferring the factuality of the
sentences. Most prior research on (i) uses hand-crafted features to select
check-worthy sentences, and does not explicitly account for the recent finding
that the top weighted terms in both check-worthy and non-check-worthy sentences
are actually overlapping [15]. Motivated by this, we present a neural
check-worthiness sentence ranking model that represents each word in a sentence
by both its embedding (aiming to capture its semantics) and its
syntactic dependencies (aiming to capture its role in modifying the semantics
of other terms in the sentence). Our model is an end-to-end trainable neural
network for check-worthiness ranking, which is trained on large amounts of
unlabelled data through weak supervision. Thorough experimental evaluation
against state-of-the-art baselines, with and without weak supervision, shows
our model to be consistently superior (+13% in MAP and +28% at various
Precision cut-offs over the best baseline, with statistical significance).
Empirical analysis of the use of weak supervision, word embedding pretraining
on domain-specific data, and the use of syntactic dependencies of our model
reveals that check-worthy sentences contain notably more identical syntactic
dependencies than non-check-worthy sentences.
Comment: 6 page
Selective Weak Supervision for Neural Information Retrieval
This paper democratizes neural information retrieval to scenarios where large
scale relevance training signals are not available. We revisit the classic IR
intuition that anchor-document relations approximate query-document relevance
and propose a reinforcement weak supervision selection method, ReInfoSelect,
which learns to select anchor-document pairs that best weakly supervise the
neural ranker (action), using the ranking performance on a handful of relevance
labels as the reward. Iteratively, for a batch of anchor-document pairs,
ReInfoSelect backpropagates the gradients through the neural ranker, gathers
its NDCG reward, and optimizes the data selection network using policy
gradients, until the neural ranker's performance peaks on target relevance
metrics (convergence). In our experiments on three TREC benchmarks, neural
rankers trained by ReInfoSelect, with only publicly available anchor data,
significantly outperform feature-based learning to rank methods and match the
effectiveness of neural rankers trained with private commercial search logs.
Our analyses show that ReInfoSelect effectively selects weak supervision
signals based on the stage of the neural ranker training, and intuitively picks
anchor-document pairs similar to query-document pairs.
Comment: Accepted by WWW 202
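The selection loop described above can be sketched as a single REINFORCE update. This is a minimal toy under stated assumptions: the linear-logistic policy, the feature vectors, the learning rate, and the `reward_fn` callback (standing in for "train the ranker on the selected pairs and measure NDCG on the labelled queries") are all hypothetical and are not taken from ReInfoSelect itself.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_step(
    weights: Sequence[float],
    batch: Sequence[Sequence[float]],
    reward_fn: Callable[[List[int]], float],
    rng: random.Random,
    lr: float = 0.1,
) -> Tuple[List[float], List[int], float]:
    """One policy-gradient update of a keep/drop selection policy.

    Each item in `batch` is a feature vector for one anchor-document pair.
    The policy keeps a pair with probability sigmoid(weights . features).
    `reward_fn` is a hypothetical stand-in for the ranker's NDCG after
    training on the kept pairs.
    """
    probs = [sigmoid(sum(w * f for w, f in zip(weights, feats))) for feats in batch]
    actions = [1 if rng.random() < p else 0 for p in probs]
    reward = reward_fn(actions)

    # For a Bernoulli policy with a logistic link, the gradient of
    # log pi(action | features) with respect to the weights is (action - p) * features.
    new_weights = list(weights)
    for feats, p, a in zip(batch, probs, actions):
        for i, f in enumerate(feats):
            new_weights[i] += lr * reward * (a - p) * f
    return new_weights, actions, reward


# Toy usage: the reward is simply the number of pairs kept.
rng = random.Random(0)
w, acts, r = reinforce_step([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], lambda a: float(sum(a)), rng)
print(w, acts, r)
```

In a fuller setup the step would be repeated over batches until the reward on the labelled queries stops improving, which corresponds to the convergence criterion the abstract mentions.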