Neural ranking models (NRMs) have demonstrated effective performance in
several information retrieval (IR) tasks. However, training NRMs often requires
large-scale training data, which is difficult and expensive to obtain. To
address this issue, one can train NRMs via weak supervision, where a large
dataset is automatically generated using an existing ranking model (called the
weak labeler) for training NRMs. Weakly supervised NRMs can generalize from the
observed data and significantly outperform the weak labeler. This paper
generalizes this idea through an iterative re-labeling process, demonstrating
that weakly supervised models can iteratively play the role of weak labeler and
significantly improve ranking performance without using manually labeled data.
The proposed Generalized Weak Supervision (GWS) solution is generic and
orthogonal to the ranking model architecture. This paper offers four
implementations of GWS: self-labeling, cross-labeling, joint cross- and
self-labeling, and greedy multi-labeling. GWS also benefits from a query
importance weighting mechanism based on query performance prediction methods to
reduce noise in the generated training data. We further draw a theoretical
connection between self-labeling and Expectation-Maximization. Our experiments
on two passage retrieval benchmarks suggest that all implementations of GWS
lead to substantial improvements compared to weak supervision in all cases.
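
The iterative re-labeling idea behind self-labeling can be illustrated with a minimal toy sketch. It is not the paper's implementation: the term-overlap weak labeler, the term-weight "ranker", and all function names (`term_overlap`, `label`, `train`, `self_labeling`) are hypothetical stand-ins chosen only to make the loop runnable. It omits query importance weighting and the other three GWS variants.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]
Example = Tuple[str, str, float]  # (query, document, weak label)


def term_overlap(query: str, doc: str) -> float:
    """Stand-in weak labeler: fraction of document tokens found in the query."""
    q_terms = set(query.lower().split())
    d_terms = doc.lower().split()
    return sum(t in q_terms for t in d_terms) / max(len(d_terms), 1)


def label(scorer: Scorer, queries: List[str], docs: List[str]) -> List[Example]:
    """Generate weakly labeled training data with the current labeler."""
    return [(q, d, scorer(q, d)) for q in queries for d in docs]


def train(examples: List[Example]) -> Scorer:
    """Toy 'ranker': one weight per term, the mean label of pairs where it matches."""
    total = defaultdict(float)
    count = defaultdict(int)
    for q, d, y in examples:
        for t in set(q.lower().split()) & set(d.lower().split()):
            total[t] += y
            count[t] += 1
    weights = {t: total[t] / count[t] for t in total}

    def score(query: str, doc: str) -> float:
        q_terms = set(query.lower().split())
        d_terms = doc.lower().split()
        matched = sum(weights.get(t, 0.0) for t in d_terms if t in q_terms)
        return matched / max(len(d_terms), 1)

    return score


def self_labeling(weak_labeler: Scorer, queries: List[str],
                  docs: List[str], rounds: int = 3) -> Scorer:
    """Each round, the model trained in the previous round becomes the labeler."""
    labeler = weak_labeler
    for _ in range(rounds):
        student = train(label(labeler, queries, docs))
        labeler = student  # the student now plays the role of weak labeler
    return labeler


queries = ["neural ranking"]
docs = ["neural ranking models", "cooking pasta recipes"]
ranker = self_labeling(term_overlap, queries, docs, rounds=3)
```

The point of the sketch is the control flow in `self_labeling`: no manually labeled data enters the loop, and the only supervision in each round comes from the previous round's model. Whether such a loop improves ranking quality depends on the model and data; the toy ranker here makes no claim in that regard.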