Strengthening weak supervision for information retrieval


The limited availability of ground-truth relevance labels has been a major impediment to applying supervised machine learning to ad-hoc document retrieval and ranking. As a result, unsupervised scoring methods such as BM25 and TF-IDF remain strong competitors to deep learning approaches, even though deep learning has brought dramatic improvements to other domains, such as computer vision and natural language processing. However, recent work has shown that the output of unsupervised rankers can be used to generate the training data needed for learning-to-rank models. Surprisingly, machine learning models trained on this generated data can outperform the original unsupervised method. The key limitation of this line of work is the size of the training set required to surpass the performance of the original unsupervised method, which can be as large as 10¹³ training examples. Building on these insights, this work proposes two methods to reduce the amount of training data required. The first takes inspiration from crowdsourcing and leverages multiple unsupervised rankers to generate soft, or noise-aware, training labels. The second identifies harmful, or mislabeled, training examples and removes them from the training set. We show that our methods surpass the performance of the unsupervised baseline with far fewer training examples than previous work.

Electrical and Computer Engineering
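To make the two proposed ideas concrete, here is a minimal, hypothetical sketch (not the thesis's actual code): several unsupervised rankers score the same documents, their normalized scores are averaged into a soft label per document, and documents where the rankers disagree strongly are flagged as likely mislabeled and dropped. The function names, the min-max normalization, and the disagreement threshold are all illustrative assumptions.

```python
def minmax_normalize(scores):
    """Scale one ranker's raw scores to [0, 1] so rankers are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)  # no signal: treat all docs as neutral
    return [(s - lo) / (hi - lo) for s in scores]

def soft_labels(ranker_scores):
    """Crowdsourcing-style labels: average each document's normalized
    score across all rankers ("voters")."""
    cols = zip(*(minmax_normalize(s) for s in ranker_scores))
    return [sum(col) / len(ranker_scores) for col in cols]

def filter_disagreements(ranker_scores, threshold=0.2):
    """Keep only documents where the rankers roughly agree, i.e. the
    spread of normalized scores stays below an (assumed) threshold."""
    cols = zip(*(minmax_normalize(s) for s in ranker_scores))
    return [i for i, col in enumerate(cols) if max(col) - min(col) <= threshold]

# Toy scores for 4 documents from two unsupervised rankers
bm25 = [12.0, 3.0, 7.5, 0.0]
tfidf = [0.9, 0.2, 0.4, 0.1]

labels = soft_labels([bm25, tfidf])                      # one soft label per document
kept = filter_disagreements([bm25, tfidf], threshold=0.2)  # indices judged reliable
```

On the toy data the two rankers agree on documents 0, 1, and 3 but diverge on document 2, so the filtering step discards it; in the thesis this kind of pruning is what reduces the number of training examples needed to beat the unsupervised baseline.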




A free PDF is provided; last updated on 12/3/2019.

This paper was published in UT Digital Repository.
