Building Retrieval Systems for the ClueWeb22-B Corpus
The ClueWeb22 dataset, containing nearly 10 billion documents, was released in
2022 to support academic and industry research. The goal of this project was to
build retrieval baselines for the English section of the "super head" part
(category B) of this dataset. The research community can use these baselines
as points of comparison for their own systems and to generate data for training
and evaluating new retrieval and ranking algorithms. The report covers sparse
and dense first-stage retrieval as well as neural rerankers implemented for this
dataset. These systems are available as a service on a Carnegie Mellon
University cluster.