Inferring the score distributions of relevant and non-relevant documents is an essential task for many IR applications (e.g., information filtering, recall-oriented IR, meta-search, distributed IR). Accurate modeling of score distributions is the basis of any such inference, and numerous score distribution models have therefore been proposed in the literature. Most of these models were proposed on the basis of empirical evidence and goodness-of-fit. In this work, we model score distributions in a different, more systematic manner. We start with a basic assumption on the distribution of terms in a document. Following the transformations applied to term frequencies by two basic ranking functions, BM25 and Language Models, we derive the distribution of the produced scores for all documents. We then focus on the relevant documents and detach our analysis from particular ranking functions. Instead, we consider a model for precision-recall curves and present a general mathematical framework which, given any score distribution for all retrieved documents, produces an analytical formula for the score distribution of relevant documents that is consistent with precision-recall curves following that model. In particular, assuming a Gamma distribution for all retrieved documents, we show that the derived distribution for the relevant documents resembles a Gaussian distribution with a heavy right tail.
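
The consistency idea in the abstract can be illustrated with a minimal sketch. It relies only on the definitions of precision and recall at a score threshold s: with N retrieved documents, R of them relevant, and generality G = R / N, precision(s) = G * recall(s) / S_all(s), where S_all(s) is the fraction of all documents scoring above s. The precision-recall curve used below, precision(r) = 1 - (1 - G) * r, is a hypothetical placeholder chosen because it can be inverted in closed form; it is not the model used in the paper. The Gamma shape of 2 and all function names are likewise assumptions made for illustration.

```python
import math


def gamma_survival_k2(s, scale=1.0):
    """P(score > s) for a Gamma distribution with shape 2 (closed form).

    Shape 2 is an illustrative assumption; it avoids needing a
    special-function library for the incomplete gamma function.
    """
    if s <= 0:
        return 1.0
    x = s / scale
    return math.exp(-x) * (1.0 + x)


def recall_at(s, generality, scale=1.0):
    """Recall at threshold s under the assumed linear precision-recall curve.

    Assumed curve (hypothetical): precision(r) = 1 - (1 - G) * r.
    Consistency requires precision = G * recall / S_all(s).
    Solving both for recall gives recall = S_all / (G + (1 - G) * S_all).
    """
    s_all = gamma_survival_k2(s, scale)
    return s_all / (generality + (1.0 - generality) * s_all)


def relevant_density(s, generality, scale=1.0, h=1e-5):
    """Density of relevant scores: minus the derivative of recall in s.

    Uses a central finite difference rather than an analytical formula.
    """
    upper = recall_at(s + h, generality, scale)
    lower = recall_at(s - h, generality, scale)
    return -(upper - lower) / (2.0 * h)


if __name__ == "__main__":
    G = 0.01  # hypothetical: 1% of retrieved documents are relevant
    for s in (1.0, 3.0, 6.0, 9.0):
        print(s, recall_at(s, G), relevant_density(s, G))
```

Because recall(s) is the survival function of the relevant scores, its negated derivative is their density. Swapping in a different precision-recall model changes only the inversion step inside `recall_at`.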

Topics:
Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval models
General Terms: Theory, Measurement
Keywords: information retrieval, score distribution, density functions, recall-precision curve

Year: 2011

OAI identifier:
oai:CiteSeerX.psu:10.1.1.190.9000

Provided by:
CiteSeerX
