A probabilistic justification for using tf.idf term weighting in information retrieval

Authors: D. Hiemstra
Publication date: 1 January 2000
Publisher: Springer Verlag

Abstract

This paper presents a new probabilistic model of information retrieval. Its most important modeling assumption is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but it is essential in the field of statistical natural language processing. The paper uses advances already made in statistical natural language processing to formulate a probabilistic justification for using tf.idf term weighting. It shows that the new probabilistic interpretation of tf.idf term weighting might lead to a better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination level ranking. A pilot experiment on the TREC collection shows that the linguistically motivated weighting algorithm outperforms the popular BM25 weighting algorithm.
