
    Within-Document Term-Based Index Pruning with Statistical Hypothesis Testing

    Abstract. Document-centric static index pruning methods provide smaller indexes and faster query times by dropping some within-document term information from inverted lists. We present a method of pruning inverted lists derived from the formulation of unigram language models for retrieval. Our method is based on the statistical significance of term frequency ratios: using the two-sample two-proportion (2P2N) test, we statistically compare the frequency of occurrence of a word within a given document to the frequency of its occurrence in the collection to decide whether to prune it. Experimental results show that this technique can be used to significantly decrease index size and query time with less compromise to retrieval effectiveness than similar heuristic methods. Furthermore, we give a formal statistical justification for such methods.
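
    The pruning criterion is concrete enough to sketch. Below is a minimal Python illustration of a pooled two-sample two-proportion z-test used as a keep/prune decision for a single posting; the function name, parameter names, and the 0.05 significance level are illustrative assumptions, not details taken from the paper.

        import math

        def prune_term(tf, doc_len, ctf, coll_len, z_threshold=1.96):
            """Decide whether a term's posting in one document can be pruned.

            tf       -- term frequency within the document
            doc_len  -- total term occurrences in the document
            ctf      -- term frequency in the whole collection
            coll_len -- total term occurrences in the collection

            Returns True when the within-document proportion is not
            significantly different from the collection proportion, i.e.
            the posting adds little unigram language-model evidence.
            """
            p1 = tf / doc_len                # within-document proportion
            p2 = ctf / coll_len              # collection proportion
            # Pooled proportion under the null hypothesis p1 == p2.
            p = (tf + ctf) / (doc_len + coll_len)
            se = math.sqrt(p * (1 - p) * (1 / doc_len + 1 / coll_len))
            if se == 0.0:
                return False                 # degenerate case: keep the posting
            z = (p1 - p2) / se
            # |z| < 1.96 fails to reject the null at the two-sided 0.05 level.
            return abs(z) < z_threshold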

    An efficient computation of the multiple-Bernoulli language model

    The Multiple Bernoulli (MB) Language Model has generally been considered too computationally expensive for practical purposes and has been superseded by the more efficient multinomial approach. While the model has many attractive properties, little is actually known about the retrieval effectiveness of the MB model due to its high cost of execution. In this paper, we show how an efficient implementation of this model can be achieved. The resulting method is comparable in terms of efficiency to other standard term-matching algorithms (such as the vector space model, BM25, and the multinomial Language Model).
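
    The abstract does not say how the efficient implementation is achieved, but the usual cost in MB scoring is the product over every vocabulary term absent from the query: P(Q|D) = prod_{t in Q} P(t|D) * prod_{t not in Q} (1 - P(t|D)). One standard algebraic rearrangement, sketched below and not necessarily the paper's technique, folds the vocabulary-wide factor into a per-document constant precomputed at indexing time, leaving query-time work proportional to the query length, comparable to multinomial scoring. All names here are illustrative.

        import math

        def mb_doc_constant(p_t_d):
            """Precompute sum over the vocabulary of log(1 - P(t|D)) for one
            document; p_t_d maps every vocabulary term to smoothed P(t|D) < 1."""
            return sum(math.log(1.0 - p) for p in p_t_d.values())

        def mb_score(query_terms, p_t_d, doc_constant):
            """Score a document under the Multiple-Bernoulli model via

                log P(Q|D) = doc_constant
                           + sum_{t in Q} [log P(t|D) - log(1 - P(t|D))]

            Only the (distinct) query terms are touched at query time; the
            vocabulary-wide sum lives in the precomputed doc_constant.
            """
            score = doc_constant
            for t in set(query_terms):
                p = p_t_d[t]
                score += math.log(p) - math.log(1.0 - p)
            return score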