Probabilistic models of information retrieval based on measuring the divergence from randomness
We introduce a framework for deriving probabilistic models of Information Retrieval. The models are nonparametric models of IR obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes we study the binomial distribution and Bose--Einstein statistics. We define two types of term-frequency normalization for tuning term weights in the document--query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model.
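To make the abstract's recipe concrete, the following is a minimal sketch of one divergence-from-randomness weight in the spirit of the GL2 configuration: a geometric Bose--Einstein randomness model, a Laplace first normalisation, and a length-based second normalisation. The function name, the constant c, and the example statistics are all illustrative assumptions, not the paper's reference implementation.

```python
import math

def dfr_gl2_weight(tf, term_coll_freq, n_docs, doc_len, avg_doc_len, c=1.0):
    """Illustrative DFR-style weight (GL2-like configuration).

    tf             -- raw term frequency in the document
    term_coll_freq -- total occurrences of the term in the collection
    n_docs         -- number of documents in the collection
    """
    # Second normalisation: rescale tf by document length.
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)
    # Mean term frequency per document under the random process.
    lam = term_coll_freq / n_docs
    # Informative content under the geometric (Bose--Einstein) model:
    # divergence of the observed distribution from randomness.
    inf1 = -math.log2(1.0 / (1.0 + lam)) - tfn * math.log2(lam / (1.0 + lam))
    # First normalisation (Laplace): information gain once the term is
    # accepted as a descriptor of the document.
    gain = 1.0 / (tfn + 1.0)
    return gain * inf1
```

For a term that is rarer than one occurrence per document on average (lam < 1), the weight grows with term frequency but with diminishing returns, which is the qualitative behaviour the framework shares with tf-idf.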
Fisher's exact test explains a popular metric in information retrieval
Term frequency-inverse document frequency, or tf-idf for short, is a
numerical measure that is widely used in information retrieval to quantify the
importance of a term of interest in one out of many documents. While tf-idf was
originally proposed as a heuristic, much work has been devoted over the years
to placing it on a solid theoretical foundation. Following in this tradition,
we here advance the first justification for tf-idf that is grounded in
statistical hypothesis testing. More precisely, we first show that the
one-tailed version of Fisher's exact test, also known as the hypergeometric
test, corresponds well with a common tf-idf variant on selected real-data
information retrieval tasks. We then set forth a mathematical argument that
suggests the tf-idf variant approximates the negative logarithm of the
one-tailed Fisher's exact test P-value (i.e., a hypergeometric distribution
tail probability). The Fisher's exact test interpretation of this common tf-idf
variant furnishes the working statistician with a ready explanation of tf-idf's
long-established effectiveness.

Comment: 26 pages, 4 figures, 1 table, minor revision
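The quantity the abstract refers to, the one-tailed Fisher exact test P-value, is a hypergeometric tail probability, which can be computed exactly with integer arithmetic. The sketch below does so from scratch; the example numbers are invented for illustration and are not taken from the paper.

```python
from math import comb, log

def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(M, n, N): a population of M
    items, n of them marked, N drawn without replacement.  This tail
    probability is the one-tailed Fisher exact test P-value for the
    corresponding 2x2 contingency table."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, upper + 1)) / total

# Invented example: a corpus of M = 10_000 term tokens, n = 50 of them
# being the query term, and a document of N = 200 tokens containing the
# term k = 5 times (five times the expected count of 1).
p = hypergeom_sf(5, 10_000, 50, 200)
neg_log_p = -log(p)  # the quantity the paper argues tf-idf approximates
```

The paper's claim is that a common tf-idf variant approximates neg_log_p, so a surprisingly frequent term in a document yields both a small P-value and a large tf-idf score.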
TopSig: Topology Preserving Document Signatures
Performance comparisons between File Signatures and Inverted Files for text
retrieval have previously shown several significant shortcomings of file
signatures relative to inverted files. The inverted file approach underpins
most state-of-the-art search engine algorithms, such as Language and
Probabilistic models. It has been widely accepted that traditional file
signatures are inferior alternatives to inverted files. This paper describes
TopSig, a new approach to the construction of file signatures. Many advances in
semantic hashing and dimensionality reduction have been made in recent times,
but these were not so far linked to general purpose, signature file based,
search engines. This paper introduces a different signature file approach that
builds upon and extends these recent advances. We are able to demonstrate
significant improvements in the performance of signature-file-based indexing and retrieval, performance comparable to that of state-of-the-art inverted-file-based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings, and from the theoretical perspective this positions the file-signatures model in the class of vector space retrieval models.

Comment: 12 pages, 8 figures, CIKM 201
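The connection the abstract draws between signatures, semantic hashing and dimensionality reduction can be illustrated with a generic random-projection signature (a SimHash-style construction). This is a hedged sketch of the general idea only; TopSig's actual signature construction and all names below are illustrative, not the paper's algorithm.

```python
import hashlib

def term_signature(term, bits=64):
    """Deterministic pseudo-random +/-1 vector for a term, derived from
    its hash (an illustrative stand-in for a random indexing vector)."""
    h = hashlib.blake2b(term.encode(), digest_size=bits // 8).digest()
    bitstring = int.from_bytes(h, "big")
    return [1 if (bitstring >> i) & 1 else -1 for i in range(bits)]

def document_signature(term_freqs, bits=64):
    """Sum frequency-weighted term vectors, then keep only the sign of
    each dimension, yielding a compact binary document signature."""
    acc = [0.0] * bits
    for term, tf in term_freqs.items():
        vec = term_signature(term, bits)
        for i in range(bits):
            acc[i] += tf * vec[i]
    return sum((1 << i) for i in range(bits) if acc[i] > 0)

def hamming(a, b):
    """Distance between two signatures = number of differing bits."""
    return bin(a ^ b).count("1")
```

Ranking then reduces to cheap bitwise operations: candidate documents are ordered by Hamming distance between their signatures and the query signature.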
Probabilistic retrieval models - relationships, context-specific application, selection and implementation
Retrieval models are the core components of information retrieval systems, which guide the document
and query representations, as well as the document ranking schemes. TF-IDF, binary
independence retrieval (BIR) model and language modelling (LM) are three of the most influential
contemporary models due to their stability and performance. The BIR model and LM
have probabilistic theory as their basis, whereas TF-IDF is viewed as a heuristic model, whose
theoretical justification always fascinates researchers.
This thesis first investigates the parallel derivation of the BIR model, LM and the Poisson model with respect to event spaces, relevance assumptions and ranking rationales. It establishes a bridge between the BIR model and LM, and derives TF-IDF from the probabilistic framework.
Then the thesis presents the probabilistic logical modelling of the retrieval models. Various ways to estimate and aggregate probabilities, as well as alternative implementations of non-probabilistic operators, are demonstrated, and the typical models have been implemented.
The next contribution concerns the usage of context-specific frequencies, i.e., frequencies counted based on assorted element types or within different text scopes. The hypothesis is that they can help to rank the elements in structured document retrieval. The thesis applies context-specific frequencies to the term-weighting schemes of these models, and the outcome is a generalised retrieval model with regard to both element and document ranking.
The retrieval models behave differently on the same query set: for some queries one model performs better, while for other queries another model is superior. Therefore, one idea for improving the overall performance of a retrieval system is to choose, for each query, the model that is likely to perform best. This thesis proposes and empirically explores a model selection method based on the correlation between query features and query performance, which contributes to the methodology of dynamically choosing a model.
In summary, this thesis contributes a study of probabilistic models and their relationships,
the probabilistic logical modelling of retrieval models, the usage and effect of context-specific
frequencies in models, and the selection of retrieval models.
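The per-query model selection idea can be sketched in a few lines. Everything here is a hypothetical illustration: the feature (average query-term IDF), the threshold, and the model names are invented, whereas the thesis correlates richer query features with measured per-model performance.

```python
import math

def select_model(query_terms, doc_freqs, n_docs, models):
    """Pick a retrieval model per query from a simple query feature.

    doc_freqs -- document frequency of each term (hypothetical corpus stats)
    models    -- mapping from model name to a scoring function or label
    """
    # Invented feature: average IDF of the query terms, a rough proxy
    # for how specific the query is.
    idfs = [math.log(n_docs / max(doc_freqs.get(t, 1), 1))
            for t in query_terms]
    avg_idf = sum(idfs) / len(idfs)
    # Invented selection rule: specific (high-IDF) queries go to a
    # BIR-style model, general queries to the language model.
    return models["bir"] if avg_idf > 2.0 else models["lm"]
```

In the thesis's framing, the threshold would instead be learned from the observed correlation between the query feature and each model's retrieval performance.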
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field.
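The first of the survey's three matrix classes, the term-document matrix, is easy to show end to end. The toy corpus below is invented for illustration; document similarity is the cosine between matrix columns.

```python
import math

def term_document_matrix(docs):
    """Rows = terms, columns = documents, entries = raw term counts."""
    vocab = sorted({t for d in docs for t in d.split()})
    return vocab, [[d.split().count(t) for d in docs] for t in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

docs = ["the cat sat", "the cat ran", "stock market crash"]
vocab, M = term_document_matrix(docs)
# Document vectors are the columns of the term-document matrix.
cols = list(zip(*M))
```

The two cat sentences share terms and so have a high cosine, while the unrelated third document has cosine zero with both; word-context and pair-pattern matrices apply the same machinery to different row/column choices.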