Probability models for information retrieval based on divergence from randomness
This thesis devises a novel methodology, based on probability theory, for constructing term-weighting models for Information Retrieval. Our term-weighting functions are created within a general framework made up of three components, each built independently of the others. We obtain the term-weighting functions from the general model in a purely theoretical way, by instantiating each component with different probability distributions.
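As a concrete illustration of such a three-component construction, here is a minimal sketch of an InL2-style weight from the divergence-from-randomness family; the function name and default parameter are illustrative choices, not the thesis's notation. Component 1 is the randomness model I(n), component 2 the Laplace first normalization, and component 3 the term-frequency Normalization 2.

```python
import math

def inl2_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, c=1.0):
    """Sketch of a DFR InL2-style term weight built from three
    independent components (names and defaults are illustrative)."""
    # Component 3: normalize tf with respect to document length (Normalization 2)
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)
    # Component 1: information content under the I(n) randomness model
    inf1 = tfn * math.log2((n_docs + 1.0) / (doc_freq + 0.5))
    # Component 2: Laplace after-effect (first normalization)
    inf2 = 1.0 / (tfn + 1.0)
    return inf2 * inf1
```

Because the components are independent, swapping the randomness model or the normalization yields a different member of the same family without touching the other two factors.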
The thesis begins by investigating the nature of the statistical inference involved in Information Retrieval. We explore the estimation problem underlying the process of sampling. De Finetti’s theorem is used to show how to convert the frequentist approach into Bayesian inference, and we present and employ the derived estimation techniques in the context of Information Retrieval.
We initially pay close attention to the construction of the basic sample spaces of Information Retrieval. The notion of single or multiple sampling from different populations in the context of Information Retrieval is extensively discussed and used throughout the thesis. The language modelling approach and the standard probabilistic model are studied under the same foundational view and are experimentally compared to the divergence-from-randomness approach.
In revisiting the main information retrieval models in the literature, we show that even the language modelling approach can be exploited to assign term-frequency normalization to the divergence-from-randomness models. We finally introduce a novel framework for query expansion. This framework is based on the divergence-from-randomness models and can be applied to arbitrary IR models, including divergence-based, language modelling, and probabilistic models. We have conducted a very large number of experiments, and the results show that the framework generates highly effective Information Retrieval models.
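A minimal sketch of the kind of divergence-from-randomness query-expansion weighting described above is the Bo1 (Bose-Einstein) model; the function name and the surrounding usage are illustrative assumptions, not the thesis's exact presentation.

```python
import math

def bo1_term_weight(tf_in_top_docs, coll_freq, n_docs):
    """Bo1 (Bose-Einstein) DFR weight for a candidate expansion term,
    given its frequency in the pseudo-relevant top-ranked documents."""
    p_n = coll_freq / n_docs  # expected term frequency under randomness
    return (tf_in_top_docs * math.log2((1.0 + p_n) / p_n)
            + math.log2(1.0 + p_n))
```

Candidate terms drawn from the top pseudo-relevant documents are ranked by this weight, and the highest-weighted terms are appended to the query; terms that are frequent in the top documents but rare in the collection diverge most from randomness and score highest.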
PROPAGATE: a seed propagation framework to compute Distance-based metrics on Very Large Graphs
We propose PROPAGATE, a fast approximation framework for estimating distance-based metrics on very large graphs, such as the (effective) diameter, the (effective) radius, or the average distance, within a small error. The framework assigns seeds to nodes and propagates them in a BFS-like fashion, computing the neighborhood sets until we obtain either the whole vertex set (for the diameter) or a given percentage of it (for the effective diameter). At each iteration, we derive compressed Boolean representations of the neighborhood sets discovered so far. The PROPAGATE framework yields two algorithms: PROPAGATE-P, which propagates all the seeds in parallel, and PROPAGATE-S, which propagates the seeds sequentially. For each node, the compressed representation of the PROPAGATE-P algorithm requires one bit per seed, while that of PROPAGATE-S requires only a single bit.
Both algorithms compute the average distance, the effective diameter, the diameter, and the connectivity rate within a small error with high probability: for a suitable number of sampled nodes, the errors for the average distance, the effective diameter, the diameter, and the connectivity rate are bounded by quantities that depend on the diameter of the graph and on a measure of its connectivity. Each propagation round takes time linear in the number of edges of the graph. The experimental results show that the PROPAGATE framework improves on the current state of the art in both accuracy and speed. Moreover, we experimentally show that PROPAGATE-S is also very efficient for solving the All-Pairs Shortest Path problem on very large graphs.
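The core propagation step can be sketched as follows; this is a hypothetical simplification in the spirit of PROPAGATE-P, where each sampled seed is one bit and bitmasks are OR-propagated to neighbors in BFS-like rounds. The function name, seed count, and the 0.9 effective-diameter threshold are my own choices, not the paper's.

```python
import random

def propagate_estimate(adj, num_seeds=4, effective=0.9, seed=0):
    """Estimate diameter, effective diameter, and average distance by
    OR-propagating seed bitmasks (a PROPAGATE-P-style sketch)."""
    rng = random.Random(seed)
    nodes = list(adj)
    seeds = rng.sample(nodes, min(num_seeds, len(nodes)))
    mask = {v: 0 for v in nodes}
    for i, s in enumerate(seeds):
        mask[s] = 1 << i                      # seed s carries bit i
    # reached[h] = number of (node, seed) pairs within distance h
    reached = [sum(bin(m).count("1") for m in mask.values())]
    h = 0
    while True:
        new_mask = {}
        for v in nodes:
            m = mask[v]
            for u in adj[v]:                  # OR in the neighbors' bits
                m |= mask[u]
            new_mask[v] = m
        if new_mask == mask:                  # no bit moved: converged
            break
        mask = new_mask
        h += 1
        reached.append(sum(bin(m).count("1") for m in mask.values()))
    total = reached[-1]
    diameter = h                              # last round with a new pair
    # effective diameter: first h covering >= `effective` of all pairs
    eff = next(i for i, r in enumerate(reached) if r >= effective * total)
    # average distance over reached (node, seed) pairs
    avg = (sum(i * (reached[i] - reached[i - 1])
               for i in range(1, len(reached))) / total) if total else 0.0
    return diameter, eff, avg
```

On a small graph where every node is a seed this recovers the exact values; on large graphs only a sample of seeds is propagated and the quantities become estimates, which is where the error bounds above come into play.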
IIR 2012 - Italian Information Retrieval Workshop, Proceedings of the 3rd Italian Information Retrieval Workshop, Bari, Italy, January 26-27, 2012
The purpose of the Italian Information Retrieval (IIR) workshop series is to provide an international
meeting forum for stimulating and disseminating research in Information Retrieval and related
disciplines, where researchers, especially early stage Italian researchers, can exchange ideas and
present results in an informal way.
IIR 2012 took place in Bari, Italy, at the Department of Computer Science, University of Bari Aldo
Moro, on January 26-27, 2012, following the first two successful editions in Padua (2010)
and Milan (2011).
We received 37 submissions, including full and short original papers with new research results, as
well as short papers describing ongoing projects or presenting already published results. Most
contributors to IIR 2012 were PhD students and early stage researchers. Each submission was
reviewed by at least two members of the Program Committee, and 24 papers were selected on
the basis of originality, technical depth, style of presentation, and impact.
The 24 papers published in these proceedings cover six main topics: ranking, text classification,
evaluation and geographic information retrieval, filtering, content analysis, and information
retrieval applications. Twenty papers are written in English and four in Italian. We also include an
abstract of the invited talk given by Roberto Navigli (Department of Computer Science, University
of Rome “La Sapienza”), who presented a novel approach to Web search result clustering based on
the automated discovery of word senses from raw text.
Towards a Better Understanding of the Relationship Between Probabilistic Models in IR
Probability of relevance (PR) models are generally assumed to implement the Probability Ranking Principle (PRP) of IR, and recent publications claim that PR models and language models are similar. However, a careful analysis reveals two gaps in the chain of reasoning behind this statement. First, the PRP considers the relevance of particular documents, whereas PR models consider the relevance of any query-document pair. Second, unlike PR models, language models consider draws of terms and documents. We bridge the first gap by showing how the probability measure of PR models can be used to define the probabilistic model of the PRP. Furthermore, we argue that, given the differences between PR models and language models, the second gap cannot be bridged at the probabilistic model level. We instead define a new PR model based on logistic regression, whose score function is similar to that of the query-likelihood model. The performance of the two models is strongly correlated, hence providing a bridge for the second gap at the functional and ranking level. Understanding language models in relation to logistic regression models opens up ample new research directions, which we propose as future work.
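The structural similarity argued above can be illustrated with a toy sketch: one standard instantiation of the query-likelihood model (Dirichlet smoothing) next to a logistic-regression-style linear score over per-term log features. Both are sums over query terms of term-level functions. The feature set, weights, and function names here are illustrative assumptions, not the paper's exact model.

```python
import math

def query_likelihood(query, doc, coll_prob, mu=2000):
    """Dirichlet-smoothed query-likelihood log score (one standard form)."""
    dl = sum(doc.values())
    score = 0.0
    for t in query:
        p = (doc.get(t, 0) + mu * coll_prob.get(t, 1e-9)) / (dl + mu)
        score += math.log(p)
    return score

def logistic_pr_score(query, doc, coll_prob, w=(1.0, -1.0)):
    """Hypothetical logistic-regression-style PR score: a weighted sum of
    per-term log features, mirroring the query-likelihood structure."""
    score = 0.0
    for t in query:
        x_tf = math.log(1.0 + doc.get(t, 0))       # term-frequency feature
        x_cf = math.log(coll_prob.get(t, 1e-9))    # collection-frequency feature
        score += w[0] * x_tf + w[1] * x_cf
    return score
```

With any reasonable weights, documents matching the query terms outrank non-matching ones under both scores, which is the functional-level correspondence the abstract describes.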