
    Probability models for information retrieval based on divergence from randomness

    This thesis devises a novel methodology, based on probability theory, for constructing term-weighting models of Information Retrieval. Our term-weighting functions are created within a general framework made up of three components, each built independently of the others. We obtain the term-weighting functions from the general model in a purely theoretical way, by instantiating each component with different probability distributions. The thesis begins by investigating the nature of the statistical inference involved in Information Retrieval. We explore the estimation problem underlying the process of sampling. De Finetti’s theorem is used to show how to convert the frequentist approach into Bayesian inference, and we derive and employ the resulting estimation techniques in the context of Information Retrieval. We first pay particular attention to the construction of the basic sample spaces of Information Retrieval. The notion of single or multiple sampling from different populations in the context of Information Retrieval is discussed extensively and used throughout the thesis. The language modelling approach and the standard probabilistic model are studied under the same foundational view and are experimentally compared to the divergence-from-randomness approach. In revisiting the main information retrieval models in the literature, we show that even the language modelling approach can be exploited to assign term-frequency normalization to the models of divergence from randomness. We finally introduce a novel framework for query expansion. This framework is based on the models of divergence from randomness, and it can be applied to arbitrary models of IR, including divergence-based, language modelling, and probabilistic models. We have conducted a very large number of experiments, and the results show that the framework generates highly effective Information Retrieval models.
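
    The three-component construction described above can be made concrete with one well-known divergence-from-randomness instantiation, InL2 (an inverse-document-frequency model of randomness, Laplace first normalization, and logarithmic term-frequency normalization). The sketch below is illustrative only: the function and parameter names are ours, and the thesis derives many other instantiations from the same framework.

```python
import math

def inl2_score(tf, df, N, doc_len, avg_doc_len, c=1.0):
    """DFR InL2 weight of one term in one document (illustrative sketch).

    tf          -- raw frequency of the term in the document
    df          -- number of documents containing the term
    N           -- number of documents in the collection
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the collection
    c           -- free parameter of the length normalization
    """
    # Second (length) normalization: rescale tf to a standard document length.
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)
    # First component: information content of tfn under an
    # inverse-document-frequency model of randomness.
    inf1 = tfn * math.log2((N + 1) / (df + 0.5))
    # Laplace first normalization: the factor tfn / (tfn + 1)
    # discounts terms whose frequency is already well explained.
    return inf1 / (tfn + 1.0)

def dfr_score(query_terms, doc_tf, df, N, doc_len, avg_doc_len, c=1.0):
    """Sum the per-term InL2 weights over query terms present in the document."""
    return sum(
        inl2_score(doc_tf[t], df[t], N, doc_len, avg_doc_len, c)
        for t in query_terms if doc_tf.get(t, 0) > 0
    )
```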

    PROPAGATE: a seed propagation framework to compute Distance-based metrics on Very Large Graphs

    We propose PROPAGATE, a fast approximation framework to estimate distance-based metrics on very large graphs, such as the (effective) diameter, the (effective) radius, or the average distance, within a small error. The framework assigns seeds to nodes and propagates them in a BFS-like fashion, computing the neighborhood sets until we obtain either the whole vertex set (for the diameter) or a given percentage of it (for the effective diameter). At each iteration, we derive compressed Boolean representations of the neighborhood sets discovered so far. The PROPAGATE framework yields two algorithms: PROPAGATE-P, which propagates all the $s$ seeds in parallel, and PROPAGATE-S, which propagates the seeds sequentially. For each node, the compressed representation of the PROPAGATE-P algorithm requires $s$ bits, while that of PROPAGATE-S requires only $1$ bit. Both algorithms compute the average distance, the effective diameter, the diameter, and the connectivity rate within a small error with high probability: for any $\varepsilon > 0$ and using $s = \Theta\left(\frac{\log n}{\varepsilon^2}\right)$ sample nodes, the error for the average distance is bounded by $\xi = \frac{\varepsilon \Delta}{\alpha}$, the errors for the effective diameter and the diameter are bounded by $\xi = \frac{\varepsilon}{\alpha}$, and the error for the connectivity rate is bounded by $\varepsilon$, where $\Delta$ is the diameter and $\alpha$ is a measure of the connectivity of the graph. The time complexity is $\mathcal{O}\left(m \Delta \frac{\log n}{\varepsilon^2}\right)$, where $m$ is the number of edges of the graph. The experimental results show that the PROPAGATE framework improves on the current state of the art in both accuracy and speed. Moreover, we show experimentally that PROPAGATE-S is also very efficient for solving the All-Pairs Shortest Path problem in very large graphs.
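
    To make the propagation scheme concrete, here is a minimal, uncompressed sketch of the parallel variant's core loop: each node keeps one bit per seed and, at every round, ORs in its neighbors' bitmasks, so after t rounds a node's mask holds exactly the seeds within distance t. All names are ours, plain Python integers stand in for the paper's compressed Boolean signatures, and the graph is assumed undirected with every node present as a key of the adjacency list.

```python
import math
from collections import defaultdict

def propagate_p(adj, seeds):
    """BFS-like seed propagation (sketch of the PROPAGATE-P idea).

    adj   -- adjacency list: node -> iterable of neighbors (undirected)
    seeds -- list of s sample nodes; masks carry one bit per seed
    Returns dist[v][i] = distance from seeds[i] to v, for reached pairs.
    """
    mask = {v: 0 for v in adj}
    dist = defaultdict(dict)
    for i, s in enumerate(seeds):
        mask[s] |= 1 << i
        dist[s][i] = 0
    t = 0
    while True:
        t += 1
        new_mask, changed = {}, False
        for v, nbrs in adj.items():
            m = mask[v]
            for u in nbrs:           # synchronous update: read old masks only
                m |= mask[u]
            new_mask[v] = m
            fresh = m & ~mask[v]     # seed bits reaching v for the first time
            if fresh:
                changed = True
                i = 0
                while fresh:
                    if fresh & 1:
                        dist[v][i] = t
                    fresh >>= 1
                    i += 1
        mask = new_mask
        if not changed:              # propagation saturated: we are done
            return dist

def estimates(dist):
    """Average distance, effective diameter (90th percentile of the sampled
    distances), and diameter, estimated from the seed-to-node distances."""
    ds = sorted(d for per_node in dist.values() for d in per_node.values())
    avg = sum(ds) / len(ds)
    eff = ds[int(math.ceil(0.9 * len(ds))) - 1]
    return avg, eff, ds[-1]
```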

    SIGIR 2010 workshop program overview


    IIR 2012 - Italian Information Retrieval Workshop, Proceedings of the 3rd Italian Information Retrieval Workshop Bari, Italy, January 26-27, 2012

    The purpose of the Italian Information Retrieval (IIR) workshop series is to provide an international meeting forum for stimulating and disseminating research in Information Retrieval and related disciplines, where researchers, especially early stage Italian researchers, can exchange ideas and present results in an informal way. IIR 2012 took place in Bari, Italy, at the Department of Computer Science, University of Bari Aldo Moro, on January 26-27, 2012, following the first two successful editions in Padua (2010) and Milan (2011).

    We received 37 submissions, including full and short original papers with new research results, as well as short papers describing ongoing projects or presenting already published results. Most contributors to IIR 2012 were PhD students and early stage researchers. Each submission was reviewed by at least two members of the Program Committee, and 24 papers were selected on the basis of originality, technical depth, style of presentation, and impact.

    The 24 papers published in these proceedings cover six main topics: ranking, text classification, evaluation and geographic information retrieval, filtering, content analysis, and information retrieval applications. Twenty papers are written in English and four in Italian. We also include an abstract of the invited talk given by Roberto Navigli (Department of Computer Science, University of Rome “La Sapienza”), who presented a novel approach to Web search result clustering based on the automated discovery of word senses from raw text.

    Towards a Better Understanding of the Relationship Between Probabilistic Models in IR

    Probability of relevance (PR) models are generally assumed to implement the Probability Ranking Principle (PRP) of IR, and recent publications claim that PR models and language models are similar. However, a careful analysis reveals two gaps in the chain of reasoning behind this statement. First, the PRP considers the relevance of particular documents, whereas PR models consider the relevance of any query-document pair. Second, unlike PR models, language models consider draws of terms and documents. We bridge the first gap by showing how the probability measure of PR models can be used to define the probabilistic model of the PRP. Furthermore, we argue that, given the differences between PR models and language models, the second gap cannot be bridged at the level of the probabilistic model. We instead define a new PR model based on logistic regression, whose score function is similar to that of the query likelihood model. The performance of the two models is strongly correlated, thus bridging the second gap at the functional and ranking level. Understanding language models in relation to logistic regression models opens up new research directions, which we propose as future work.
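
    The functional similarity claimed above can be illustrated with a toy sketch: the query-likelihood score is a sum of log term probabilities, and a logistic regression model over those same log-probability features yields a linear score of the same shape, so with unit weights the two rank documents identically. The Dirichlet smoothing, the feature choice, and all names below are our illustrative assumptions, not the paper's exact model.

```python
import math

def query_likelihood(query, doc_tf, doc_len, coll_prob, mu=2000.0):
    """Query-likelihood score with Dirichlet smoothing:
    log p(q|d) = sum over query terms of log((tf + mu*p(t|C)) / (|d| + mu)).
    coll_prob[t] is the collection probability of t and must be positive."""
    return sum(
        math.log((doc_tf.get(t, 0) + mu * coll_prob[t]) / (doc_len + mu))
        for t in query
    )

def logistic_pr_score(query, doc_tf, doc_len, coll_prob,
                      weights, bias=0.0, mu=2000.0):
    """A PR-style score: a learned linear function of the same per-term
    log-probability features, mapped through a sigmoid to a probability
    of relevance. Because the sigmoid is monotone, with all weights 1 and
    bias 0 the ranking coincides with query likelihood."""
    z = bias + sum(
        weights.get(t, 1.0)
        * math.log((doc_tf.get(t, 0) + mu * coll_prob[t]) / (doc_len + mu))
        for t in query
    )
    return 1.0 / (1.0 + math.exp(-z))
```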