2,519 research outputs found

    Probabilistic models of information retrieval based on measuring the divergence from randomness

    Get PDF
    We introduce and create a framework for deriving probabilistic models of Information Retrieval. The models are nonparametric models of IR obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes we study the binomial distribution and Bose--Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document--query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model

    Setting per-field normalisation hyper-parameters for the named-page finding search task

    Get PDF
    Per-field normalisation has been shown to be effective for Web search tasks, e.g. named-page finding. However, per-field normalisation also suffers from having hyper-parameters to tune on a per-field basis. In this paper, we argue that the purpose of per-field normalisation is to adjust the linear relationship between field length and term frequency. We experiment with standard Web test collections, using three document fields, namely the body of the document, its title, and the anchor text of its incoming links. From our experiments, we find that across different collections, the linear correlation values, given by the optimised hyper-parameter settings, are proportional to the maximum negative linear correlation. Based on this observation, we devise an automatic method for setting the per-field normalisation hyper-parameter values without the use of relevance assessment for tuning. According to the evaluation results, this method is shown to be effective for the body and title fields. In addition, the difficulty in setting the per-field normalisation hyper-parameter for the anchor text field is explained

    Measuring concept similarities in multimedia ontologies: analysis and evaluations

    Get PDF
    The recent development of large-scale multimedia concept ontologies has provided a new momentum for research in the semantic analysis of multimedia repositories. Different methods for generic concept detection have been extensively studied, but the question of how to exploit the structure of a multimedia ontology and existing inter-concept relations has not received similar attention. In this paper, we present a clustering-based method for modeling semantic concepts on low-level feature spaces and study the evaluation of the quality of such models with entropy-based methods. We cover a variety of methods for assessing the similarity of different concepts in a multimedia ontology. We study three ontologies and apply the proposed techniques in experiments involving the visual and semantic similarities, manual annotation of video, and concept detection. The results show that modeling inter-concept relations can provide a promising resource for many different application areas in semantic multimedia processing

    Probability models for information retrieval based on divergence from randomness

    Get PDF
    This thesis devises a novel methodology based on probability theory, suitable for the construction of term-weighting models of Information Retrieval. Our term-weighting functions are created within a general framework made up of three components. Each of the three components is built independently from the others. We obtain the term-weighting functions from the general model in a purely theoretic way instantiating each component with different probability distribution forms. The thesis begins with investigating the nature of the statistical inference involved in Information Retrieval. We explore the estimation problem underlying the process of sampling. De Finetti’s theorem is used to show how to convert the frequentist approach into Bayesian inference and we display and employ the derived estimation techniques in the context of Information Retrieval. We initially pay a great attention to the construction of the basic sample spaces of Information Retrieval. The notion of single or multiple sampling from different populations in the context of Information Retrieval is extensively discussed and used through-out the thesis. The language modelling approach and the standard probabilistic model are studied under the same foundational view and are experimentally compared to the divergence-from-randomness approach. In revisiting the main information retrieval models in the literature, we show that even language modelling approach can be exploited to assign term-frequency normalization to the models of divergence from randomness. We finally introduce a novel framework for the query expansion. This framework is based on the models of divergence-from-randomness and it can be applied to arbitrary models of IR, divergence-based, language modelling and probabilistic models included. We have done a very large number of experiment and results show that the framework generates highly effective Information Retrieval models

    Using relevance feedback in expert search

    Get PDF
    In Enterprise settings, expert search is considered an important task. In this search task, the user has a need for expertise - for instance, they require assistance from someone about a topic of interest. An expert search system assists users with their "expertise need" by suggesting people with relevant expertise to the topic of interest. In this work, we apply an expert search approach that does not explicitly rank candidates in response to a query, but instead implicitly ranks candidates by taking into account a ranking of document with respect to the query topic. Pseudo-relevance feedback, aka query expansion, has been shown to improve retrieval performance in adhoc search tasks. In this work, we investigate to which extent query expansion can be applied in an expert search task to improve the accuracy of the generated ranking of candidates. We define two approaches for query expansion, one based on the initial of ranking of documents for the query topic. The second approach is based on the final ranking of candidates. The aims of this paper are two-fold. Firstly, to determine if query expansion can be successfully applied in the expert search task, and secondly, to ascertain if either of the two forms of query expansion can provide robust, improved retrieval performance. We perform a thorough evaluation contrasting the two query expansion approaches in the context of the TREC 2005 and 2006 Enterprise tracks

    Evaluation of probabilistic models for word frequency and information retrieval

    Full text link
    Volume 2 Issue 9 (September 2014

    Non-Compositional Term Dependence for Information Retrieval

    Full text link
    Modelling term dependence in IR aims to identify co-occurring terms that are too heavily dependent on each other to be treated as a bag of words, and to adapt the indexing and ranking accordingly. Dependent terms are predominantly identified using lexical frequency statistics, assuming that (a) if terms co-occur often enough in some corpus, they are semantically dependent; (b) the more often they co-occur, the more semantically dependent they are. This assumption is not always correct: the frequency of co-occurring terms can be separate from the strength of their semantic dependence. E.g. "red tape" might be overall less frequent than "tape measure" in some corpus, but this does not mean that "red"+"tape" are less dependent than "tape"+"measure". This is especially the case for non-compositional phrases, i.e. phrases whose meaning cannot be composed from the individual meanings of their terms (such as the phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction between the frequency and strength of term dependence in IR, we present a principled approach for handling term dependence in queries, using both lexical frequency and semantic evidence. We focus on non-compositional phrases, extending a recent unsupervised model for their detection [21] to IR. Our approach, integrated into ranking using Markov Random Fields [31], yields effectiveness gains over competitive TREC baselines, showing that there is still room for improvement in the very well-studied area of term dependence in IR
    corecore