45 research outputs found

    ON THE USE OF THE DEMPSTER SHAFER MODEL IN INFORMATION INDEXING AND RETRIEVAL APPLICATIONS

    The Dempster-Shafer theory of evidence concerns the elicitation and manipulation of degrees of belief rendered by multiple sources of evidence to a common set of propositions. Information indexing and retrieval applications use a variety of quantitative means, both probabilistic and quasi-probabilistic, to represent and manipulate relevance numbers and index vectors. Recently, several proposals were made to use the Dempster-Shafer model as a relevance calculus in such applications. The paper provides a critical review of these proposals, pointing out several theoretical caveats and suggesting ways to resolve them. The methodology is based on expounding a canonical indexing model whose relevance measures and combination mechanisms are shown to be isomorphic to Shafer's belief functions and to Dempster's rule, respectively. Hence, the paper has two objectives: (i) to describe and resolve some caveats in the way the Dempster-Shafer theory is applied to information indexing and retrieval, and (ii) to provide an intuitive interpretation of the Dempster-Shafer theory, as it unfolds in the simple context of a canonical indexing model. Information Systems Working Papers Series
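
For readers unfamiliar with the combination mechanism the abstract refers to, the following is a minimal sketch of Dempster's rule in its standard textbook form, not the canonical indexing model from the paper. The function name `combine`, the topic labels, and the mass values are all hypothetical and chosen only for illustration. Masses on intersecting focal elements are multiplied and accumulated, and the result is renormalized by one minus the mass that falls on the empty set (the conflict).

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset (a focal element) to its mass.
    """
    combined = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        intersection = b & c
        if intersection:
            # Agreeing evidence supports the intersection of the two sets.
            combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
        else:
            # Disjoint focal elements contribute to the conflict mass.
            conflict += mass_b * mass_c
    if 1.0 - conflict <= 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalize so the combined masses sum to one.
    return {focal: mass / (1.0 - conflict) for focal, mass in combined.items()}


# Hypothetical example: two sources rate one document against three topics.
frame = frozenset({"sports", "politics", "science"})
source_1 = {frozenset({"sports"}): 0.6, frame: 0.4}
source_2 = {frozenset({"sports", "politics"}): 0.7, frame: 0.3}
fused = combine(source_1, source_2)

for focal, mass in sorted(fused.items(), key=lambda item: len(item[0])):
    print(sorted(focal), round(mass, 3))
```

Mass assigned to the whole frame represents ignorance, which is why neither source in the example has to commit all of its belief to a specific topic.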

    The accessibility dimension for structured document retrieval

    Structured document retrieval aims at retrieving the document components that best satisfy a query, instead of merely retrieving pre-defined document units. This paper reports on an investigation of a tf-idf-acc approach, where tf and idf are the classical term frequency and inverse document frequency, and acc is a new parameter, called accessibility, that captures the structure of documents. The tf-idf-acc approach is defined using a probabilistic relational algebra. To investigate the retrieval quality and estimate the acc values, we developed a method that automatically constructs diverse test collections of structured documents from a standard test collection, with which experiments were carried out. The analysis of the experiments provides estimates of the acc values.

    Combination of Evidence in Dempster-Shafer Theory

    Theories of information and uncertainty for the modelling of information retrieval : an application of situation theory and Dempster-Shafer's theory of evidence

    Current information retrieval models offer only simplistic and specific representations of information. Therefore, there is a need for a new formalism able to model information retrieval systems in a more generic manner. In 1986, Van Rijsbergen suggested that such formalisms can be both appropriately and powerfully defined within a logic. The resulting formalism should capture information as it appears in an information retrieval system, and also in any of its inherent forms. The aim of this thesis is to understand the nature of information in information retrieval, and to propose a logic-based model of an information retrieval system that reflects this nature. The first objective of this thesis is to identify essential features of information in an information retrieval system. These are: flow, intensionality, partiality, structure, significance, and uncertainty. It is shown that the first four features are qualitative, whereas the last two are quantitative, and that their modelling requires different frameworks: a theory of information and a theory of uncertainty, respectively. The second objective of this thesis is to determine the appropriate framework for each type of feature, and to develop a method to combine them in a consistent fashion. The combination is based on the Transformation Principle. Many specific attempts have been made to derive an adequate definition of information. The one adopted in this thesis is based on that of Dretske, Barwise, and Devlin, who claimed that there is a primitive notion of information in terms of which a logic can be defined, and subsequently developed a theory of information, namely Situation Theory. Their approach was in accordance with Van Rijsbergen's suggestion of a logic-based formalism for modelling an information retrieval system. This thesis shows that Situation Theory is best at representing all the qualitative features.
Regarding the modelling of the quantitative features of information, this thesis shows that the framework that models them best is the Dempster-Shafer Theory of Evidence, together with the notion of refinement, later introduced by Shafer. The third objective of this thesis is to develop a model of an information retrieval system based on Situation Theory and the Dempster-Shafer Theory of Evidence. This is done in two steps. First, the unstructured model is defined in which the structure and the significance of information are not accounted for. Second, the unstructured model is extended into the structured model, which incorporates the structure and the significance of information. This strategy is adopted because it enables the careful representation of the flow of information to be performed first. The final objective of the thesis is to implement the model and to perform empirical evaluation to assess its validity. The unstructured and the structured models are implemented based on an existing on-line thesaurus, known as WordNet. The experiments performed to evaluate the two models use the National Physical Laboratory standard test collection. The experimental performance obtained was poor, because it was difficult to extract the flow of information from the document set. This was mainly due to the data used in the experimentation which was inappropriate for the test collection. However, this thesis shows that if more appropriate data, for example, indexing tools and thesauri, were available, better performances would be obtained. The conclusion of this work was that Situation Theory, combined with the Dempster-Shafer Theory of Evidence, allows the appropriate and powerful representation of several essential features of information in an information retrieval system. Although its implementation presents some difficulties, the model is the first of its kind to capture, in a general manner, these features within a uniform framework. 
As a result, it can be easily generalized to many types of information retrieval systems (e.g., interactive or multimedia systems) and to many aspects of the retrieval process (e.g., user modelling).

    Relating Dependent Terms in Information Retrieval

    Search engines have become an integral part of our lives. More than one-third of the world's population uses the Internet, and most users turn to a search engine as the quickest way to find the information or product they want. Information retrieval (IR) is the foundation of modern search engines. Traditional IR approaches assume that indexing terms are independent. However, terms occurring in the same context are often dependent, and failing to recognize these dependencies introduces noise (irrelevant documents) into the results. Some studies have proposed integrating term dependencies of different types, such as proximity, co-occurrence, adjacency, and grammatical dependency. In most cases, the dependency models are built separately and then combined with the traditional word-based (unigram) model using a fixed weight. Consequently, they cannot properly capture variable term dependencies and their strength. For example, the dependency between the adjacent words "Black Friday" is more important to consider than that between "road constructions". In this thesis, we study different approaches to capturing term relationships and their dependency strengths. We propose the following methods for monolingual IR and cross-language IR (CLIR). First, we re-examine the combination approach by using different indexing units for Chinese monolingual IR, and then propose a similar method for CLIR between English and Chinese. In addition to the traditional method based on words, we investigate the possibility of using Chinese bigrams and unigrams as translation units. Several translation models from English words to Chinese unigrams, bigrams, and words are built from a parallel corpus. An English query is then translated in several ways, each producing a ranking score, and the final ranking score combines all these types of translation. Second, we incorporate dependencies between terms in our model using the Dempster-Shafer theory of evidence. Every occurrence of a multi-word text fragment in a document is represented as a set that includes all its constituent terms. Probability is assigned to such a set of terms rather than to individual terms. During query evaluation, the probability assigned to a set is redistributed to the related query terms, which allows us to integrate dependency relations between terms into IR. Third, we propose a discriminative language model that integrates different types of term dependency according to their strength and usefulness to IR. We consider adjacency and co-occurrence dependencies at different distances, i.e. bigrams and pairs of terms within text windows of size 2, 4, 8, and 16. The weight of a bigram or a pair of dependent terms in the final model is learned from a set of features using SVM regression. All the proposed methods are evaluated on several English and/or Chinese collections, and the experimental results show that they achieve substantial improvements over state-of-the-art baselines.
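
To illustrate what assigning probability to a set of terms rather than to individual terms can look like, here is a small sketch using the standard Dempster-Shafer belief and plausibility functions. It is not the thesis's redistribution scheme, which the abstract does not specify. The names `belief` and `plausibility`, the toy vocabulary, and the mass values are hypothetical.

```python
def belief(mass, hypothesis):
    """Sum the mass of every focal set fully contained in the hypothesis."""
    return sum(m for focal, m in mass.items() if focal <= hypothesis)


def plausibility(mass, hypothesis):
    """Sum the mass of every focal set that overlaps the hypothesis."""
    return sum(m for focal, m in mass.items() if focal & hypothesis)


# Hypothetical document evidence: mass sits on sets of terms, so the
# fragment "black friday" is treated as one unit instead of two words.
vocabulary = frozenset({"black", "friday", "sale", "road"})
document_mass = {
    frozenset({"black", "friday"}): 0.5,
    frozenset({"sale"}): 0.3,
    vocabulary: 0.2,  # uncommitted mass (ignorance)
}

single_term = frozenset({"friday"})
phrase = frozenset({"black", "friday"})

print(belief(document_mass, single_term), plausibility(document_mass, single_term))
print(belief(document_mass, phrase), plausibility(document_mass, phrase))
```

In this toy setup the single term "friday" receives no committed belief on its own, while the phrase does, which is one way to see how set-valued evidence keeps dependent terms together.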

    Georeferencing Flickr photos using language models at different levels of granularity: an evidence based approach

    The topic of automatically assigning geographic coordinates to Web 2.0 resources based on their tags has recently gained considerable attention. However, the coordinates that are produced by automated techniques are necessarily variable, since not all resources are described by tags that are sufficiently descriptive. Thus there is a need for adaptive techniques that assign locations to photos at the right level of granularity, or, in some cases, even refrain from making any estimations regarding location at all. To this end, we consider the idea of training language models at different levels of granularity, and combining the evidence provided by these language models using Dempster and Shafer’s theory of evidence. We provide experimental results which clearly confirm that the increased spatial awareness that is thus gained allows us to make better informed decisions, and moreover increases the overall accuracy of the individual language models