Exploiting the similarity of non-matching terms at retrieval time
In classic information retrieval systems, a relevant document will not be retrieved in response to a query if the document and query representations do not share at least one term. This problem, known as 'term mismatch', has long been recognised by the information retrieval community, and a number of possible solutions have been proposed. Here I present a preliminary investigation into a new class of retrieval models that attempt to solve the term mismatch problem by exploiting complete or partial knowledge of term similarity in the term space. The use of term similarity can enhance classic retrieval models by taking non-matching terms into account. The theoretical advantages and drawbacks of these models are presented and compared with those of other models tackling the same problem. A preliminary experimental investigation into the performance gain achieved by exploiting term similarity with the proposed models is presented and discussed.
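To make the idea concrete, here is a minimal sketch of similarity-aware matching; it is an illustration under assumed names (`score`, `sim`) and a toy hand-labelled synonym table, not a reproduction of the paper's actual models. Instead of requiring exact term overlap, each query term contributes the best similarity it achieves against any document term:

```python
from collections import Counter

def score(query_terms, doc_terms, sim):
    """sim(q, d) -> similarity in [0, 1]; sim(t, t) == 1 recovers exact matching."""
    doc_tf = Counter(doc_terms)
    total = 0.0
    for q in query_terms:
        # Non-matching terms can still contribute through their similarity.
        total += max((sim(q, d) * tf for d, tf in doc_tf.items()), default=0.0)
    return total

# Toy similarity: exact match scores 1, a hand-labelled synonym pair scores 0.8.
synonyms = {("car", "automobile"), ("automobile", "car")}
sim = lambda q, d: 1.0 if q == d else (0.8 if (q, d) in synonyms else 0.0)

print(score(["car"], ["automobile", "engine"], sim))  # 0.8: matched despite no shared term
```

With a classic exact-match model this document would score zero for the query "car"; the similarity-aware score retrieves it via the synonym "automobile".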
Multi-objective resource selection in distributed information retrieval
In a Distributed Information Retrieval system, a user submits a query to a broker, which determines how to retrieve a given number of documents from all possible resource servers. In this paper, we propose a multi-objective model for this resource selection task. In this model, four aspects are considered simultaneously in the choice of a resource: the documents' relevance to the given query, time, monetary cost, and the similarity between resources. An optimised solution is achieved by comparing the performance of all possible candidates. Some variations of the basic model are also given, which improve the basic model's efficiency.
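A hedged sketch of what multi-objective resource scoring can look like follows; the weighted-sum aggregation, the weights, and the field names are illustrative assumptions, not the paper's actual model. Each candidate resource is scored on the four aspects named above and the broker picks the best aggregate:

```python
def select(resources, weights=(0.5, 0.2, 0.2, 0.1)):
    w_rel, w_time, w_cost, w_div = weights
    def utility(r):
        # Relevance and diversity are benefits; time and monetary cost are penalties.
        return (w_rel * r["relevance"] - w_time * r["time"]
                - w_cost * r["cost"] + w_div * (1.0 - r["similarity"]))
    return max(resources, key=utility)

candidates = [
    {"name": "A", "relevance": 0.9, "time": 0.3, "cost": 0.5, "similarity": 0.7},
    {"name": "B", "relevance": 0.8, "time": 0.1, "cost": 0.2, "similarity": 0.4},
]
print(select(candidates)["name"])  # "B": slightly less relevant, but cheaper and more diverse
```

The point of the multi-objective view is visible in the example: the most relevant resource is not selected once time, cost, and redundancy with other resources are weighed in.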
Probabilistic learning for selective dissemination of information
New methods and new systems are needed to filter or selectively distribute the ever-increasing volume of electronic information being produced. An effective information filtering system is one that provides exactly the information that fulfils a user's interests, with minimal effort required from the user to describe those interests. Such a system will also have to adapt to the user's changing interests. In this paper we describe and evaluate a learning model for information filtering which is an adaptation of the generalised probabilistic model of information retrieval. The model is based on the concept of 'uncertainty sampling', a technique that allows for relevance feedback on both relevant and non-relevant documents. The proposed learning model is the core of a prototype information filtering system called ProFile.
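The following is a minimal sketch of uncertainty sampling in a filtering setting, illustrative only and not ProFile's actual algorithm. The filter asks the user to judge the documents it is least certain about, i.e. those whose estimated probability of relevance lies closest to 0.5, so feedback naturally covers both relevant and non-relevant examples:

```python
def most_uncertain(docs, p_relevant, k=3):
    """docs: iterable of documents; p_relevant(doc) -> estimated P(relevant)."""
    return sorted(docs, key=lambda d: abs(p_relevant(d) - 0.5))[:k]

# Assumed toy estimates of P(relevant) for five incoming documents.
estimates = {"d1": 0.95, "d2": 0.52, "d3": 0.10, "d4": 0.48, "d5": 0.70}
print(most_uncertain(estimates, estimates.get, k=2))  # ['d2', 'd4']
```

Documents the model is already confident about (d1, d3) are filtered without bothering the user; the ambiguous ones are where a judgment teaches the model most.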
User centred evaluation of an automatically constructed hyper-textbook
As hypertext systems become widely available and their popularity increases, attention has turned to converting existing textual documents into hypertextual form. An important issue in this area is the fully automatic production of hypertext for learning, teaching, training, or self-referencing. Although many studies have addressed the problem of producing hyper-books, either manually or semi-automatically, the actual usability of hyper-book tools is still an area of ongoing research. This article presents an effort to investigate the effectiveness of a hyper-textbook for self-referencing produced in a fully automatic way. The hyper-textbook is produced using the Hyper-TextBook methodology. We developed a task-based evaluation scheme and performed a comparative user-centred evaluation between the hyper-textbook and a conventional, printed form of the same textbook. The results indicate that the hyper-textbook, in most cases, improves speed, accuracy, and user satisfaction in comparison to the printed form of the textbook.
Users' perception of relevance of spoken documents
We present the results of a study of users' perception of the relevance of documents. The aim is to study experimentally how users' perception varies depending on the form in which retrieved documents are presented. Documents retrieved in response to a query are presented to users in a variety of ways, from full text to a machine-spoken, automatically generated, query-biased summary, and the difference in users' perception of relevance is studied. The experimental results suggest that the effectiveness of advanced multimedia information retrieval applications may be affected by the low level of users' perception of the relevance of retrieved documents.
Experiments with document archive size detection
The size of a document archive is a very important parameter for resource selection in distributed information retrieval systems. In this paper, we present a method for automatically detecting the size (i.e. the number of documents) of a document archive, in case the archive itself does not provide such information. In addition, a method for detecting incremental change in the archive size is also presented, which can be useful for deciding whether a resource description has become obsolete and needs to be regenerated. An experimental evaluation of these methods shows that they provide quite accurate information.
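The paper's detection method is not reproduced here; as a hedged illustration of the general idea of estimating an archive's size without the archive's cooperation, the classic capture-recapture (Lincoln-Petersen) estimator infers the size from the overlap between two independent samples of document identifiers:

```python
def estimate_archive_size(sample_a, sample_b):
    overlap = len(set(sample_a) & set(sample_b))
    if overlap == 0:
        raise ValueError("no overlap: samples too small to estimate size")
    return len(set(sample_a)) * len(set(sample_b)) // overlap

# Two samples of 100 distinct document ids that happen to share 10 documents
# suggest roughly 100 * 100 / 10 = 1000 documents in the archive.
a = list(range(0, 100))     # ids 0..99
b = list(range(90, 190))    # ids 90..189, overlapping on 10 ids
print(estimate_archive_size(a, b))  # 1000
```

The intuition is that the rarer the overlap between independent samples, the larger the underlying archive must be.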
Towards better measures: evaluation of estimated resource description quality for distributed IR
An open problem for Distributed Information Retrieval (DIR) systems is how to represent large document repositories, also known as resources, both accurately and efficiently. Obtaining resource description estimates is an important phase in DIR, especially in non-cooperative environments. Measuring the quality of an estimated resource description is a contentious issue, as current measures do not provide an adequate indication of quality. In this paper, we provide an overview of the currently applied measures of resource description quality before proposing the Kullback-Leibler (KL) divergence as an alternative. Through experimentation we illustrate the shortcomings of these past measures, whilst providing evidence that KL is a more appropriate measure of quality. When applying KL to compare different Query-Based Sampling (QBS) algorithms, our experiments provide strong evidence in favour of a previously unsupported hypothesis originally posited in the initial Query-Based Sampling work.
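A short sketch of the proposed measure follows: the KL divergence of the actual resource's term distribution P from the estimated one Q. The add-epsilon smoothing used here is an assumption of this sketch (some smoothing is needed to keep KL finite for terms the estimate never saw), not necessarily the paper's choice:

```python
import math
from collections import Counter

def term_distribution(terms, vocab, eps=1e-6):
    tf = Counter(terms)
    total = sum(tf.values()) + eps * len(vocab)
    return {t: (tf[t] + eps) / total for t in vocab}

def kl_divergence(actual_terms, estimated_terms):
    vocab = set(actual_terms) | set(estimated_terms)
    p = term_distribution(actual_terms, vocab)
    q = term_distribution(estimated_terms, vocab)
    return sum(p[t] * math.log(p[t] / q[t]) for t in vocab)

actual = ["retrieval"] * 5 + ["index"] * 3 + ["query"] * 2
good_estimate = ["retrieval"] * 4 + ["index"] * 3 + ["query"] * 3
poor_estimate = ["retrieval"] * 1 + ["web"] * 9
print(kl_divergence(actual, good_estimate) < kl_divergence(actual, poor_estimate))  # True
```

Lower divergence means the estimated description's term probabilities track the actual resource more closely, which is exactly what resource selection needs.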
Automatically attaching web pages to an ontology
This paper describes a proposed system for automatically attaching material from the World Wide Web to concepts in an ontology. The motivation for this research stems from the Diogene project, which requires the project's own databases of learning objects to be augmented with additional resources from the web. Two main approaches to this problem are being taken: one using ontology mapping, and another, covered in this paper, based on the conventional text search facilities of the web. By generating queries based on the concepts in the ontology, the aim is to retrieve material from the web and then filter it to ensure its proper correspondence with a concept. The Diogene system is briefly outlined before the query-generation system is described. A small pilot experiment, designed to provide some initial results and insight into the problem, is then presented.
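As a hedged sketch of concept-driven query generation (Diogene's actual query builder is not reproduced; the labels and structure here are assumptions), a concept's own label can be combined with its ancestors' labels to disambiguate short terms before the query is sent to a web search engine:

```python
def build_query(concept, parents):
    # Quote multi-word labels so search engines treat them as phrases.
    quote = lambda s: f'"{s}"' if " " in s else s
    return " ".join(quote(label) for label in [concept, *parents])

print(build_query("inheritance", ["object-oriented programming", "Java"]))
# inheritance "object-oriented programming" Java
```

Adding parent-concept context is one simple way to keep an ambiguous label like "inheritance" from retrieving, say, legal documents.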
Ontology mapping by concept similarity
This paper presents an approach to the problem of mapping ontologies. The motivation for the research stems from the Diogene project, which is developing a web training environment for ICT professionals. The system includes high-quality training material from registered content providers, and free web material will also be made available through the project's "Web Discovery" component. This involves using web search engines to locate relevant material and mapping the ontology at the core of the Diogene system to other ontologies that exist on the Semantic Web. The project's approach to ontology mapping is presented, and an evaluation of this method is described.
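An illustrative sketch only: the paper's concept-similarity measure is not given here, so Jaccard overlap between the bags of words describing two concepts stands in for it, and the threshold is an assumed parameter. Each source concept is mapped to the most similar concept in the foreign ontology, or left unmapped if nothing is similar enough:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def map_concept(source_terms, target_ontology, threshold=0.3):
    best = max(target_ontology.items(), key=lambda kv: jaccard(source_terms, kv[1]))
    return best[0] if jaccard(source_terms, best[1]) >= threshold else None

target = {
    "Programming Languages": ["programming", "language", "syntax"],
    "Databases": ["database", "sql", "storage"],
}
print(map_concept(["programming", "language", "java"], target))  # Programming Languages
```

The threshold matters: mapping every concept to its nearest neighbour regardless of similarity would silently produce spurious correspondences.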
An evaluation of resource description quality measures
An open problem for Distributed Information Retrieval is how to represent large document repositories (known as resources) efficiently. To facilitate resource selection, estimated descriptions of each resource are required, especially when faced with non-cooperative distributed environments. Accurate and efficient resource description estimation is required, as it can affect resource selection and, as a consequence, retrieval quality. Query-Based Sampling (QBS) has been proposed as a novel solution for resource estimation, with succeeding techniques developed thereafter. However, determining whether one QBS technique generates better resource descriptions than another is still an unresolved issue. The initial metrics tested and deployed for measuring resource description quality were the Collection Term Frequency ratio (CTF) and the Spearman Rank Correlation Coefficient (SRCC). The former provides an indication of the percentage of terms seen, whilst the latter measures the term ranking order; neither considers term frequency, which is important for resource selection. We re-examine this problem and consider measuring the quality of a resource description in the context of resource selection, where an estimate of the probability of a term given the resource is typically required. We believe a natural measure for comparing the estimated resource against the actual resource is the Kullback-Leibler (KL) divergence. KL addresses the concerns put forward previously by not over-representing low-frequency terms, while also considering term order. In this paper, we re-assess the two previous measures alongside KL. Our preliminary investigation revealed that the former metrics display contradictory results, whilst KL suggested that a QBS technique different from the one originally prescribed would provide better estimates. This is a significant result, because it now remains unclear which technique will consistently provide better resource descriptions. The remainder of this paper details the three measures and the experimental analysis of our preliminary study, and outlines our points of concern along with further research directions.
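For reference, here are hedged sketches of the two earlier measures discussed above (the normalisation details are assumptions of this sketch). The CTF ratio is the fraction of the actual collection's term occurrences covered by terms the estimate has seen; SRCC compares the frequency rankings of the shared vocabulary. Neither looks at the estimated frequencies themselves, which is the shortcoming the paper notes:

```python
from collections import Counter

def ctf_ratio(actual_terms, estimated_terms):
    actual_tf = Counter(actual_terms)
    seen = set(estimated_terms)
    return sum(f for t, f in actual_tf.items() if t in seen) / sum(actual_tf.values())

def spearman_rank(actual_terms, estimated_terms):
    # Rank the shared vocabulary by frequency in each description.
    def ranks(terms):
        ordered = [t for t, _ in Counter(terms).most_common()]
        return {t: r for r, t in enumerate(ordered)}
    ra, re = ranks(actual_terms), ranks(estimated_terms)
    shared = [t for t in ra if t in re]
    n = len(shared)
    if n < 2:
        return 1.0
    d2 = sum((ra[t] - re[t]) ** 2 for t in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))

actual = ["ir"] * 5 + ["web"] * 3 + ["rank"] * 2
estimate = ["ir"] * 2 + ["web"] * 1
print(round(ctf_ratio(actual, estimate), 2))  # 0.8: 8 of 10 occurrences covered
print(spearman_rank(actual, estimate))        # 1.0: shared terms ranked identically
```

The example shows why these measures can mislead: a tiny sample scores well on both even though its estimated term probabilities are far from the actual distribution, which is what KL penalises.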
- …