
    A survey on the use of relevance feedback for information access systems

    Users of online search engines often find it difficult to express their need for information in the form of a query. However, if the user can identify examples of the kind of documents they require, then they can employ a technique known as relevance feedback. Relevance feedback covers a range of techniques intended to improve a user's query and facilitate retrieval of information relevant to a user's information need. In this paper we survey relevance feedback techniques. We study both automatic techniques, in which the system modifies the user's query, and interactive techniques, in which the user has control over query modification. We also consider specific interfaces to relevance feedback systems and characteristics of searchers that can affect the use and success of relevance feedback systems.
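    As an illustration of the automatic query modification mentioned above, the following is a minimal sketch of Rocchio-style relevance feedback, one well-known technique of this kind. The abstract does not name it, so its use here is an assumption, as are the function name `rocchio_update`, the sparse term-weight dictionaries, and the default values of `alpha`, `beta`, and `gamma`.

```python
from collections import Counter


def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query as a dict mapping term -> weight.

    query:        dict term -> weight for the original query
    relevant:     list of dicts (documents the user marked relevant)
    non_relevant: list of dicts (documents the user marked non-relevant)
    The alpha/beta/gamma defaults are illustrative assumptions.
    """
    updated = Counter()
    # Keep the original query, scaled by alpha.
    for term, weight in query.items():
        updated[term] += alpha * weight
    # Move the query towards the mean of the relevant documents.
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)
    # Move the query away from the mean of the non-relevant documents.
    for doc in non_relevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(non_relevant)
    # Terms whose weight is not positive are dropped from the new query.
    return {term: weight for term, weight in updated.items() if weight > 0}


if __name__ == "__main__":
    # Hypothetical toy data: an ambiguous one-word query and two judged documents.
    new_query = rocchio_update(
        {"jaguar": 1.0},
        relevant=[{"jaguar": 1.0, "cat": 1.0}],
        non_relevant=[{"jaguar": 1.0, "car": 1.0}],
    )
    print(new_query)
```

    In this toy run the modified query gains the term "cat" from the relevant example, while "car" ends with a negative weight and is dropped. An interactive variant would instead show candidate terms to the user and let them choose which to add.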

    Using biased support vector machine in image retrieval with self-organizing map.

    Chan Chi Hang. Thesis submitted in: August 2004. Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. Includes bibliographical references (leaves 105-114). Abstracts in English and Chinese.

    Contents:
    Abstract
    Acknowledgement
    Chapter 1: Introduction
        1.1 Problem Statement
        1.2 Major Contributions
        1.3 Publication List
        1.4 Thesis Organization
    Chapter 2: Background Survey
        2.1 Relevance Feedback Framework
            2.1.1 Relevance Feedback Types
            2.1.2 Data Distribution
            2.1.3 Training Set Size
            2.1.4 Inter-Query Learning and Intra-Query Learning
        2.2 History of Relevance Feedback Techniques
        2.3 Relevance Feedback Approaches
            2.3.1 Vector Space Model
            2.3.2 Ad-hoc Re-weighting
            2.3.3 Distance Optimization Approach
            2.3.4 Probabilistic Model
            2.3.5 Bayesian Approach
            2.3.6 Density Estimation Approach
            2.3.7 Support Vector Machine
        2.4 Presentation Set Selection
            2.4.1 Most-probable strategy
            2.4.2 Most-informative strategy
    Chapter 3: Biased Support Vector Machine for Content-Based Image Retrieval
        3.1 Motivation
        3.2 Background
            3.2.1 Regular Support Vector Machine
            3.2.2 One-class Support Vector Machine
        3.3 Biased Support Vector Machine
        3.4 Interpretation of parameters in BSVM
        3.5 Soft Label Biased Support Vector Machine
        3.6 Interpretation of parameters in Soft Label BSVM
        3.7 Relevance Feedback Using Biased Support Vector Machine
            3.7.1 Advantages of BSVM in Relevance Feedback
            3.7.2 Relevance Feedback Algorithm By BSVM
        3.8 Experiments
            3.8.1 Synthetic Dataset
            3.8.2 Real-World Dataset
            3.8.3 Experimental Results
        3.9 Conclusion
    Chapter 4: Self-Organizing Map-based Inter-Query Learning
        4.1 Motivation
        4.2 Algorithm
            4.2.1 Initialization and Replication of SOM
            4.2.2 SOM Training for Inter-Query Learning
            4.2.3 Incorporate with Intra-Query Learning
        4.3 Experiments
            4.3.1 Synthetic Dataset
            4.3.2 Real-World Dataset
            4.3.3 Experimental Results
        4.4 Conclusion
    Chapter 5: Conclusion
    Bibliography

    An empirical analysis of information filtering methods

    The growth in the number of news articles, blogs, images, and videos available on the Web is making it more challenging for people to find potentially useful information. People have relied on search engines to satisfy their short-term needs, such as finding the telephone number for a restaurant; however, these systems have not been designed to support long-term needs, such as the research interests of academics. One approach to supporting long-term needs is to use an Information Filtering system to select potentially useful information from the vast amount being produced every day. The similarities between Information Retrieval systems and Information Filtering systems are well-established. They have prompted the use of retrieval models and methods in filtering systems, which has had some success but has been criticised as a limiting factor due to the unique challenges of document filtering. A significant difference between these systems is the use case: a filtering system is intended to push information to the user over a period of time, whereas a retrieval system is intended for the user to pull information to themselves for immediate use. The main challenge that needs to be addressed by a filtering system is the transient nature of the information published on the Web and the drifting nature of information needs. These factors lead to an uncertain interplay between the components comprising a filtering system, and this thesis presents an empirical analysis of how the main system components affect performance. The analysis explores the role of each system component independently and in conjunction with other components. The main contribution of this thesis is a deeper understanding of how different components affect performance and the interplay between these components. The outcome of this thesis is intended to act as a guide for both practitioners and researchers interested in overcoming some of the challenges of building filtering systems.

    Probability models for information retrieval based on divergence from randomness

    This thesis devises a novel methodology based on probability theory, suitable for the construction of term-weighting models of Information Retrieval. Our term-weighting functions are created within a general framework made up of three components. Each of the three components is built independently from the others. We obtain the term-weighting functions from the general model in a purely theoretic way, instantiating each component with different probability distribution forms. The thesis begins by investigating the nature of the statistical inference involved in Information Retrieval. We explore the estimation problem underlying the process of sampling. De Finetti’s theorem is used to show how to convert the frequentist approach into Bayesian inference, and we display and employ the derived estimation techniques in the context of Information Retrieval. We initially pay great attention to the construction of the basic sample spaces of Information Retrieval. The notion of single or multiple sampling from different populations in the context of Information Retrieval is extensively discussed and used throughout the thesis. The language modelling approach and the standard probabilistic model are studied under the same foundational view and are experimentally compared to the divergence-from-randomness approach. In revisiting the main information retrieval models in the literature, we show that even the language modelling approach can be exploited to assign term-frequency normalization to the models of divergence from randomness. We finally introduce a novel framework for query expansion. This framework is based on the models of divergence from randomness and it can be applied to arbitrary models of IR, divergence-based, language modelling and probabilistic models included. We have carried out a very large number of experiments, and the results show that the framework generates highly effective Information Retrieval models.

    Text categorization methods for automatic estimation of verbal intelligence

    In this paper we investigate whether conventional text categorization methods may suffice to infer different verbal intelligence levels. This research goal relies on the hypothesis that the vocabulary that speakers make use of reflects their verbal intelligence levels. Automatic verbal intelligence estimation of users in a spoken language dialog system may be useful when defining an optimal dialog strategy by improving its adaptation capabilities. The work is based on a corpus containing descriptions (i.e. monologs) of a short film by test persons with different educational backgrounds and the verbal intelligence scores of the speakers. First, a one-way analysis of variance was performed to compare the monologs with the film transcription and to demonstrate that there are differences in the vocabulary used by the test persons with different verbal intelligence levels. Then, for the classification task, the monologs were represented as feature vectors using the classical TF–IDF weighting scheme. The Naive Bayes, k-nearest neighbors and Rocchio classifiers were tested. In this paper we describe and compare these classification approaches, define the optimal classification parameters and discuss the classification results obtained.
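    The pipeline described above (TF–IDF feature vectors followed by a Rocchio classifier) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the IDF variant (`log(n / df) + 1`), the cosine similarity, the function names, and the toy documents and class labels are all assumptions.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (list of TF-IDF dicts, idf dict)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(n / df) + 1, so terms present in every document keep some weight.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(vectors, labels):
    """Rocchio training: one centroid (mean vector) per class."""
    sums = defaultdict(Counter)
    counts = Counter(labels)
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            sums[label][term] += weight
    return {
        label: {t: w / counts[label] for t, w in total.items()}
        for label, total in sums.items()
    }


def classify(tokens, idf, centroids):
    """Assign the class whose centroid is most similar to the new document."""
    vec = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


if __name__ == "__main__":
    # Hypothetical toy corpus with two arbitrary classes "A" and "B".
    docs = [
        ["man", "walks", "dog"],
        ["man", "feeds", "dog"],
        ["protagonist", "contemplates", "solitude"],
        ["protagonist", "embodies", "solitude"],
    ]
    labels = ["A", "A", "B", "B"]
    vectors, idf = tfidf_vectors(docs)
    centroids = train_centroids(vectors, labels)
    print(classify(["protagonist", "solitude"], idf, centroids))
```

    Naive Bayes and k-nearest neighbors, the other two classifiers named in the abstract, would consume the same TF–IDF vectors (or the underlying term counts) in place of the centroid step.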

    Concept drift learning and its application to adaptive information filtering

    Tracking the evolution of user interests is a problem instance of concept drift learning. Keeping track of multiple interest categories is a natural phenomenon as well as an interesting tracking problem because interests can emerge and diminish at different time frames. The first part of this dissertation presents a Multiple Three-Descriptor Representation (MTDR) algorithm, a novel algorithm for learning concept drift especially built for tracking the dynamics of multiple target concepts in the information filtering domain. The learning process of the algorithm combines the long-term and short-term interest (concept) models in an attempt to benefit from the strength of both models. The MTDR algorithm improves over existing concept drift learning algorithms in the domain. Being able to track multiple target concepts with a few examples poses an even more important and challenging problem because casual users tend to be reluctant to provide the examples needed, and learning from few labeled data is generally difficult. The second part presents a computational Framework for Extending Incomplete Labeled Data Stream (FEILDS). The system modularly extends the capability of an existing concept drift learner in dealing with an incomplete labeled data stream. It expands the learner's original input stream with relevant unlabeled data; the process generates a new stream with improved learnability. FEILDS employs a concept formation system for organizing its input stream into a concept (cluster) hierarchy. The system uses the concept and cluster hierarchy to identify the instance's concept and unlabeled data relevant to a concept. It also adopts the persistence assumption in temporal reasoning for inferring the relevance of concepts. Empirical evaluation indicates that FEILDS is able to improve the performance of existing learners, particularly when learning from a stream with few labeled data. Lastly, a new concept formation algorithm, one of the key components in the FEILDS architecture, is presented. The main idea is to discover intrinsic hierarchical structures regardless of the class distribution and the shape of the input stream. Experimental evaluation shows that the algorithm is relatively robust to input ordering, consistently producing a hierarchy structure of high quality.

    A heuristic information retrieval study : an investigation of methods for enhanced searching of distributed data objects exploiting bidirectional relevance feedback

    A thesis submitted for the degree of Doctor of Philosophy of the University of Luton. The primary aim of this research is to investigate methods of improving the effectiveness of current information retrieval systems. This aim can be achieved by accomplishing numerous supporting objectives. A foundational objective is to introduce a novel bidirectional, symmetrical fuzzy logic theory which may prove valuable to information retrieval, including internet searches of distributed data objects. A further objective is to design, implement and apply the novel theory to an experimental information retrieval system called ANACALYPSE, which automatically computes the relevance of a large number of unseen documents from expert relevance feedback on a small number of documents read. A further objective is to define a methodology used in this work as an experimental information retrieval framework consisting of multiple tables including various formulae which allow a plethora of syntheses of similarity functions, term weights, relative term frequencies, document weights, bidirectional relevance feedback and history-adjusted term weights. The evaluation of bidirectional relevance feedback reveals a better correspondence between system ranking of documents and users' preferences than feedback-free system ranking. The assessment of similarity functions reveals that the Cosine and Jaccard functions perform significantly better than the DotProduct and Overlap functions. The evaluation of history tracking of the documents visited from a root page reveals better system ranking of documents than tracking-free information retrieval. The assessment of stemming reveals that system information retrieval performance remains unaffected, while stop word removal does not appear to be beneficial and can sometimes be harmful. The overall evaluation of the experimental information retrieval system in comparison to a leading-edge commercial information retrieval system, and also in comparison to the expert's golden standard of judged relevance according to established statistical correlation methods, reveals enhanced system information retrieval effectiveness.
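    For readers unfamiliar with the four similarity functions compared above, the sketch below gives their common textbook forms over binary term sets. The thesis's own formulae (which may use weighted terms) are not reproduced in the abstract, so these definitions and the function names are assumptions made for illustration.

```python
import math


def dot_product(a, b):
    """Number of shared terms (dot product of binary term vectors)."""
    return len(a & b)


def cosine(a, b):
    """Shared terms normalised by the geometric mean of the set sizes."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


def jaccard(a, b):
    """Shared terms divided by the number of distinct terms in either set."""
    return len(a & b) / len(a | b) if a or b else 0.0


def overlap(a, b):
    """Shared terms divided by the size of the smaller set."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


if __name__ == "__main__":
    # Hypothetical query and document term sets.
    query = {"fuzzy", "logic", "retrieval"}
    document = {"logic", "retrieval", "feedback", "ranking"}
    for fn in (dot_product, cosine, jaccard, overlap):
        print(fn.__name__, round(fn(query, document), 3))
```

    One visible difference is normalisation: DotProduct is unbounded and grows with document length, and Overlap reaches 1 whenever the smaller set is contained in the larger, whereas Cosine and Jaccard penalise size mismatches. Whether that accounts for the ranking reported in the thesis is not stated in the abstract.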

    Recommender systems in industrial contexts

    This thesis consists of four parts:
    - An analysis of the core functions and the prerequisites for recommender systems in an industrial context: we identify four core functions for recommendation systems: Help to Decide, Help to Compare, Help to Explore, Help to Discover. The implementation of these functions has implications for the choices at the heart of algorithmic recommender systems.
    - A state of the art, which deals with the main techniques used in automated recommendation systems: the two most commonly used algorithmic methods, the K-Nearest-Neighbor methods (KNN) and the fast factorization methods, are detailed. The state of the art also presents purely content-based methods, hybridization techniques, and the classical performance metrics used to evaluate recommender systems. This state of the art then gives an overview of several systems, both from academia and industry (Amazon, Google ...).
    - An analysis of the performance and implications of a recommendation system developed during this thesis: this system, Reperio, is a hybrid recommender engine using KNN methods. We study the performance of the KNN methods, including the impact of the similarity functions used. Then we study the performance of the KNN method in critical use cases in cold-start situations.
    - A methodology for analyzing the performance of recommender systems in an industrial context: this methodology assesses the added value of algorithmic strategies and recommendation systems according to their core functions.
    Comment: version 3.30, May 201
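    To make the KNN approach mentioned above concrete, here is a minimal sketch of user-based K-Nearest-Neighbor rating prediction with cosine similarity. It is not Reperio's algorithm, which the abstract does not describe; the function names, the similarity choice, the weighted-mean prediction rule, and the toy ratings are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two users' rating dicts (item -> rating)."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(ratings, user, item, k=2):
    """Predict a rating as the similarity-weighted mean over the k most
    similar users who rated the item. Returns None when no neighbour has
    positive similarity, for example for a user with no ratings (cold start).
    """
    neighbours = [
        (cosine(ratings[user], other), other[item])
        for name, other in ratings.items()
        if name != user and item in other
    ]
    top = sorted(neighbours, reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None
    return sum(sim * rating for sim, rating in top) / total


if __name__ == "__main__":
    # Hypothetical ratings on a 1-5 scale.
    ratings = {
        "ann": {"x": 5, "y": 1},
        "bob": {"x": 5, "y": 1, "z": 4},
        "cy": {"x": 1, "y": 5, "z": 2},
        "new": {},  # cold-start user with no history
    }
    print(predict(ratings, "ann", "z", k=1))
    print(predict(ratings, "new", "z"))
```

    Swapping `cosine` for another similarity function is one way to study the impact of the similarity choice. The `None` result for the user without ratings shows why cold-start cases need a fallback, such as a content-based or popularity-based strategy.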