9 research outputs found

    SemRank: ranking refinement strategy by using the semantic intensity

    The ubiquity of multimedia has created a need for systems that can store, manage, and structure multimedia data so that it can be retrieved intelligently. One of the current issues in media management and data mining research is the ranking of retrieved documents. Ranking is a challenging problem for information retrieval systems: a user query may return millions of relevant results, but if the ranking function cannot order them by relevance, the results are of little use. Current ranking techniques, however, operate at the level of keyword matching, and the results are usually ranked by term frequency. This paper is concerned with ranking documents by relying on the rich semantics inside a document rather than on its surface content. Our proposed ranking refinement strategy, SemRank, ranks documents by their semantic intensity. The approach has been applied to the open benchmark LabelMe dataset and compared against a well-known ranking model, the Vector Space Model (VSM). The experimental results show that our approach achieves a significant improvement in retrieval performance over state-of-the-art ranking methods.
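The abstract does not give SemRank's scoring formula, so the following is only a toy illustration of the idea of ranking by concept dominance: each image carries concept annotations with a hypothetical dominance weight (standing in for Semantic Intensity), and images are ranked by the summed weights of the query's concepts.

```python
def semantic_intensity_score(query_concepts, doc_concepts):
    """Sum the dominance weights of the query concepts present in the image."""
    return sum(doc_concepts.get(c, 0.0) for c in query_concepts)

def sem_rank(query_concepts, corpus):
    """Rank (image_id, concept-weight dict) entries by descending score."""
    scored = [(image_id, semantic_intensity_score(query_concepts, concepts))
              for image_id, concepts in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Under this sketch, an image in which "car" is the dominant concept would outrank one where "car" is merely present, even if both match the keyword.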

    Intent-aware search result diversification

    Search result diversification has gained momentum as a way to tackle ambiguous queries. An effective approach to this problem is to explicitly model the possible aspects underlying a query, in order to maximise the estimated relevance of the retrieved documents with respect to the different aspects. However, such aspects themselves may represent information needs with rather distinct intents (e.g., informational or navigational). Hence, a diverse ranking could benefit from applying intent-aware retrieval models when estimating the relevance of documents to different aspects. In this paper, we propose to diversify the results retrieved for a given query, by learning the appropriateness of different retrieval models for each of the aspects underlying this query. Thorough experiments within the evaluation framework provided by the diversity task of the TREC 2009 and 2010 Web tracks show that the proposed approach can significantly improve state-of-the-art diversification approaches.
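The explicit aspect-based diversification this abstract builds on can be sketched as a greedy xQuAD-style re-ranking: at each step, pick the document with the best mix of relevance and coverage of still-unsatisfied aspects. All probabilities below are toy inputs, not the paper's learned intent-aware estimates.

```python
def xquad(candidates, rel, aspect_prob, aspect_rel, lam=0.5, k=10):
    """Greedily build a diversified ranking of up to k documents.

    rel[d]           -- relevance of document d to the query
    aspect_prob[a]   -- importance of aspect a for the query
    aspect_rel[a][d] -- relevance of document d to aspect a
    lam              -- trade-off between relevance and diversity
    """
    selected, remaining = [], list(candidates)
    # not_covered[a]: probability aspect a is still unsatisfied by `selected`
    not_covered = {a: 1.0 for a in aspect_prob}
    while remaining and len(selected) < k:
        def gain(d):
            diversity = sum(aspect_prob[a] * aspect_rel[a].get(d, 0.0) * not_covered[a]
                            for a in aspect_prob)
            return (1 - lam) * rel[d] + lam * diversity
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
        for a in aspect_prob:
            not_covered[a] *= 1 - aspect_rel[a].get(best, 0.0)
    return selected
```

A document covering an aspect no selected document has touched yet can leapfrog a more relevant but redundant one, which is the intended behaviour.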

    A cross-benchmark comparison of 87 learning to rank methods

    Learning to rank is an increasingly important scientific field that comprises the use of machine learning for the ranking task. New learning to rank methods are generally evaluated on benchmark test collections. However, comparison of learning to rank methods based on evaluation results is hindered by the absence of a standard set of evaluation benchmark collections. In this paper we propose a way to compare learning to rank methods based on a sparse set of evaluation results on a set of benchmark datasets. Our comparison methodology consists of two components: (1) Normalized Winning Number, which gives insight into the ranking accuracy of a learning to rank method, and (2) Ideal Winning Number, which gives insight into the degree of certainty concerning its ranking accuracy. Evaluation results of 87 learning to rank methods on 20 well-known benchmark datasets are collected through a structured literature search. ListNet, SmoothRank, FenchelRank, FSMRank, LRUF and LARF are Pareto optimal learning to rank methods in the Normalized Winning Number and Ideal Winning Number dimensions, listed in increasing order of Normalized Winning Number and decreasing order of Ideal Winning Number.
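The two comparison measures can be sketched roughly as follows, assuming evaluation scores are held in a sparse method-by-dataset table; the paper's exact definitions may differ in detail. A method's Winning Number counts pairwise wins over other methods on datasets where both were evaluated; its Ideal Winning Number counts all such possible comparisons; their ratio is the Normalized Winning Number.

```python
def winning_numbers(scores):
    """Return {method: (winning_number, ideal_winning_number)}.

    scores: {method: {dataset: evaluation score}} -- a sparse table,
    since not every method is evaluated on every dataset.
    """
    result = {}
    for m, m_scores in scores.items():
        wins = possible = 0
        for other, o_scores in scores.items():
            if other == m:
                continue
            for dataset, s in m_scores.items():
                if dataset in o_scores:  # comparable only if both were evaluated here
                    possible += 1
                    if s > o_scores[dataset]:
                        wins += 1
        result[m] = (wins, possible)
    return result

def normalized_winning_number(scores):
    """Winning number divided by the ideal (maximum possible) winning number."""
    return {m: (w / p if p else 0.0) for m, (w, p) in winning_numbers(scores).items()}
```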

    Expansion sĂ©lective de requĂȘtes par apprentissage

    While automatic query expansion improves retrieval quality on average, it can degrade it for some queries. Several lines of work have therefore developed selective approaches that choose the retrieval or expansion function depending on the query. Most selective approaches rely on learning from features of past queries and the performance obtained on them. This article presents a new selective expansion method based on query difficulty predictors, both linguistic and statistical. The decision model is learned by an SVM. We demonstrate the effectiveness of the method on standard TREC collections. The learned models classified the test queries with more than 90% accuracy. Moreover, MAP is improved by more than 11% compared to non-selective methods.
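The selective pipeline described above, with difficulty predictors feeding a learned per-query decision, might be sketched as follows. The paper trains an SVM; to keep this sketch dependency-free, a tiny nearest-centroid classifier stands in for it, and the feature vectors are purely illustrative.

```python
def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def train(features, labels):
    """labels[i] is True if expansion helped training query i."""
    expand = [f for f, y in zip(features, labels) if y]
    keep = [f for f, y in zip(features, labels) if not y]
    return centroid(expand), centroid(keep)

def should_expand(model, query_features):
    """Decide, per query, whether to apply expansion: pick the closer centroid."""
    c_expand, c_keep = model
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(query_features, c))
    return dist(c_expand) < dist(c_keep)
```

In the real system the features would be the linguistic and statistical difficulty predictors, and the stand-in classifier would be replaced by the SVM.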

    Learning to select for information retrieval

    The effective ranking of documents in search engines is based on various document features, such as the frequency of the query terms in each document, the document's length, or its authoritativeness. In order to obtain better retrieval performance, rather than relying on a single feature or a few features, there is a growing trend toward creating a ranking function by applying a learning to rank technique to a large set of features. Learning to rank techniques aim to generate an effective document ranking function by combining a large number of document features. Different ranking functions can be generated by using different learning to rank techniques or different document feature sets. While the generated ranking function may be uniformly applied to all queries, several studies have shown that different ranking functions favour different queries, and that the retrieval performance can be significantly enhanced if an appropriate ranking function is selected for each individual query. This thesis proposes Learning to Select (LTS), a novel framework that selectively applies an appropriate ranking function on a per-query basis, regardless of the given query's type and the number of candidate ranking functions. In the learning to select framework, the effectiveness of a ranking function for an unseen query is estimated from the available neighbouring training queries. The proposed framework employs a classification technique (e.g. k-nearest neighbour) to identify neighbouring training queries for an unseen query by using a query feature. In particular, a divergence measure (e.g. Jensen-Shannon), which determines the extent to which a document ranking function alters the scores of an initial ranking of documents for a given query, is proposed for use as a query feature. The ranking function which performs the best on the identified training query set is then chosen for the unseen query.
The proposed framework is thoroughly evaluated on two different TREC retrieval tasks (namely, the Web search and ad hoc search tasks) and on two large standard LETOR feature sets, which contain as many as 64 document features, deriving conclusions concerning the key components of LTS, namely the query feature and the identification of neighbouring queries. Two different types of experiments are conducted. The first is to select an appropriate ranking function from a number of candidate ranking functions. The second is to select multiple appropriate document features, from a number of candidates, for building a ranking function. Experimental results show that our proposed LTS framework is effective both in selecting an appropriate ranking function and in selecting multiple appropriate document features on a per-query basis. In addition, the retrieval performance is further enhanced when the number of candidates increases, suggesting the robustness of the learning to select framework. This thesis also demonstrates how the LTS framework can be deployed in other search applications. These applications include the selective integration of a query-independent feature into a document weighting scheme (e.g. BM25), the selective estimation of the relative importance of different query aspects in a search diversification task (whose goal is to retrieve a ranked list of documents that provides maximum coverage for a given query while avoiding excessive redundancy), and the selective application of an appropriate resource for expanding and enriching a given query for document search within an enterprise. The effectiveness of the LTS framework is observed across these search applications and on different collections, including a large-scale Web collection that contains over 50 million documents. This suggests the generality of the proposed learning to select framework.
The main contributions of this thesis are the introduction of the LTS framework and the proposed use of divergence measures as query features for identifying similar queries. In addition, this thesis draws insights from a large set of experiments involving four different standard collections, four different search tasks and large document feature sets, illustrating the effectiveness, robustness and generality of the LTS framework in tackling various retrieval applications.
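The core selection step of the LTS framework, finding neighbouring training queries via a query feature and then picking the ranking function that performed best on them, can be sketched as follows; the names and data layout are illustrative, and the thesis's divergence-based feature is abstracted into a single scalar per query.

```python
def select_ranker(query_feature, train_features, train_perf, k=3):
    """Pick a ranking function for an unseen query.

    query_feature    -- scalar feature of the unseen query (e.g. a divergence value)
    train_features[i]-- the same feature for training query i
    train_perf[i]    -- {ranker_name: effectiveness (e.g. MAP) on training query i}
    """
    # k nearest training queries in the query-feature space
    order = sorted(range(len(train_features)),
                   key=lambda i: abs(train_features[i] - query_feature))
    neighbours = order[:k]
    # average effectiveness of each candidate ranker over the neighbours
    rankers = train_perf[0].keys()
    avg = {r: sum(train_perf[i][r] for i in neighbours) / len(neighbours)
           for r in rankers}
    return max(avg, key=avg.get)
```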

    Semantic multimedia modelling & interpretation for search & retrieval

    The rapid spread of multimedia-equipped devices has culminated in a proliferation of image and video data. Owing to this omnipresence, these data have become part of our daily life, and the rate of data production now outstrips our capacity to make sense of what is acquired: one of the most prevalent problems of the digital era is information overload. Until now, progress in image and video retrieval research has achieved only limited success, owing to its interpretation of images and videos in terms of primitive features, whereas humans generally access multimedia assets in terms of semantic concepts. The retrieval of digital images and videos is impeded by the semantic gap: the discrepancy between a user's high-level interpretation of an image and the information that can be extracted from the image's physical properties. Content-based image and video retrieval systems are particularly vulnerable to the semantic gap because of their dependence on low-level visual features for describing image content. The semantic gap can be narrowed by including high-level features, since high-level descriptions of images and videos are more capable of capturing the semantic meaning of the content. It is generally understood that the problem of image and video retrieval is still far from being solved. This thesis proposes an approach to intelligent multimedia semantic extraction for search and retrieval, intended to bridge the gap between visual features and semantics. It proposes a Semantic Query Interpreter (SQI) for images and videos, which selects the pertinent terms from the user query and analyses them lexically and semantically. The proposed SQI reduces the semantic as well as the vocabulary gap between users and the machine.
This thesis also explores a novel ranking strategy for image search and retrieval. SemRank is a novel system that incorporates Semantic Intensity (SI) in exploring the semantic relevancy between the user query and the available data. Semantic Intensity captures the concept dominance factor of an image: an image is a combination of various concepts, and some of them are more dominant than the others. SemRank ranks the retrieved images on the basis of Semantic Intensity. The investigations are made on the LabelMe image and LabelMe video datasets. Experiments show that the proposed approach is successful in bridging the semantic gap and that the proposed system outperforms traditional image retrieval systems.

    Adaptation des systĂšmes de recherche d'information aux contextes : le cas des requĂȘtes difficiles

    The field of information retrieval (IR) studies the mechanisms for finding relevant information in one or more document collections in order to satisfy an information need. For an Information Retrieval System (IRS), the information to find is represented by "documents" and the information need takes the form of a "query" formulated by the user. IRS performance depends on the query. Queries for which the IRS fails (few or no relevant documents retrieved) are called "difficult queries" in the literature. This difficulty may be caused by term ambiguity, unclear query formulation, lack of context for the information need, the nature and structure of the document collection, etc. This thesis aims at adapting information retrieval systems to contexts, particularly in the case of difficult queries.
The manuscript is organized into five main chapters, besides the acknowledgements, general introduction, and conclusions and perspectives. The first chapter is an introduction to IR. We develop the concept of relevance, the retrieval models from the literature, the query expansion models, and the evaluation framework employed to validate our proposals. Each of the following chapters presents one of our contributions: every chapter states the research problem, reviews the related work, and presents our theoretical proposals and their validation on benchmark collections. In chapter two, we present our research on handling ambiguous queries. Query term ambiguity can indeed lead to poor document retrieval by the search engine. In the related work, the disambiguation methods that yield good performance are supervised; however, such methods are not applicable in a real IR context, as they require information that is normally unavailable. Moreover, in the literature, term disambiguation for IR is reported to be suboptimal. In this context, we propose an unsupervised query disambiguation method and show its effectiveness. Our approach is interdisciplinary, between the fields of natural language processing and IR. The goal of our unsupervised disambiguation method is to give more importance to the documents retrieved by the search engine that contain the query terms with the specific meanings identified by disambiguation. The resulting re-ranking provides a new document list that contains more documents potentially relevant to the user. We tested this document re-ranking method after disambiguation using two different classification techniques (NaĂŻve Bayes [Chifu and Ionescu, 2012] and spectral clustering [Chifu et al., 2015]), over three document collections and queries from the TREC competition (TREC7, TREC8, WT10G).
We have shown that the disambiguation method works well in the case of poorly performing queries (a 7.9% improvement compared to state-of-the-art methods). In chapter three, we present our work on query difficulty prediction. Indeed, while ambiguity is a difficulty factor, it is not the only one. We completed the range of difficulty predictors by building on the state of the art. Existing predictors are not sufficiently effective, and we therefore introduce new difficulty prediction measures that combine predictors. We also propose a robust method to evaluate query difficulty predictors. Using predictor combinations on the TREC7 and TREC8 collections, we obtain an improvement of 7.1% in prediction quality compared to the state of the art [Chifu, 2013]. In the fourth chapter we focus on the application of difficulty predictors. Specifically, we propose a selective IR approach, in which predictors are employed to decide which search engine, among several, would perform best for a given query. The decision model is learned by an SVM (Support Vector Machine). We tested our model on TREC benchmark collections (Robust, WT10G, GOV2). The learned models classified the test queries with over 90% accuracy, and retrieval results were improved by more than 11% in terms of performance compared to non-selective methods [Chifu and Mothe, 2014]. In the last chapter, we treat an important issue in IR: query expansion by adding terms. It is very difficult to predict the expansion parameters or to anticipate whether a query needs expansion at all. We present our contribution to optimizing, per query, the lambda parameter of RM3 (a pseudo-relevance feedback model for query expansion). We tested several hypotheses, both with and without prior information, searching for the minimum amount of information necessary for the optimization of the expansion parameter to be possible.
The results are not satisfactory, even though we used a wide range of methods such as SVMs, regression, logistic regression and similarity measures; these findings reinforce the conclusion that this optimization problem is hard. Part of this research was conducted during a three-month research visit to the Technion in Haifa, Israel, in 2013, and continued afterwards in collaboration with the Technion team, working with Professor Oren Kurland and PhD student Anna Shtok. In conclusion, this thesis proposes new methods to improve the performance of IRSs by exploiting query difficulty. The results of the methods proposed in chapters two, three and four show significant improvements and open perspectives for future research. The analysis in chapter five confirms the difficulty of the parameter optimization problem and encourages further investigation of selective query expansion settings.
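For context, the RM3 lambda parameter that the final chapter tries to optimize controls the interpolation between the original query model and the relevance model estimated from feedback documents. A minimal sketch of that interpolation (the relevance-model estimation itself is assumed given):

```python
def mle(query_terms):
    """Maximum-likelihood unigram model of the original query."""
    n = len(query_terms)
    model = {}
    for t in query_terms:
        model[t] = model.get(t, 0.0) + 1.0 / n
    return model

def rm3(query_terms, relevance_model, lam):
    """RM3 expanded query model: P(w|q') = lam*P_mle(w|q) + (1-lam)*P(w|RM1)."""
    q = mle(query_terms)
    vocab = set(q) | set(relevance_model)
    return {w: lam * q.get(w, 0.0) + (1 - lam) * relevance_model.get(w, 0.0)
            for w in vocab}
```

With lam close to 1 the expansion is effectively disabled, which is why choosing lam per query amounts to deciding whether, and how strongly, a query should be expanded.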

    Transfer learning for information retrieval

    The lack of relevance labels is an increasing challenge and presents a bottleneck in the training of reliable learning-to-rank (L2R) models. Obtaining relevance labels through human judgment is expensive and in some scenarios impossible. Previous research has studied different approaches to the problem, including generating relevance labels by crowdsourcing and active learning. Recent studies have started to find ways to reuse knowledge from a related collection to help ranking in a new collection. However, the effectiveness of a ranking function trained on one collection may degrade when it is used on another collection, due to the generalization issues of machine learning. Transfer learning comprises a set of algorithms used to train or adapt a model for a target collection without sufficient training labels, by transferring knowledge from a related source collection with abundant labels. Transfer learning can also be applied to L2R to help train ranking functions for a new task by reusing data from a related collection while minimizing the generalization gap. Some attempts have been made to apply transfer learning techniques to L2R tasks. This thesis investigates different approaches to transfer learning for L2R, which we call transfer ranking. However, most existing studies on transfer ranking have focused on the scenario in which there is a small but insufficient number of relevance labels. The field of transfer ranking with no target collection labels is still relatively undeveloped. Moreover, the main reason a transfer ranking solution is needed is that a ranking function trained on the source collection cannot generalize to the target collection, due to differences in the data distributions of the two collections. However, the effect of these data distribution differences on ranking model generalization has not been examined in detail.
The focus of this study is the scenario in which there are no relevance labels from the new collection (the target collection), but where a related collection (the source collection) has an abundant amount of training data and labels. In this thesis, we first demonstrate the generalization gap of different L2R algorithms when the distributions of the source and target collections differ in multiple ways, and we then develop alternative solutions to the problem, which include instance weighting algorithms and self-labeling methods. Instance weighting algorithms estimate a weight for each training query in the source collection according to the target query distribution and use the weighted objective function to optimize a ranking function for the target collection. The results on different test collections suggest that instance weighting methods, including existing approaches, are not reliable. The self-labeling methods instead generate imputed relevance labels for queries in the target collection, seeking to transfer ranking knowledge by transferring label knowledge. These algorithms were tested on various transfer scenarios and showed significant effectiveness and consistency. We further demonstrate that the performance of self-labeling methods can be improved with a minimal number of calibration labels from the target collection. The algorithms and knowledge developed in this thesis can help solve generic ranking knowledge transfer problems under different scenarios.
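The instance-weighting idea can be sketched as follows, with a toy one-dimensional kernel density-ratio estimate standing in for whatever weighting scheme the thesis actually evaluates: source queries that look typical of the target distribution get large weights, and the training objective becomes a weighted average of per-query losses.

```python
import math

def importance_weights(source_feats, target_feats, bandwidth=1.0):
    """Weight each source query by a (toy) kernel density ratio target/source."""
    def density(x, sample):
        return sum(math.exp(-((x - s) ** 2) / (2 * bandwidth ** 2))
                   for s in sample) / len(sample)
    return [density(x, target_feats) / max(density(x, source_feats), 1e-12)
            for x in source_feats]

def weighted_loss(per_query_losses, weights):
    """Instance-weighted training objective over the source queries."""
    return sum(w * l for w, l in zip(weights, per_query_losses)) / sum(weights)
```

A real L2R optimizer would plug `weighted_loss` in as its objective; the sketch only shows how the weights reshape the contribution of each source query.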

    Learning to select a ranking function

    Learning To Rank (LTR) techniques aim to learn an effective document ranking function by combining several document features. While the learned function may be uniformly applied to all queries, many studies have shown that different ranking functions favour different queries, and that retrieval performance can be significantly enhanced if an appropriate ranking function is selected for each individual query. In this paper, we propose a novel Learning To Select framework that selectively applies an appropriate ranking function on a per-query basis. The approach employs a query feature to identify similar training queries for an unseen query. The ranking function which performs best on this identified training query set is then chosen for the unseen query. In particular, we propose the use of divergence, which measures the extent to which a document ranking function alters the scores of an initial ranking of documents for a given query, as a query feature. We evaluate our method using tasks from the TREC Web and Million Query tracks, in combination with the LETOR 3.0 and LETOR 4.0 feature sets. Our experimental results show that our proposed method is effective and robust for selecting an appropriate ranking function on a per-query basis. In particular, it always outperforms three state-of-the-art LTR techniques, namely Ranking SVM, AdaRank, and the automatic feature selection method.
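The divergence feature proposed here can be sketched directly: normalise the scores the initial ranking and a candidate ranking function assign to the same documents into distributions, then take their Jensen-Shannon divergence. Smoothing and score-handling details in the paper may differ from this minimal version.

```python
import math

def _normalise(scores):
    """Turn a list of non-negative scores into a probability distribution."""
    total = sum(scores)
    return [s / total for s in scores]

def _kl(p, q):
    """Kullback-Leibler divergence in bits; zero-probability terms contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(initial_scores, reranked_scores):
    """Jensen-Shannon divergence between two score distributions (0 to 1 bit)."""
    p, q = _normalise(initial_scores), _normalise(reranked_scores)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
```

A value near 0 means the ranking function barely changes the initial ranking's score profile, while a value near 1 signals a drastic alteration, which is the per-query signal the selection step exploits.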