    Predicting IR Personalization Performance using Pre-retrieval Query Predictors

    Personalization generally improves query performance, but in a few cases it may also harm it. If we can predict those situations and disable personalization for them, overall performance will be higher and users will be more satisfied with personalized systems. For this purpose, we use several state-of-the-art pre-retrieval query performance predictors and propose others that incorporate user profile information. We study the correlations among these predictors and the difference between the personalized and the original queries. We also use classification and regression techniques to improve the results, finally reaching slightly more than one third of the maximum ideal performance. We consider this a good starting point in this research line, which certainly needs further effort and improvement. This work has been supported by the Spanish Andalusian "Consejería de Innovación, Ciencia y Empresa" postdoctoral phase of project P09-TIC-4526, the Spanish "Ministerio de Economía y Competitividad" projects TIN2013-42741-P and TIN2016-77902-C3-2-P, and the European Regional Development Fund (ERDF-FEDER).
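Pre-retrieval predictors of the kind referred to above are computed from the query terms and collection statistics alone, before any search is run. A minimal illustrative sketch, where the toy document frequencies and the choice of IDF-based predictors are assumptions rather than the paper's exact predictor set:

```python
import math

# Toy collection statistics: term -> document frequency (invented values).
DOC_FREQ = {"jaguar": 120, "speed": 900, "car": 1500}
NUM_DOCS = 10_000

def idf(term):
    # Inverse document frequency; unseen terms get the maximum IDF.
    df = DOC_FREQ.get(term, 1)
    return math.log(NUM_DOCS / df)

def avg_idf(query_terms):
    # avgIDF: a classic pre-retrieval predictor; higher values suggest
    # a more specific (and typically easier) query.
    return sum(idf(t) for t in query_terms) / len(query_terms)

def max_idf(query_terms):
    # maxIDF: specificity of the rarest query term.
    return max(idf(t) for t in query_terms)

query = ["jaguar", "speed"]
scores = {"avg_idf": avg_idf(query), "max_idf": max_idf(query)}
```

In a real system, the document frequencies would come from the index, and the resulting per-query scores could feed the classification and regression models mentioned in the abstract.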

    Utilisation of metadata fields and query expansion in cross-lingual search of user-generated Internet video

    Recent years have seen significant efforts in the area of Cross-Language Information Retrieval (CLIR) for text retrieval. This work initially focused on formally published content, but more recently research has begun to concentrate on CLIR for informal social media content. However, despite the current expansion in online multimedia archives, there has been little work on CLIR for this content. While there has been some limited work on Cross-Language Video Retrieval (CLVR) for professional videos, such as documentaries or TV news broadcasts, there has to date been no significant investigation of CLVR for the rapidly growing archives of informal user-generated content (UGC). Key differences between such UGC and professionally produced content are the nature and structure of the textual metadata associated with it, as well as the form and quality of the content itself. In this setting, retrieval effectiveness may suffer not only from translation errors common to all CLIR tasks, but also from recognition errors introduced by the automatic speech recognition (ASR) systems used to transcribe the spoken content of the video, and from the informality and inconsistency of the user-created metadata associated with each video. This work proposes and evaluates techniques to improve CLIR effectiveness on such noisy UGC content. Our experimental investigation shows that different sources of evidence, e.g. the content of different fields of the structured metadata, significantly affect CLIR effectiveness. Results from our experiments also show that each metadata field has a varying robustness to query expansion (QE), which can therefore have a negative impact on CLIR effectiveness. We propose a novel adaptive QE technique that predicts the most reliable source for expansion, and show that this technique can be effective in improving CLIR effectiveness for UGC content.
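The adaptive QE idea above can be sketched as: estimate, per metadata field, how reliable its top-ranked results look, and draw expansion terms only from the most reliable field. The coherence heuristic and the toy field data below are illustrative assumptions, not the paper's actual reliability predictor:

```python
from collections import Counter

# Toy top-ranked documents per metadata field (hypothetical UGC metadata).
FIELD_TOP_DOCS = {
    "title":       ["cat video funny", "funny cat clip"],
    "description": ["uploaded from phone", "my holiday trip video"],
}

def field_coherence(docs):
    # Crude reliability signal: fraction of term occurrences that are
    # repeated across the field's top documents (repetition suggests a
    # coherent, on-topic result set).
    terms = [t for d in docs for t in d.split()]
    counts = Counter(terms)
    return sum(c for c in counts.values() if c > 1) / max(len(terms), 1)

def expansion_terms(query_terms, n=2):
    # Pick the most coherent field, then take its most frequent
    # non-query terms as expansion terms (pseudo-relevance feedback).
    field = max(FIELD_TOP_DOCS,
                key=lambda f: field_coherence(FIELD_TOP_DOCS[f]))
    counts = Counter(t for d in FIELD_TOP_DOCS[field] for t in d.split()
                     if t not in query_terms)
    return [t for t, _ in counts.most_common(n)]
```

Here the "title" field would be selected for the query {"cat"}, because its top documents share vocabulary while the "description" documents do not.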

    Learning to Choose the Best System Configuration in Information Retrieval: the case of repeated queries

    This paper presents a method that automatically decides which system configuration should be used to process a query. The method is developed for the case of repeated queries and implements a new kind of meta-system. It is based on a training process: the meta-system learns the best system configuration to use on a per-query basis. After training, the meta-search system knows which configuration should handle a given query. The Learning to Choose method we developed selects the best configurations among many. This selective process rests on data analytics applied to system parameter values and their link with system effectiveness. Moreover, we optimize the parameters on a per-query basis. The training phase uses a limited amount of document relevance judgments. When the query is repeated, or when an identical query is submitted to the system, the meta-system automatically knows which parameters it should use to process it. The method fits the case of changing collections, since what is learnt is the relationship between a query and the best parameters to process it, rather than the relationship between a query and the documents to retrieve. In this paper, we describe how data analysis can help select, among various configurations, the ones that will be useful. The "Learning to Choose" method is presented and evaluated using simulated data from TREC campaigns. We show that system performance increases substantially in terms of precision, specifically for queries that are difficult or moderately difficult to answer. The other parameters of the method are also studied.
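The training and lookup steps of such a meta-system can be sketched in a few lines; the configuration names, queries and effectiveness scores below are invented for illustration, not taken from the paper:

```python
# Toy training data: effectiveness (e.g., average precision) of each
# system configuration on each training query (invented values).
TRAIN_EFFECTIVENESS = {
    "ship wrecks":   {"bm25": 0.31, "bm25+qe": 0.44, "lm": 0.28},
    "poliomyelitis": {"bm25": 0.52, "bm25+qe": 0.40, "lm": 0.47},
}

# Training phase: learn, per query, which configuration performed best.
BEST_CONFIG = {
    q: max(scores, key=scores.get)
    for q, scores in TRAIN_EFFECTIVENESS.items()
}

def choose_config(query, default="bm25"):
    # Meta-system: for a repeated (previously seen) query, use the
    # learned best configuration; otherwise fall back to a default.
    return BEST_CONFIG.get(query, default)
```

Because the mapping is from query to parameters rather than from query to documents, the same learned choices remain usable when the underlying collection changes, as the abstract notes.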

    Adapting information retrieval systems to contexts: the case of difficult queries

    The field of information retrieval (IR) studies the mechanisms for finding relevant information in one or more document collections in order to satisfy an information need. In an Information Retrieval System (IRS), the information sought takes the form of "documents" and the information need takes the form of a "query" formulated by the user. IRS performance depends on the query. Queries for which the IRS fails (few or no relevant documents retrieved) are called "difficult queries" in the literature. This difficulty may be caused by term ambiguity, unclear query formulation, lack of context for the information need, the nature and structure of the document collection, etc. This thesis aims at adapting IRS to contexts, particularly in the case of difficult queries. 
The manuscript is organized into five main chapters, in addition to the acknowledgements, the general introduction, and the conclusions and perspectives. The first chapter is an introduction to IR. We develop the concept of relevance, the retrieval models from the literature, query expansion, and the evaluation framework employed to validate our proposals. Each of the following chapters presents one of our contributions: it states the research problem, surveys the related work, and presents our theoretical proposals and their validation on benchmark collections. In chapter two, we present our research on handling ambiguous queries. Query term ambiguity can indeed lead the search engine to select the wrong documents. In the related work, the disambiguation methods that yield good performance are supervised, but such methods are not applicable in a real IR context, as they require information that is normally unavailable. Moreover, in the literature, term disambiguation for IR is reported to be suboptimal. In this context, we propose an unsupervised query disambiguation method and show its effectiveness. Our approach is interdisciplinary, between the fields of natural language processing and IR. The goal of our unsupervised disambiguation method is to give more importance to the documents retrieved by the search engine that contain the query terms with the meanings identified by disambiguation. This re-ranking provides a new document list that contains more documents potentially relevant to the user. We tested this re-ranking method after disambiguation using two different classification techniques (Naïve Bayes [Chifu and Ionescu, 2012] and spectral clustering [Chifu et al., 2015]), over three document collections and queries from the TREC competition (TREC7, TREC8, WT10G). 
We have shown that the disambiguation method works well in the case of poorly performing queries (a 7.9% improvement over state-of-the-art methods). In chapter three, we present our work on query difficulty prediction. Indeed, while ambiguity is a difficulty factor, it is not the only one. We completed the range of difficulty predictors by building on the state of the art. Existing predictors are not sufficiently effective, so we introduce new difficulty prediction measures that combine predictors. We also propose a robust method to evaluate difficulty predictors. Using predictor combinations on the TREC7 and TREC8 collections, we obtain an improvement of 7.1% in prediction quality compared to the state of the art [Chifu, 2013]. In the fourth chapter we focus on the application of difficulty predictors. Specifically, we propose a selective IR approach: predictors are employed to decide which search engine, among several, would perform best for a given query. The decision model is learned by an SVM (Support Vector Machine). We tested our model on TREC benchmark collections (Robust, WT10G, GOV2). The learned models classified the test queries with over 90% accuracy, and retrieval results improved by more than 11% in terms of performance compared to non-selective methods [Chifu and Mothe, 2014]. In the last chapter, we address an important issue in the field of IR: query expansion by adding terms. It is very difficult to predict the expansion parameters or to anticipate whether a query needs expansion at all. We present our contribution to optimizing, per query, the lambda parameter of RM3 (a pseudo-relevance feedback model for query expansion). We tested several hypotheses, both with and without prior information, searching for the minimum amount of information needed to make the optimization of the expansion parameter possible. 
The results are not satisfactory, even though we used a wide range of methods, such as SVM, regression, logistic regression and similarity measures. These findings therefore reinforce the conclusion regarding the difficulty of this optimization problem. The research was conducted not only during a three-month research stay at the Technion institute in Haifa, Israel, in 2013, but also thereafter, in continued collaboration with the Technion team. In Haifa, we worked with Professor Oren Kurland and PhD student Anna Shtok. In conclusion, in this thesis we proposed new methods to improve the performance of IRS by taking query difficulty into account. The results of the methods proposed in chapters two, three and four show significant improvements and open perspectives for future research. The analysis presented in chapter five confirms the difficulty of the optimization problem for the parameter concerned and calls for deeper investigation into the settings of selective query expansion.
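The RM3 lambda parameter discussed in the last chapter controls the mixture between the original query language model and a relevance model estimated from pseudo-relevant documents. A minimal sketch of that interpolation follows; the toy probabilities are invented, and conventions differ across implementations as to which side of the mixture lambda weights:

```python
def rm3_interpolate(query_model, relevance_model, lam):
    # RM3 mixes the original query model with a relevance model built
    # from pseudo-relevant documents. Here lam is the weight given to
    # the relevance model; (1 - lam) stays on the original query.
    vocab = set(query_model) | set(relevance_model)
    return {w: (1 - lam) * query_model.get(w, 0.0)
               + lam * relevance_model.get(w, 0.0)
            for w in vocab}

# Toy unigram models (illustrative probabilities, each summing to 1).
q_model = {"jaguar": 0.6, "speed": 0.4}
r_model = {"jaguar": 0.3, "car": 0.5, "animal": 0.2}

expanded = rm3_interpolate(q_model, r_model, lam=0.5)
```

Since both inputs are probability distributions, the interpolated model is one as well; the per-query optimization problem studied in the thesis is choosing `lam` (here fixed at 0.5 for illustration) for each individual query.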

    Automatic Classification of Queries by Expected Retrieval Performance

    This paper presents a method for automatically predicting the degree of average relevance of the document set returned by a retrieval system in response to a query. For a given retrieval system and document collection, prediction is framed as query classification. Two classes of queries are defined: easy and hard. The split point between the two classes is the median value of average precision over the query collection. The paper proposes several classifiers that select useful features from a set of candidates and use them to predict the class of a query. Classifiers are trained on the results of the systems involved in the TREC 8 campaign. Due to the limited number of available queries, training and testing are performed with the leave-one-out and 10-fold cross-validation methods. Two types of classifiers, namely decision trees and support vector machines, provide particularly interesting results for a number of systems. A fairly high classification accuracy is obtained on the TREC 8 data (more than 80% correct predictions in some settings).
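The easy/hard ground truth used here is simply a median split on per-query average precision, which can be sketched as follows (the AP values are invented for illustration):

```python
import statistics

# Toy per-query average precision scores for one system (invented values).
AP = {"q1": 0.12, "q2": 0.45, "q3": 0.30, "q4": 0.71, "q5": 0.05}

def label_queries(ap_scores):
    # Split the query set at the median average precision:
    # at or above the median -> "easy", below -> "hard".
    median = statistics.median(ap_scores.values())
    return {q: "easy" if ap >= median else "hard"
            for q, ap in ap_scores.items()}

labels = label_queries(AP)
```

The median split guarantees roughly balanced classes per system, which is what makes the resulting binary classification task well posed despite the small number of TREC queries.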