
    Overview of the 2005 cross-language image retrieval track (ImageCLEF)

    The purpose of this paper is to outline efforts from the 2005 CLEF cross-language image retrieval campaign (ImageCLEF). The aim of this CLEF track is to explore the use of both text and content-based retrieval methods for cross-language image retrieval. Four tasks were offered in the ImageCLEF track: ad-hoc retrieval from a historic photographic collection, ad-hoc retrieval from a medical collection, an automatic image annotation task, and a user-centered (interactive) evaluation task that is explained in the iCLEF summary. 24 research groups from a variety of backgrounds and nationalities (14 countries) participated in ImageCLEF. In this paper we describe the ImageCLEF tasks, the submissions from participating groups, and summarise the main findings.

    Query Expansion Strategy based on Pseudo Relevance Feedback and Term Weight Scheme for Monolingual Retrieval

    Query Expansion using Pseudo Relevance Feedback is a useful and popular technique for reformulating a query. In our proposed query expansion method, we assume that relevant information can be found within a document near its central idea. A document is normally divided into sections, paragraphs and lines. The proposed method tries to extract keywords that are close to the central theme of the document. The expansion terms are obtained by equi-frequency partition of the documents returned by pseudo relevance feedback and by using tf-idf scores; the idf factor is calculated over the number of partitions in the documents. The group of words for query expansion is selected using three approaches: the highest score, the average score, and the group with the maximum number of keywords. As each query behaved differently under different methods, we investigate the effect of these methods on selecting the words for query expansion. In future work, we plan to extend this initial study into a rule-based statistical model that automatically selects the best group of words, incorporating tf-idf scoring and the three approaches described here. The experiments were performed on the FIRE 2011 ad-hoc Hindi and English test collections, with 50 queries each, using Terrier as the retrieval engine.
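    A minimal sketch of the partition-and-score step, assuming one plausible reading of the abstract; the function names, the tokenization, and the exact idf normalization are illustrative, not the paper's implementation.

```python
import math
from collections import Counter

def equi_frequency_partitions(tokens, n_parts=3):
    """Split a document's token stream into n contiguous parts of
    (roughly) equal token count, approximating an equi-frequency partition."""
    size = max(1, len(tokens) // n_parts)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)][:n_parts]

def score_expansion_terms(feedback_docs, n_parts=3):
    """Score candidate expansion terms with tf-idf, where the idf factor
    is computed over partitions rather than whole documents."""
    partitions = [p for doc in feedback_docs
                  for p in equi_frequency_partitions(doc, n_parts)]
    n = len(partitions)
    df = Counter()                      # partition frequency per term
    for part in partitions:
        df.update(set(part))
    scores = Counter()
    for part in partitions:
        for term, f in Counter(part).items():
            scores[term] += f * math.log(n / df[term])
    return scores

# Selection strategies from the abstract: highest score, average score,
# or the group with the most keywords; here, simply the top-k by score.
docs = [["query", "expansion", "uses", "feedback", "documents", "expansion"],
        ["central", "idea", "terms", "near", "the", "central", "theme"]]
top_terms = score_expansion_terms(docs).most_common(5)
```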

    Assessing relevance using automatically translated documents for cross-language information retrieval

    This thesis focuses on the Relevance Feedback (RF) process in the scenario of a Portuguese-English Cross-Language Information Retrieval (CLIR) system. CLIR deals with the retrieval of documents in one natural language in response to a query expressed in another language. RF is an automatic process for query reformulation. The idea behind it is that users are unlikely to produce perfect queries, especially if given just one attempt. The process aims at improving the query specification, which leads to more relevant documents being retrieved. The method consists of asking the user to analyse an initial sample of documents retrieved in response to a query and judge them for relevance. In that context, two main questions were posed. The first relates to users' ability to assess the relevance of texts in a foreign language, texts hand-translated into their language, and texts automatically translated into their language. The second concerns the relationship between the accuracy of the participants' judgements and the improvement achieved through the RF process. To answer these questions, this work carried out an experiment in which Portuguese speakers were asked to judge the relevance of English documents, documents hand-translated into Portuguese, and documents automatically translated into Portuguese. The results show that machine translation is as effective as hand translation in aiding users to assess relevance. In addition, the impact of misjudged documents on the performance of RF is overall only moderate, and varies greatly across query topics. This work advances existing research on RF by considering a CLIR scenario and carrying out user experiments that analyse aspects of RF and CLIR unexplored until now. The contributions of this work also include: the investigation of CLIR using a new language pair; the design and implementation of a stemming algorithm for Portuguese; and several experiments using Latent Semantic Indexing, which contribute data points to CLIR theory.
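    The abstract does not name the reformulation formula the RF process uses; the classic Rocchio update below is one standard instantiation of the idea, with illustrative parameter values.

```python
from collections import defaultdict

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio reformulation: move the query vector toward the
    centroid of judged-relevant documents and away from the centroid of
    judged-non-relevant ones. All vectors are {term: weight} dicts."""
    new_q = defaultdict(float)
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant)
    for doc in nonrelevant:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(nonrelevant)
    # Terms driven below zero are conventionally dropped.
    return {t: w for t, w in new_q.items() if w > 0}

# One feedback round: the user judged one document relevant, one not.
q = rocchio({"bank": 1.0, "river": 1.0},
            relevant=[{"river": 0.8, "water": 0.6}],
            nonrelevant=[{"money": 0.9}])
```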

    Mixed-Language Arabic-English Information Retrieval

    This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. To achieve this goal, however, it is first essential to suppress the impact of the problems caused by the mixed-language character of both queries and documents, which bias the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this model, the term frequency, document frequency and document length components of mixed queries are estimated and adjusted regardless of language, while at the same time the model accounts for uniquely mixed-language features of queries and documents, such as terms co-occurring in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in the non-English language) are likely to be overweighted and to skew the impact of technical terms (mostly those in English), because of the high document frequencies (and thus low weights) of the latter in their corresponding collection (mostly the English collection). This phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, the thesis also proposes a re-weighted Inverse Document Frequency (IDF) to moderate the effect of overweighted terms in mixed queries.
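    The abstract does not give the re-weighting formula, so the sketch below is only one plausible reading, not the thesis model: each term's IDF is computed from its own language's collection and normalized by that collection's size, so that English technical terms are not under-weighted merely because the English collection is large.

```python
import math

def cross_lingual_idf(df_by_lang, n_docs_by_lang, term, lang):
    """IDF from the term's own language collection, normalized to [0, 1]
    so collections of very different sizes remain comparable."""
    df = df_by_lang[lang].get(term, 0)
    n = n_docs_by_lang[lang]
    return math.log((n + 1) / (df + 1)) / math.log(n + 1)

# Example: an English technical term with a high raw df in a large English
# collection no longer dwarfs an Arabic term from a small collection.
df_by_lang = {"en": {"retrieval": 50_000}, "ar": {"استرجاع": 40}}
n_docs = {"en": 1_000_000, "ar": 5_000}
w_en = cross_lingual_idf(df_by_lang, n_docs, "retrieval", "en")
w_ar = cross_lingual_idf(df_by_lang, n_docs, "استرجاع", "ar")
```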

    Evaluating Information Retrieval and Access Tasks

    This open access book summarizes the first two decades of the NII Testbeds and Community for Information access Research (NTCIR). NTCIR is a series of evaluation forums run by a global team of researchers and hosted by the National Institute of Informatics (NII), Japan. The book is unique in that it discusses not just what was done at NTCIR, but also how it was done and the impact it has achieved. For example, in some chapters the reader sees the early seeds of what eventually grew to be the search engines that provide access to content on the World Wide Web, today's smartphones that can tailor what they show to the needs of their owners, and the smart speakers that enrich our lives at home and on the move. We also get glimpses into how new search engines can be built for mathematical formulae, or for the digital record of a lived human life. Key to the success of the NTCIR endeavor was the early recognition that information access research is an empirical discipline and that evaluation therefore lay at the core of the enterprise. Evaluation is thus at the heart of each chapter in this book; the chapters show, for example, how the recognition that some documents are more important than others has shaped thinking about evaluation design. The thirty-three contributors to this volume speak for the many hundreds of researchers from dozens of countries around the world who together shaped NTCIR as organizers and participants. This book is suitable for researchers, practitioners, and students: anyone who wants to learn about past and present evaluation efforts in information retrieval, information access, and natural language processing, as well as those who want to participate in an evaluation task or even to design and organize one.

    Proceedings

    Proceedings of the Workshop CHAT 2011: Creation, Harmonization and Application of Terminology Resources. Editors: Tatiana Gornostay and Andrejs Vasiļjevs. NEALT Proceedings Series, Vol. 12 (2011). © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/16956

    Adaptation des systèmes de recherche d'information aux contextes : le cas des requêtes difficiles

    The field of information retrieval (IR) studies how to find relevant information in one or more document collections in order to satisfy an information need. In an Information Retrieval System (IRS), the items to be retrieved are "documents" and an information need takes the form of a "query" formulated by the user. IRS performance depends on the query: queries for which the IRS fails (few or no relevant documents retrieved) are called "difficult queries" in the literature. This difficulty may be caused by term ambiguity, unclear query formulation, lack of context for the information need, the nature and structure of the document collection, and so on. This thesis aims at adapting information retrieval systems to context, particularly in the case of difficult queries. The manuscript is organized into five main chapters, besides the acknowledgements, a general introduction, and the conclusions and perspectives. The first chapter is an introduction to IR: we develop the concept of relevance, the retrieval models of the literature, query expansion, and the evaluation framework used to validate our proposals. Each of the following chapters presents one of our contributions; each states the problem, surveys the state of the art, and presents our theoretical proposals and their validation on benchmark collections.

    In chapter two, we present our research on handling ambiguous queries. Ambiguity of query terms can lead the search engine to select the wrong documents. In the state of the art, the disambiguation methods that perform well are supervised, but such methods are not applicable in a real IR setting because they require information that is normally unavailable; moreover, the literature reports term disambiguation for IR to be suboptimal. In this context, we propose an unsupervised query disambiguation method and show its effectiveness. Our approach is interdisciplinary, between natural language processing and IR. The goal of the method is to give more weight to the retrieved documents that contain the query words with the senses identified by disambiguation; this re-ordering yields a new list containing more documents that are potentially relevant to the user. We tested this re-ranking after disambiguation using two different classification techniques (Naïve Bayes [Chifu and Ionescu, 2012] and spectral clustering [Chifu et al., 2015]) on three document collections and query sets from the TREC competition (TREC7, TREC8, WT10G), and showed that the method performs well when the search engine retrieves few relevant documents (a 7.9% improvement over state-of-the-art methods).
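    The re-ranking step lends itself to a small illustration. The sketch below is an assumption about the mechanics, not the thesis's exact model: documents whose vocabulary overlaps the disambiguated sense of the query terms are boosted before re-sorting.

```python
def rerank_by_sense(ranked_docs, sense_terms, boost=0.5):
    """Re-rank an initial result list: documents containing words that
    signal the disambiguated sense of the query terms move up.
    ranked_docs: list of (doc_id, score, tokens) from the first pass;
    sense_terms: words associated with the chosen sense (e.g., from a
    sense inventory's glosses)."""
    def adjusted(doc):
        doc_id, score, tokens = doc
        overlap = len(sense_terms & set(tokens))
        return score * (1 + boost * overlap)
    return sorted(ranked_docs, key=adjusted, reverse=True)

# "jaguar" resolved to the animal sense: the wildlife document overtakes
# the higher-scored car document.
results = [("d1", 2.0, ["jaguar", "engine", "dealer"]),
           ("d2", 1.8, ["jaguar", "habitat", "predator"])]
reranked = rerank_by_sense(results, {"habitat", "predator", "animal"})
```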
    In chapter three, we present our work on predicting query difficulty; ambiguity is one difficulty factor, but not the only one. Building on the state of the art, we completed the palette of difficulty predictors. Since existing predictors are not sufficiently effective, we introduce new difficulty prediction measures that combine predictors, and we also propose a robust method for evaluating query difficulty predictors. Using combinations of predictors on the TREC7 and TREC8 collections, we obtain a 7.1% improvement in prediction quality over the state of the art [Chifu, 2013].

    In the fourth chapter we turn to the application of these prediction measures. Specifically, we propose a selective IR approach in which the predictors are used to decide which search engine, among several, would answer a given query best. The decision model is learned by an SVM (Support Vector Machine). We tested the model on TREC benchmark collections (Robust, WT10G, GOV2): the learned models classified the test queries with over 90% accuracy, and retrieval performance improved by more than 11% compared to non-selective methods [Chifu and Mothe, 2014].
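    A minimal sketch of such a selective (query-routing) setup, assuming scikit-learn for the SVM; the feature names and values are purely illustrative, not the thesis's predictor set.

```python
import numpy as np
from sklearn.svm import SVC

# Each row holds difficulty-predictor values for one training query,
# e.g. [query length, mean term IDF, a clarity-style score] (illustrative).
X_train = np.array([[3, 2.1, 0.40],
                    [7, 1.2, 0.05],
                    [5, 1.8, 0.22],
                    [2, 2.5, 0.55]])
# Label: index of the engine that achieved the best effectiveness
# (e.g. average precision) on that query.
y_train = np.array([0, 1, 1, 0])

router = SVC(kernel="rbf").fit(X_train, y_train)

def route(query_features):
    """Return the index of the engine predicted to perform best."""
    return int(router.predict([query_features])[0])

best_engine = route([4, 1.9, 0.30])
```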
    In the last chapter, we address an important issue in IR: query expansion by adding terms. It is very difficult to predict the expansion parameters or to anticipate whether a query needs expansion at all. We present our contribution to optimizing, per query, the lambda parameter of RM3 (a pseudo-relevance feedback model for query expansion). We tested several hypotheses, both with and without prior information, searching for the minimum amount of information needed to make optimization of the expansion parameter possible. The results are not satisfactory, even though we applied a wide range of methods, including SVMs, regression, logistic regression and similarity measures; these findings reinforce the conclusion that this optimization problem is genuinely hard. This research was conducted partly during a three-month research visit to the Technion in Haifa, Israel, in 2013, and continued afterwards in collaboration with the Technion team; in Haifa we worked with Professor Oren Kurland and PhD student Anna Shtok.

    In conclusion, this thesis proposes new methods for improving the performance of IR systems based on query difficulty. The results of the methods proposed in chapters two, three and four show substantial improvements and open perspectives for future research. The analysis presented in chapter five confirms the difficulty of the parameter optimization problem and motivates further work on tuning selective query expansion.
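    For reference, the RM3 expansion discussed in the last chapter interpolates the original query language model with a relevance model estimated from feedback documents; lambda is the per-query weight the chapter tries to optimize. A minimal sketch (implementations differ on which side of the interpolation lambda weights):

```python
def rm3_query_model(query_lm, relevance_model, lam):
    """RM3: p(w | q') = (1 - lam) * p(w | query) + lam * p(w | RM1),
    where RM1 is the relevance model built from feedback documents."""
    vocab = set(query_lm) | set(relevance_model)
    return {w: (1 - lam) * query_lm.get(w, 0.0)
               + lam * relevance_model.get(w, 0.0)
            for w in vocab}

# lam = 0 keeps the original query; lam = 1 relies purely on feedback.
expanded = rm3_query_model({"jaguar": 0.5, "car": 0.5},
                           {"car": 0.3, "engine": 0.4, "speed": 0.3},
                           lam=0.4)
```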