    Dublin City University at the TREC 2005 terabyte track

    For the 2005 Terabyte track in TREC, Dublin City University participated in all three tasks: Adhoc, Efficiency and Named Page Finding. Our runs in all tasks were primarily focused on the application of "Top Subset Retrieval" to the Terabyte Track. This retrieval approach utilises different types of sorted inverted indices so that fewer documents are processed in order to reduce query times, and does so in a way that minimises loss of effectiveness in terms of query precision. We also compare a distributed version of our Físréal search system [1][2] against the same system deployed on a single machine.
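
    A minimal sketch of the kind of impact-ordered indexing and early-termination query processing the abstract describes; the scoring function, the per-term postings budget, and all names below are illustrative assumptions, not the Físréal implementation.

```python
# Sketch of "top subset retrieval": postings are pre-sorted by a per-document
# impact score so that a query can stop after examining only the first few
# entries of each list.  The impact here is simply the term frequency.
from collections import defaultdict

def build_impact_sorted_index(docs):
    """docs: {doc_id: text}.  Returns {term: [(impact, doc_id), ...]} sorted
    descending by impact."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((tf, doc_id))
    for term in index:
        index[term].sort(reverse=True)  # highest-impact documents first
    return index

def top_subset_query(index, query, postings_budget=3):
    """Score a query while reading at most `postings_budget` postings per term."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for impact, doc_id in index.get(term, [])[:postings_budget]:
            scores[doc_id] += impact  # simplistic additive scoring
    return sorted(scores.items(), key=lambda x: -x[1])

if __name__ == "__main__":
    docs = {1: "terabyte track retrieval", 2: "retrieval retrieval speed", 3: "named page finding"}
    idx = build_impact_sorted_index(docs)
    print(top_subset_query(idx, "retrieval track"))
```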

    Parsimonious Language Models for a Terabyte of Text

    The aims of this paper are twofold. Our first aim is to compare results of the earlier Terabyte tracks to the Million Query track. We submitted a number of runs using different document representations (such as full-text, title-fields, or incoming anchor-texts) to increase pool diversity. The initial results show broad agreement in system rankings over various measures on topic sets judged at both the Terabyte and Million Query tracks, with runs using the full-text index giving superior results on all measures, but also some noteworthy upsets. Our second aim is to explore the use of parsimonious language models for retrieval on terabyte-scale collections. These models are smaller and thus more efficient than the standard language models when used at indexing time, and they may also improve retrieval performance. We have conducted initial experiments using parsimonious models in combination with pseudo-relevance feedback, for both the Terabyte and Million Query track topic sets, and obtained promising initial results.
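
    A brief sketch of parsimonious language model estimation via the standard EM formulation, assuming an illustrative mixing weight and pruning threshold; it is not the authors' exact setup or their pseudo-relevance feedback pipeline.

```python
# Parsimonious language model estimation via EM: terms that the background
# corpus model already explains well are pushed toward zero probability and
# pruned, leaving a small model of the document's distinctive terms.
from collections import Counter

def parsimonious_lm(doc_terms, corpus_model, lam=0.1, threshold=1e-4, iters=20):
    tf = Counter(doc_terms)
    # start from the maximum-likelihood document model
    p_doc = {t: c / len(doc_terms) for t, c in tf.items()}
    for _ in range(iters):
        # E-step: expected term counts attributed to the document, not the corpus
        e = {t: tf[t] * (lam * p_doc[t]) /
                 (lam * p_doc[t] + (1 - lam) * corpus_model.get(t, 1e-9))
             for t in p_doc}
        total = sum(e.values())
        # M-step: re-normalise, dropping terms that fall below the threshold
        p_doc = {t: v / total for t, v in e.items() if v / total > threshold}
    return p_doc

if __name__ == "__main__":
    corpus = {"the": 0.1, "of": 0.08, "language": 0.001, "parsimonious": 0.0001}
    doc = "the parsimonious language model of the language".split()
    print(parsimonious_lm(doc, corpus))
```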

    Using Parsimonious Language Models on Web Data

    In this paper we explore the use of parsimonious language models for web retrieval. These models are smaller and thus more efficient than the standard language models, and are therefore well suited for large-scale web retrieval. We have conducted experiments on four TREC topic sets, and found that the parsimonious language model improves retrieval effectiveness over the standard language model for all data sets and measures. In all cases the improvement is significant, and more substantial than in earlier experiments on newspaper/newswire data.
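
    For contrast with the parsimonious estimate above, a small sketch of the standard query-likelihood baseline with Jelinek-Mercer smoothing, under an assumed smoothing weight and toy data; a parsimonious document model could be substituted for the document estimate unchanged.

```python
# Query-likelihood scoring with Jelinek-Mercer smoothing: the document model
# is interpolated with a background corpus model before taking log-probabilities.
import math
from collections import Counter

def score(query_terms, doc_terms, corpus_model, lam=0.8):
    doc_counts = Counter(doc_terms)
    dlen = len(doc_terms)
    s = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / dlen if dlen else 0.0   # document estimate
        p_col = corpus_model.get(t, 1e-9)               # background estimate
        s += math.log(lam * p_doc + (1 - lam) * p_col)
    return s

if __name__ == "__main__":
    corpus = {"web": 0.01, "retrieval": 0.005, "the": 0.1}
    d1 = "web retrieval with language models for the web".split()
    d2 = "the the the unrelated page".split()
    print(score(["web", "retrieval"], d1, corpus), score(["web", "retrieval"], d2, corpus))
```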

    Index ordering by query-independent measures

    Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find the documents with the highest scores. For particularly large collections this may be extremely time consuming. A solution to this problem is to search only a limited portion of the collection at query time, in order to speed up the retrieval process, while also limiting the loss in retrieval efficacy (in terms of accuracy of results). We achieve this by first identifying the most “important” documents within the collection and sorting the documents within inverted file lists in order of this “importance”. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient but also limits the loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of the number of postings examined, without significant loss of effectiveness when based on several measures of importance used in isolation and in combination. Our results point to several ways in which the computational cost of searching large collections of documents can be significantly reduced.
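
    A hypothetical illustration of ordering an inverted file by a combination of query-independent measures (here a normalised in-link count and a document-length prior); the specific measures and weights are assumptions, not the paper's exact combination.

```python
# Combine query-independent importance measures into one score, then sort each
# term's posting list so that the most "important" documents come first; query
# processing can then stop early with bounded loss of effectiveness.
def combined_importance(docs, w_inlinks=0.7, w_length=0.3):
    """docs: {doc_id: {"inlinks": int, "length": int}} -> {doc_id: score}"""
    max_in = max(d["inlinks"] for d in docs.values()) or 1
    max_len = max(d["length"] for d in docs.values()) or 1
    return {doc_id: w_inlinks * d["inlinks"] / max_in + w_length * d["length"] / max_len
            for doc_id, d in docs.items()}

def order_postings(postings, importance):
    """postings: {term: [doc_id, ...]}.  Sort each list by descending importance."""
    return {t: sorted(plist, key=lambda doc_id: -importance[doc_id])
            for t, plist in postings.items()}

if __name__ == "__main__":
    docs = {1: {"inlinks": 40, "length": 900}, 2: {"inlinks": 3, "length": 120},
            3: {"inlinks": 15, "length": 5000}}
    imp = combined_importance(docs)
    print(order_postings({"trec": [1, 2, 3], "terabyte": [2, 3]}, imp))
```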

    Evaluating epistemic uncertainty under incomplete assessments

    This study proposes an extended methodology for laboratory-based Information Retrieval evaluation under incomplete relevance assessments. The new methodology aims to identify potential uncertainty during system comparison that may result from incompleteness. The adoption of this methodology is advantageous because the detection of epistemic uncertainty - the amount of knowledge (or ignorance) we have about the estimate of a system's performance - during the evaluation process can guide and direct researchers when evaluating new systems over existing and future test collections. Across a series of experiments we demonstrate how this methodology can lead towards a finer-grained analysis of systems. In particular, we show through experimentation how the current practice in Information Retrieval evaluation of using a measurement depth larger than the pooling depth increases uncertainty during system comparison.
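
    One simple way to make the epistemic uncertainty visible is to bound a metric by scoring unjudged documents both pessimistically and optimistically; the sketch below does this for precision at k and is an illustration only, not the methodology proposed in the study.

```python
# Bounds on precision@k under incomplete judgments: the lower bound treats
# every unjudged document in the top k as non-relevant, the upper bound treats
# it as relevant.  The width of the interval reflects the uncertainty caused
# by incompleteness.
def precision_at_k_bounds(ranking, judged_relevant, judged_nonrelevant, k=10):
    top = ranking[:k]
    rel = sum(1 for d in top if d in judged_relevant)
    unjudged = sum(1 for d in top
                   if d not in judged_relevant and d not in judged_nonrelevant)
    return rel / k, (rel + unjudged) / k  # (pessimistic, optimistic)

if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4", "d5"]
    # d2 and d5 are unjudged, so P@5 is only known to lie in [0.4, 0.8]
    print(precision_at_k_bounds(ranking, {"d1", "d4"}, {"d3"}, k=5))
```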

    Exploring Topic-based Language Models for Effective Web Information Retrieval

    The main obstacle to providing focused search is the relative opaqueness of search requests -- searchers tend to express their complex information needs in only a couple of keywords. Our overall aim is to find out if, and how, topic-based language models can lead to more effective web information retrieval. In this paper we explore the retrieval performance of a topic-based model that combines topical models with other language models based on cross-entropy. We first define our topical categories and train our topical models on the .GOV2 corpus by building parsimonious language models. We then test the topic-based model on the TREC-8 small Web data collection for ad hoc search. Our experimental results show that the topic-based model outperforms both the standard language model and the parsimonious model.
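
    A rough sketch, under assumed mixture weights and toy models, of combining a topical language model with a document model and ranking by cross-entropy against the query model, in the spirit of the topic-based model described above.

```python
# Mix a document model, a topical model and a background model, then score a
# query by the cross-entropy between the query model and the mixture
# (lower cross-entropy = better match).
import math
from collections import Counter

def mix(doc_model, topic_model, corpus_model, w=(0.6, 0.3, 0.1)):
    terms = set(doc_model) | set(topic_model) | set(corpus_model)
    return {t: w[0] * doc_model.get(t, 0.0)
             + w[1] * topic_model.get(t, 0.0)
             + w[2] * corpus_model.get(t, 0.0) for t in terms}

def cross_entropy_score(query, mixed_model):
    q, qlen = Counter(query), len(query)
    return -sum((c / qlen) * math.log(mixed_model.get(t, 1e-12)) for t, c in q.items())

if __name__ == "__main__":
    doc = {"tax": 0.2, "forms": 0.1, "government": 0.05}
    topic = {"tax": 0.15, "revenue": 0.1, "finance": 0.1}
    corpus = {"the": 0.1, "tax": 0.01, "forms": 0.005}
    print(cross_entropy_score(["tax", "forms"], mix(doc, topic, corpus)))
```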

    Relevance-based Word Embedding

    Learning a high-dimensional dense representation for vocabulary terms, also known as a word embedding, has recently attracted much attention in natural language processing and information retrieval tasks. The embedding vectors are typically learned based on term proximity in a large corpus. This means that the objective in well-known word embedding algorithms, e.g., word2vec, is to accurately predict adjacent word(s) for a given word or context. However, this objective is not necessarily equivalent to the goal of many information retrieval (IR) tasks. The primary objective in various IR tasks is to capture relevance instead of term proximity, syntactic, or even semantic similarity. This is the motivation for developing unsupervised relevance-based word embedding models that learn word representations based on query-document relevance information. In this paper, we propose two learning models with different objective functions; one learns a relevance distribution over the vocabulary set for each query, and the other classifies each term as belonging to the relevant or non-relevant class for each query. To train our models, we used over six million unique queries and the top ranked documents retrieved in response to each query, which are assumed to be relevant to the query. We extrinsically evaluate our learned word representation models using two IR tasks: query expansion and query classification. Both query expansion experiments on four TREC collections and query classification experiments on the KDD Cup 2005 dataset suggest that the relevance-based word embedding models significantly outperform state-of-the-art proximity-based embedding models, such as word2vec and GloVe. (To appear in the proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17.)
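
    A toy sketch of the query-expansion use case: expanding a query with the nearest terms in some embedding space. The vectors, vocabulary, and cosine-similarity selection are assumptions for illustration; this is not the paper's relevance-based training objective.

```python
# Given word vectors (e.g. learned by a relevance-based embedding model),
# expand a query with the k terms whose vectors are closest to the query's
# mean vector by cosine similarity.
import numpy as np

def expand_query(query_terms, embeddings, k=3):
    """embeddings: {term: np.ndarray}.  Returns the k nearest terms not in the query."""
    q_vec = np.mean([embeddings[t] for t in query_terms if t in embeddings], axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    scored = [(cosine(q_vec, v), t) for t, v in embeddings.items() if t not in query_terms]
    return [t for _, t in sorted(scored, reverse=True)[:k]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab = ["jaguar", "car", "engine", "animal", "speed"]
    emb = {t: rng.normal(size=8) for t in vocab}   # random toy vectors
    print(expand_query(["jaguar", "car"], emb, k=2))
```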

    Web Page Retrieval by Combining Evidence

    The participation of the REINA Research Group in WebCLEF 2005 focused on the monolingual mixed task. Queries or topics are of two types: named pages and home pages. For both, we first perform a search by thematic content: for the same query, we search several elements of information from every page (title, some meta tags, anchor text) and then combine the results. For queries about home pages, we try to detect them using a method based on certain keywords and their patterns of use. Afterwards, the results of the thematic content retrieval are re-ranked based on PageRank and centrality coefficients.
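
    A hypothetical sketch of the two steps the abstract outlines: a weighted combination of field evidence (title, meta tags, anchor text) followed by re-ranking with a query-independent prior such as PageRank. Field weights and the linear interpolation are assumptions, not the REINA group's exact method.

```python
# Combine per-field retrieval scores for a page, then interpolate the combined
# content score with a query-independent prior to re-rank home-page candidates.
def combined_field_score(field_scores, weights=None):
    """field_scores: {field_name: score} for one page."""
    weights = weights or {"title": 0.5, "meta": 0.2, "anchor": 0.3}
    return sum(weights.get(f, 0.0) * s for f, s in field_scores.items())

def rerank_by_prior(results, prior, alpha=0.7):
    """results: [(doc_id, content_score)]; prior: {doc_id: PageRank-like score}."""
    return sorted(((doc_id, alpha * s + (1 - alpha) * prior.get(doc_id, 0.0))
                   for doc_id, s in results), key=lambda x: -x[1])

if __name__ == "__main__":
    pages = {"p1": {"title": 0.9, "anchor": 0.4}, "p2": {"title": 0.3, "meta": 0.8, "anchor": 0.9}}
    results = [(p, combined_field_score(f)) for p, f in pages.items()]
    print(rerank_by_prior(results, {"p1": 0.2, "p2": 0.6}))
```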

    Enhancing access to the Bibliome: the TREC 2004 Genomics Track

    BACKGROUND: The goal of the TREC Genomics Track is to improve information retrieval in the area of genomics by creating test collections that allow researchers to improve their systems and better understand their failures. The 2004 track included an ad hoc retrieval task, simulating use of a search engine to obtain documents about biomedical topics. This paper describes the Genomics Track of the Text Retrieval Conference (TREC) 2004, a forum for evaluation of IR research systems, where retrieval in the genomics domain has recently begun to be assessed. RESULTS: A total of 27 research groups submitted 47 different runs. The most effective runs, as measured by the primary evaluation measure of mean average precision (MAP), used a combination of domain-specific and general techniques. The best MAP obtained by any run was 0.4075. Techniques that expanded queries with gene name lists as well as words from related articles had the best efficacy. However, many runs performed more poorly than a simple baseline run, indicating that careful selection of system features is essential. CONCLUSION: The various approaches to ad hoc retrieval achieved widely differing efficacy. The TREC Genomics Track and its test collection resources provide tools that allow improvement in information retrieval systems.
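
    A toy sketch of the query-expansion idea the track found most effective: adding known aliases of gene names mentioned in a topic to the query. The synonym table is a made-up illustration, not a real gene lexicon or any participant's system.

```python
# Expand a biomedical query with synonyms/aliases of any gene names it contains.
# The lookup table below is purely illustrative.
GENE_SYNONYMS = {
    "tp53": ["p53", "tumor protein p53"],
    "brca1": ["breast cancer 1", "rnf53"],
}

def expand_with_gene_synonyms(query_terms):
    expanded = list(query_terms)
    for t in query_terms:
        expanded.extend(GENE_SYNONYMS.get(t.lower(), []))
    return expanded

if __name__ == "__main__":
    print(expand_with_gene_synonyms(["TP53", "mutation"]))
```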