497 research outputs found

Parsimonious Language Models for a Terabyte of Text

Author: Hiemstra Djoerd
Kamps Jaap
Kaptein Rianne
Li Rongmei
Publication venue: US National Institute of Standards and Technology (NIST)
Publication date: 01/01/2008

The aims of this paper are twofold. Our first aim is to compare results of the earlier Terabyte tracks to the Million Query track. We submitted a number of runs using different document representations (such as full-text, title-fields, or incoming anchor-texts) to increase pool diversity. The initial results show broad agreement in system rankings over various measures on topic sets judged at both Terabyte and Million Query tracks, with runs using the full-text index giving superior results on all measures, but also some noteworthy upsets. Our second aim is to explore the use of parsimonious language models for retrieval on terabyte-scale collections. These models are smaller thus more efficient than the standard language models when used at indexing time, and they may also improve retrieval performance. We have conducted initial experiments using parsimonious models in combination with pseudo-relevance feedback, for both the Terabyte and Million Query track topic sets, and obtained promising initial results.

University of Twente Research Information

UvA-DARE

International Migration, Integration and Social Cohesion online publications

Using Parsimonious Language Models on Web Data

Author: Hiemstra Djoerd
Kamps Jaap
Kaptein Rianne
Li Rongmei
Publication venue: ACM Press
Publication date: 01/01/2008

In this paper we explore the use of parsimonious language models for web retrieval. These models are smaller thus more efficient than the standard language models and are therefore well suited for large-scale web retrieval. We have conducted experiments on four TREC topic sets, and found that the parsimonious language model results in improvement of retrieval effectiveness over the standard language model for all data-sets and measures. In all cases the improvement is significant, and more substantial than in earlier experiments on newspaper/newswire data.
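The abstract does not give the estimation procedure, so the following is only a minimal sketch of the EM-style estimation usually associated with parsimonious language models: probability mass that the background (collection) model already explains is removed from the document model, and terms that fall below a threshold are pruned. The function name `parsimonious_lm` and the parameters `lam`, `threshold`, and `iterations` are hypothetical choices for illustration, not values taken from the paper.

```python
from collections import Counter

def parsimonious_lm(doc_tokens, background, lam=0.1, threshold=1e-4, iterations=20):
    """Estimate a pruned document model P(t|D) against a background model P(t|C).

    Assumed formulation (not stated in the abstract):
      E-step: e_t = tf(t, D) * lam * P(t|D) / ((1 - lam) * P(t|C) + lam * P(t|D))
      M-step: P(t|D) = e_t / sum of e_t, dropping terms below `threshold`
    """
    tf = Counter(doc_tokens)
    total = sum(tf.values())
    # Start from the maximum-likelihood document model.
    p_doc = {t: c / total for t, c in tf.items()}
    for _ in range(iterations):
        # E-step: expected count of each term attributed to the document model.
        expected = {}
        for t, p in p_doc.items():
            p_bg = background.get(t, 1e-9)  # tiny floor for unseen terms (assumption)
            expected[t] = tf[t] * (lam * p) / ((1 - lam) * p_bg + lam * p)
        norm = sum(expected.values())
        # M-step: renormalise and prune terms with negligible probability.
        p_doc = {t: v / norm for t, v in expected.items() if v / norm >= threshold}
        # Renormalise again so the surviving terms sum to one.
        z = sum(p_doc.values())
        p_doc = {t: p / z for t, p in p_doc.items()}
    return p_doc

# Toy data: common words are well explained by the background model.
tokens = "the the the the retrieval of the parsimonious model of retrieval".split()
background = {"the": 0.5, "of": 0.3, "retrieval": 0.01, "parsimonious": 0.001, "model": 0.01}
model = parsimonious_lm(tokens, background)
```

In this toy example the common words lose probability mass to the content words, which is why such a model can end up with fewer terms than the standard one.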

CiteSeerX

Radboud Repository

University of Twente Research Information

International Migration, Integration and Social Cohesion online publications

Exploring Topic-based Language Models for Effective Web Information Retrieval

Author: Hiemstra Djoerd
Kamps Jaap
Kaptein Rianne
Li Rongmei
Publication venue: Neslia Paniculata
Publication date: 01/01/2008

The main obstacle to providing focused search is the relative opaqueness of search requests: searchers tend to express their complex information needs in only a couple of keywords. Our overall aim is to find out if, and how, topic-based language models can lead to more effective web information retrieval. In this paper we explore the retrieval performance of a topic-based model that combines topical models with other language models based on cross-entropy. We first define our topical categories and train our topical models on the .GOV2 corpus by building parsimonious language models. We then test the topic-based model on the TREC8 small Web data collection for ad-hoc search. Our experimental results show that the topic-based model outperforms the standard language model and the parsimonious model.

University of Twente Research Information

UvA-DARE

International Migration, Integration and Social Cohesion online publications

Index ordering by query-independent measures

Author: Alan F. Smeaton
Paul Ferguson
Publication venue: Elsevier BV
Publication date: 01/05/2012

Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find the documents with the highest scores. For particularly large collections this may be extremely time consuming. A solution to this problem is to search only a limited portion of the collection at query time, in order to speed up the retrieval process, while also limiting the loss in retrieval efficacy (in terms of accuracy of results). We achieve this by first identifying the most “important” documents within the collection, and then sorting documents within inverted file lists in order of this “importance”. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient, but also limits the loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of the number of postings examined, without significant loss of effectiveness, for several measures of importance used both in isolation and in combination. Our results point to several ways in which the computational cost of searching large collections of documents can be significantly reduced.
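The idea in this abstract can be sketched as follows, under stated assumptions: posting lists are sorted by a query-independent importance score, and at query time only the leading part of each list is examined. The function names, the `fraction` cutoff, and the plain term-frequency scoring are hypothetical simplifications for illustration; they are not the measures or the retrieval model used in the paper.

```python
from collections import defaultdict

def build_index(docs, importance):
    """Build an inverted index whose posting lists are sorted by descending importance."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    for postings in index.values():
        # Most important documents come first in every list.
        postings.sort(key=lambda p: importance[p[0]], reverse=True)
    return dict(index)

def search(index, query, fraction=1.0, k=10):
    """Score documents using only the leading `fraction` of each posting list.

    Returns the top-k (doc_id, score) pairs and the number of postings examined.
    """
    scores = defaultdict(float)
    examined = 0
    for term in query.lower().split():
        postings = index.get(term, [])
        cutoff = max(1, int(len(postings) * fraction)) if postings else 0
        for doc_id, tf in postings[:cutoff]:
            scores[doc_id] += tf  # toy scoring: sum of term frequencies
            examined += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k], examined

# Toy collection with made-up importance scores.
docs = {
    "d1": "terabyte retrieval retrieval",
    "d2": "retrieval index",
    "d3": "retrieval terabyte terabyte",
    "d4": "retrieval",
}
importance = {"d1": 0.9, "d2": 0.1, "d3": 0.5, "d4": 0.2}
index = build_index(docs, importance)
full_results, full_cost = search(index, "terabyte retrieval", fraction=1.0)
half_results, half_cost = search(index, "terabyte retrieval", fraction=0.5)
```

Comparing `full_cost` with `half_cost` shows the saving in postings examined, and comparing the two result lists shows what, if anything, was lost by truncating the lists.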

Crossref

Irish Universities

DCU Online Research Access Service

Parsimonious language models for a terabyte of text

Author: Hiemstra D.
Kamps J.
Kaptein R.
Li R.
Publication venue: National Institute of Standards and Technology (NIST)
Publication date: 01/01/2008

International Migration, Integration and Social Cohesion online publications

Experiments with positive, negative and topical relevance feedback

Author: Hiemstra D.
Kamps J.
Kaptein R.
Li R.
Publication venue: National Institute of Standards and Technology (NIST)
Publication date: 01/01/2008

International Migration, Integration and Social Cohesion online publications

Distributed Information Retrieval using Keyword Auctions

Author: Hiemstra D.
Publication venue: Centre for Telematics and Information Technology, University of Twente
Publication date: 01/01/2008

This report motivates the need for large-scale distributed approaches to information retrieval, and proposes solutions based on keyword auctions

CiteSeerX

Radboud Repository

University of Twente Research Information

EPiK-a Workflow for Electron Tomography in Kepler.

Author: Altintas Ilkay
Chen Ruijuan
Crawl Daniel
Ellisman Mark
Lawrence Albert
Phan Sébastien
Wan Xiaohua
Wang Jianwu
Publication venue: eScholarship, University of California
Publication date: 01/01/2014

Scientific workflows integrate data and computing interfaces as configurable, semi-automatic graphs to solve a scientific problem. Kepler is such a software system for designing, executing, reusing, evolving, archiving and sharing scientific workflows. Electron tomography (ET) enables high-resolution views of complex cellular structures, such as cytoskeletons, organelles, viruses and chromosomes. Imaging investigations produce large datasets. For instance, in ET, the size of a 16-fold image tilt series is about 65 gigabytes, with each projection image including 4096 by 4096 pixels. When we use serial sections or the montage technique for large-field ET, the dataset will be even larger. For higher-resolution images with multiple tilt series, the data size may be in the terabyte range. The demands of mass data processing and complex algorithms require the integration of diverse codes into flexible software structures. This paper describes a workflow for Electron Tomography Programs in Kepler (EPiK). The EPiK workflow embeds the tracking process of IMOD, and realizes the main algorithms, including filtered backprojection (FBP) from TxBR and iterative reconstruction methods. We have tested the three-dimensional (3D) reconstruction process using EPiK on ET data. EPiK can be a potential toolkit for biology researchers, with the advantages of logical viewing, easy handling, convenient sharing and future extensibility.

PubMed Central

eScholarship - University of California