Methods for Distributed Information Retrieval
Published methods for distributed information retrieval generally rely on cooperation from search servers. But most real servers, particularly the tens of thousands available on the Web, are not engineered for such cooperation. This means that the majority of proposed methods, evaluated in simulated environments of homogeneous cooperating servers, are never applied in practice.

This thesis introduces new methods for server selection and results merging. The methods do not require search servers to cooperate, yet are as effective as the best methods that do. Two large experiments evaluate the new methods against many previously published methods. In contrast to previous experiments, they simulate a Web-like environment in which servers employ varied retrieval algorithms and tend not to sub-partition documents from a single source.
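The abstract does not detail the thesis's own merging method, but one common cooperation-free heuristic for the results-merging step is min-max score normalisation: each server's raw scores are rescaled into [0, 1] so that otherwise incomparable scores can be interleaved. A minimal sketch (function and variable names are illustrative, not from the thesis):

```python
def merge_results(server_results):
    """Merge ranked lists from independent, uncooperative servers.

    server_results: {server_name: [(doc_id, raw_score), ...]}
    Raw scores from different engines are incomparable, so each
    server's scores are min-max normalised into [0, 1] first.
    """
    merged = []
    for server, hits in server_results.items():
        scores = [s for _, s in hits]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero for one-hit lists
        for doc, s in hits:
            merged.append((doc, (s - lo) / span, server))
    # Sort all normalised hits into a single ranked list.
    merged.sort(key=lambda t: t[1], reverse=True)
    return merged
```

Real systems refine this with server-quality weights, since a poor server's top document still normalises to 1.0.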
PLIERS at VLC2
This paper describes experiments done on the VLC2 collection at TREC-7. The methods used for indexing text are described together with the results; these cover the official collection BASE1, plus some larger unofficial collections named BASE2 and BASE4. Search times on these collections are described and discussed with particular emphasis on scale-up, for both weighted term search and passage retrieval. The various experimental configurations are also described.
Parallel computing for passage retrieval
In this paper we examine methods both for speeding up passage processing and for examining more passages using parallel computers. We vary the number of passages processed in order to examine the effect on retrieval effectiveness and efficiency. The particular algorithm we apply has previously been used to good effect in Okapi experiments at TREC. We describe this algorithm and our mechanism for applying parallel computing to speed up the processing.
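The Okapi passage algorithm itself is not reproduced in this abstract, but the general shape of parallel passage processing, splitting a document into overlapping word windows and scoring the windows concurrently, can be sketched as follows (the windowing parameters and the toy term-overlap score are assumptions for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def passages(words, size=4, step=2):
    """Yield overlapping fixed-size word windows (candidate passages)."""
    for i in range(0, len(words), step):
        yield words[i:i + size]

def score(passage, query_terms):
    """Toy passage score: how many words of the passage are query terms."""
    return sum(1 for w in passage if w in query_terms)

def best_passage(text, query, workers=4):
    """Score all candidate passages in parallel and return the best one."""
    words = text.lower().split()
    qt = set(query.lower().split())
    cands = list(passages(words))
    with ThreadPoolExecutor(max_workers=workers) as ex:
        scores = list(ex.map(lambda p: score(p, qt), cands))
    i = max(range(len(scores)), key=scores.__getitem__)
    return cands[i], scores[i]
```

Examining more passages (smaller `step`) raises cost linearly, which is exactly the trade-off parallel hardware is used to absorb.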
Query exhaustivity, relevance feedback and search success in automatic and interactive query expansion
This study explored how users' expression of search facets and relevance feedback related to search success in interactive and automatic query expansion over the course of the search process. Search success was measured both by the number of relevant documents retrieved and by the relevance scores of those items on a four-point scale. The research design consisted of 26 users searching four TREC topics in the Okapi IR system, half using interactive and half automatic query expansion based on relevance feedback. The search logs were recorded, and the users filled in a questionnaire for each topic concerning various features of the search. The results showed that the exhaustivity of the query was the most significant predictor of search success, and that interactive expansion led to better search success than automatic expansion.
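Query expansion from relevance feedback, whether the feedback comes from the user (interactive) or from top-ranked documents (automatic), is classically formalised by Rocchio's reweighting scheme: the query vector is moved toward relevant documents and away from non-relevant ones. A minimal sketch, with the standard default weights as an assumption:

```python
def rocchio(query_vec, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio reformulation over sparse term-weight dicts.

    query_vec, and each doc in rel_docs / nonrel_docs, is {term: weight}.
    """
    terms = (set(query_vec)
             | {t for d in rel_docs for t in d}
             | {t for d in nonrel_docs for t in d})
    new = {}
    for t in terms:
        r = sum(d.get(t, 0.0) for d in rel_docs) / len(rel_docs) if rel_docs else 0.0
        n = sum(d.get(t, 0.0) for d in nonrel_docs) / len(nonrel_docs) if nonrel_docs else 0.0
        w = alpha * query_vec.get(t, 0.0) + beta * r - gamma * n
        if w > 0:          # negative weights are conventionally dropped
            new[t] = w
    return new
```

In the interactive condition the relevant set comes from explicit user judgements; in the automatic condition it is simply the top of the ranking.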
Towards More Effective Techniques for Automatic Query Expansion
Techniques for automatic query expansion from top retrieved documents have recently shown promise for improving retrieval effectiveness on large collections, but there is still a lack of systematic evaluation and comparative studies. In this paper we focus on term-scoring methods based on the differences between the distribution of terms in (pseudo-)relevant documents and the distribution of terms in all documents, seen as a complement or an alternative to more conventional techniques. We show that when such distributional methods are used to select expansion terms within Rocchio's classical reweighting scheme, the overall performance is not likely to improve. However, we also show that when the same distributional methods are used to both select and weight expansion terms, the retrieval effectiveness may considerably improve. We then argue, based on their variation in performance on individual queries, that the sets of ranked terms suggested by individual distributional methods can be combined to further improve mean performance, by analogy with ensembling classifiers, and present experimental evidence supporting this view. Taken together, our experiments show that with automatic query expansion it is possible to achieve performance gains as high as 21.34% over the non-expanded query (for non-interpolated average precision). We also discuss the effect that the main parameters involved in automatic query expansion, such as query difficulty, the number of selected documents, and the number of selected terms, have on retrieval effectiveness.
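One widely used distributional term-scoring method of the kind described here (the paper evaluates several; this is an illustrative instance, not necessarily the authors' exact formulation) scores a candidate term by its contribution to the Kullback-Leibler divergence between the pseudo-relevant and whole-collection term distributions:

```python
import math

def kld_term_scores(rel_counts, coll_counts):
    """Rank candidate expansion terms by p_R(t) * log(p_R(t) / p_C(t)),
    i.e. by their contribution to the KL divergence between the
    pseudo-relevant term distribution and the collection distribution.

    rel_counts:  {term: frequency in the top-retrieved documents}
    coll_counts: {term: frequency in the whole collection}
    """
    rel_total = sum(rel_counts.values())
    coll_total = sum(coll_counts.values())
    scores = {}
    for t, c in rel_counts.items():
        p_r = c / rel_total
        p_c = coll_counts.get(t, 1) / coll_total  # crude smoothing for unseen terms
        scores[t] = p_r * math.log(p_r / p_c)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Terms that are common in the top documents but rare in the collection float to the top, while stopword-like terms score near or below zero; the scores can then be used directly as expansion-term weights, which is the "select and weight" usage the paper finds most effective.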
Parallel methods for the generation of partitioned inverted files
Purpose
– The generation of inverted indexes is one of the most computationally intensive activities for information retrieval systems: indexing large multi‐gigabyte text databases can take many hours or even days to complete. We examine the generation of partitioned inverted files in order to speed up the process of indexing. Two types of index partitions are investigated: TermId and DocId.
Design/methodology/approach
– We use standard parallel-computing measures, such as speedup and efficiency, to examine the computing results, as well as the space costs of our trial indexing experiments.
Findings
– The results from runs on both partitioning methods are compared and contrasted, concluding that DocId is the more efficient method.
Practical implications
– The practical implications are that the DocId partitioning method would in most circumstances be used for distributing inverted file data in a parallel computer, particularly if indexing speed is the primary consideration.
Originality/value
– The paper is of value to database administrators who manage large-scale text collections, and who need to use parallel computing to implement their text retrieval services.
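The two partitioning schemes compared above can be sketched concretely. Under DocId partitioning each node indexes a disjoint subset of documents (so every query term touches every node); under TermId partitioning the global index is split by term (so each query term touches exactly one node). A minimal sketch with hypothetical helper names:

```python
def build_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted list of doc ids}."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    for postings in index.values():
        postings.sort()
    return index

def docid_partition(docs, n):
    """DocId: each node gets a disjoint document subset and indexes it
    locally, so postings for one term are spread over all partitions."""
    parts = [{} for _ in range(n)]
    for doc_id, text in docs.items():
        parts[doc_id % n][doc_id] = text
    return [build_index(p) for p in parts]

def termid_partition(docs, n):
    """TermId: the global index is split by term, so each term's full
    postings list lives on exactly one partition."""
    index = build_index(docs)
    parts = [{} for _ in range(n)]
    for term, postings in index.items():
        parts[hash(term) % n][term] = postings
    return parts
```

DocId indexing parallelises trivially (each node sees only its own documents), whereas TermId requires a global view of each term's postings, which is one intuition behind the finding that DocId is the more efficient method to generate.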
Parallel methods for the update of partitioned inverted files
Purpose – An issue that tends to be ignored in information retrieval is the updating of inverted files. This is largely because inverted files were devised to provide fast query service, and much work has been done with the emphasis strongly on queries. In this paper we study the effect of using parallel methods for the update of inverted files in order to reduce costs, looking at two types of partitioning for inverted files: document identifier and term identifier.
Design/methodology/approach – Raw update service and update with query service are studied with these partitioning schemes using an incremental update strategy. We use standard measures used in parallel computing such as speedup to examine the computing results and also the costs of reorganising indexes while servicing transactions.
Findings – Empirical results show that for both transaction processing and index reorganisation the document identifier method is superior. However, there is evidence that the term identifier partitioning method could be useful in a concurrent transaction processing context.
Practical implications – Servicing updates is an increasingly common requirement of inverted files (for dynamic collections such as the Web), demonstrating that the requirements of inverted file maintenance have shifted from those of the past.
Originality/value – The paper is of value to database administrators who manage large-scale and dynamic text collections, and who need to use parallel computing to implement their text retrieval services.
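The incremental update strategy mentioned in the methodology is commonly realised as a small in-memory delta index that absorbs new documents and is periodically merged into the main index; queries consult both so updates are visible immediately. A single-node sketch of that strategy (the class and threshold are illustrative assumptions; the paper applies this per partition, in parallel):

```python
class IncrementalIndex:
    """Incremental inverted-file update: new postings accumulate in a
    small in-memory delta index, merged into the main index in one
    pass once the delta grows past a threshold."""

    def __init__(self, merge_threshold=2):
        self.main, self.delta = {}, {}
        self.merge_threshold = merge_threshold

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            self.delta.setdefault(term, []).append(doc_id)
        if len(self.delta) >= self.merge_threshold:
            self._merge()

    def _merge(self):
        # Index reorganisation: fold the delta into the main index.
        for term, postings in self.delta.items():
            self.main.setdefault(term, []).extend(postings)
            self.main[term].sort()
        self.delta = {}

    def postings(self, term):
        # Queries consult both indexes, so updates are visible at once.
        return sorted(self.main.get(term, []) + self.delta.get(term, []))
```

Under DocId partitioning each node runs such an updater over its own documents independently, which is consistent with the finding that the document identifier method is superior for transaction processing and reorganisation.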
A parallel architecture for query processing over a terabyte of text
The Parallel Document Retrieval Engine (PADRE) has previously demonstrated that full text scanning methods supported by parallel hardware permit powerful query constructors and rapid response to changing document collections. Extensions to PADRE have been designed and implemented which make use of parallel secondary storage to allow each processing node to handle data up to 32 times the size of its primary memory. Using the largest purchasable machine on which PADRE currently runs, these increase the maximum possible collection size to one terabyte. This paper addresses the practicality of achieving this limit and the extent to which the performance, responsiveness, functionality and scalability of the full text scanning PADRE are preserved in the extended version
Noise-tolerance feasibility for restricted-domain Information Retrieval systems
Information Retrieval systems normally have to work with rather heterogeneous sources, such as Web sites or documents from Optical Character Recognition tools. The correct conversion of these sources into flat text files is not a trivial task, since noise may easily be introduced as a result of spelling or typesetting errors. Interestingly, this is not a great drawback when the size of the corpus is sufficiently large, since redundancy helps to overcome noise problems. However, noise becomes a serious problem in restricted-domain Information Retrieval, especially when the corpus is small and has little or no redundancy. This paper devises an approach which adds noise-tolerance to Information Retrieval systems. A set of experiments carried out in the agricultural domain demonstrates the effectiveness of the approach presented.
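The paper's own mechanism is not given in the abstract; a standard building block for this kind of noise tolerance is approximate term matching under a small edit-distance budget, so that a mis-OCRed query or index term can still be mapped onto known vocabulary. A minimal sketch (the lookup function and its parameters are illustrative assumptions):

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence,
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def noisy_lookup(term, vocabulary, max_dist=1):
    """Map a possibly corrupted term onto known index terms within
    a small edit-distance budget."""
    return [v for v in vocabulary if edit_distance(term, v) <= max_dist]
```

In a large corpus, redundancy makes such corrections less critical, which matches the abstract's observation that noise mostly hurts small, low-redundancy restricted-domain collections.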
Interrelation of the Term Extraction and Query Expansion techniques applied to the retrieval of textual documents
Thesis (doctorate) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-graduação em Engenharia e Gestão do Conhecimento. According to Sighal (2006), people recognise the importance of storing and searching for information, and with the advent of computers it became possible to store large quantities of it in databases. Consequently, cataloguing the information in these databases became indispensable. In this context, the field of Information Retrieval emerged in the 1950s with the aim of promoting the construction of computational tools that allow users to make more efficient use of these databases. The main objective of the present research is to develop a computational model that enables the retrieval of textual documents ranked by semantic similarity, based on the intersection of the Term Extraction and Query Expansion techniques.