1,425 research outputs found
Comprehensive characterization of an open source document search engine
This work performs a thorough characterization and analysis of the open source Lucene search library. The article describes in detail the architecture, functionality, and micro-architectural behavior of the search engine, and investigates prominent online document search research issues. In particular, we study how intra-server index partitioning affects the response time and throughput, explore the potential use of low power servers for document search, and examine the sources of performance degradation ands the causes of tail latencies. Some of our main conclusions are the following: (a) intra-server index partitioning can reduce tail latencies but with diminishing benefits as incoming query traffic increases, (b) low power servers given enough partitioning can provide same average and tail response times as conventional high performance servers, (c) index search is a CPU-intensive cache-friendly application, and (d) C-states are the main culprits for performance degradation in document search.Web of Science162art. no. 1
Recommended from our members
Parallel methods for the generation of partitioned inverted files
Purpose
– The generation of inverted indexes is one of the most computationally intensive activities for information retrieval systems: indexing large multi‐gigabyte text databases can take many hours or even days to complete. We examine the generation of partitioned inverted files in order to speed up the process of indexing. Two types of index partitions are investigated: TermId and DocId.
Design/methodology/approach
– We use standard measures used in parallel computing such as speedup and efficiency to examine the computing results and also the space costs of our trial indexing experiments.
Findings
– The results from runs on both partitioning methods are compared and contrasted, concluding that DocId is the more efficient method.
Practical implications
– The practical implications are that the DocId partitioning method would in most circumstances be used for distributing inverted file data in a parallel computer, particularly if indexing speed is the primary consideration.
Originality/value
– The paper is of value to database administrators who manage large‐scale text collections, and who need to use parallel computing to implement their text retrieval services
Recommended from our members
Parallel computing in information retrieval - An updated review
The progress of parallel computing in Information Retrieval (IR) is reviewed. In particular we stress the importance of the motivation in using parallel computing for Text Retrieval. We analyse parallel IR systems using a classification due to Rasmussen [1] and describe some parallel IR systems. We give a description of the retrieval models used in parallel Information Processing.. We describe areas of research which we believe are needed
Recommended from our members
PLIERS: A Parallel Information Retrieval System Using MPI
The use of MPI in implementing algorithms for Parallel Information Retrieval Systems is outlined. We include descriptions on methods for Indexing, Search and Update of Inverted Indexes as well as a method for Information Filtering. In Indexing we describe both local build and distributed build methods. Our description of Document Search includes that for Term Weighting, Boolean, Proximity and Passage Retrieval Operations. Document Update issues are centred on how partitioning methods are supported. We describe the implementation of term selection algorithms for Information Filtering and finally work in progress is outlined
Recommended from our members
PLIERS at VLC2
This paper describes experiments done on the VLC2 collection at TREC-7. Methods used for indexing text is described together with the results: this includes the official collections BASE1, plus some larger unofficial collections named BASE2 and BASE4. Search times on these collections are described and discussed with a particular emphasis on scaleup: for both weighted term search and passage retrieval. The various configurations for experiments are described
Vertical framing of superimposed signature files using partial evaluation of queries
Cataloged from PDF version of article.A new signature file method, Multi-Frame Signature File (MFSF), is introduced by extending the bit-sliced signature file method. In MFSF a signature file is divided into variable sized vertical frames with different on-bit densities to optimize the response time using a partial query evaluation methodology. In query evaluation the on-bits of the lower on-bit density frames are used first. As the number of query terms increases, the number of query signature on-bits in the lower on-bit density frames increases and the query stopping condition is reached in fewer evaluation steps. Therefore, in MFSF, the query evaluation time decreases for increasing numbers of query terms. Under the sequentiality assumption of disk blocks, in a PC environment with 30 ms average disk seek time, MFSF provides a projected worst-case response time of 3.54 seconds for a database size of one million records in a uniform distribution multi-term query environment with 1-5 terms per query. Due to partial evaluation, this desired response time is guaranteed for queries with several terms. The comparison of MFSF with the inverted file approach shows that MFSF provides promising research opportunities. (C) 1997 Elsevier Science Ltd
Peer to Peer Information Retrieval: An Overview
Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these have seen widespread real- world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralised solutions. In this paper we provide an overview of the key challenges for peer-to-peer information retrieval and the work done so far. We want to stimulate and inspire further research to overcome these challenges. This will open the door to the development and large-scale deployment of real-world peer-to-peer information retrieval systems that rival existing centralised client-server solutions in terms of scalability, performance, user satisfaction and freedom
- …