1,346 research outputs found
Effect of inverted index partitioning schemes on performance of query processing in parallel text retrieval systems
Shared-nothing, parallel text retrieval systems require an inverted index, representing a document collection, to be partitioned among a number of processors. In general, the index can be partitioned based on either the terms or the documents in the collection, and the way the partitioning is done greatly affects the query processing performance of the parallel system. In this work, we investigate the effect of these two index partitioning schemes on query processing. We conduct experiments on a 32-node PC cluster, considering the case where the index is stored entirely on disk. Performance results are reported for a large (30 GB) document collection using an MPI-based parallel query processing implementation. © Springer-Verlag Berlin Heidelberg 2006
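To make the two partitioning schemes in this abstract concrete, here is a minimal sketch in Python. It is not the paper's MPI implementation: the toy collection, the round-robin assignment rules, and the function names (`build_index`, `partition_by_document`, `partition_by_term`) are all assumptions chosen for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Build a plain inverted index: term -> sorted list of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def partition_by_document(docs, num_nodes):
    """Document-based: each node indexes only its own subset of documents.
    Assumption: documents are assigned round-robin by document id."""
    subsets = [dict() for _ in range(num_nodes)]
    for doc_id, text in docs.items():
        subsets[doc_id % num_nodes][doc_id] = text
    return [build_index(subset) for subset in subsets]

def partition_by_term(docs, num_nodes):
    """Term-based: each node holds the complete posting list of some terms.
    Assumption: terms are assigned round-robin in alphabetical order."""
    full = build_index(docs)
    shards = [dict() for _ in range(num_nodes)]
    for i, term in enumerate(sorted(full)):
        shards[i % num_nodes][term] = full[term]
    return shards

# Hypothetical four-document collection.
docs = {
    0: "parallel text retrieval",
    1: "inverted index partitioning",
    2: "parallel index",
    3: "text index",
}
doc_shards = partition_by_document(docs, 2)
term_shards = partition_by_term(docs, 2)
```

Under document-based partitioning a query term's postings are spread over the nodes, so every node may have to take part in answering a query; under term-based partitioning each term's full posting list lives on exactly one node, so only the nodes owning the query terms are contacted.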
Performance comparison of clustered and replicated information retrieval systems
The amount of information available over the Internet is increasing daily, as are the importance and scale of Web search engines. Systems based on a single centralised index present several problems (such as lack of scalability), which lead to the use of distributed information retrieval systems to effectively search for and locate the required information. A distributed retrieval system can be clustered and/or replicated. In this paper, using simulations, we present a detailed performance analysis, both in terms of throughput and response time, of a clustered system compared to a replicated system. In addition, we consider the effect of changes in the query topics over time. We show that the performance obtained for a clustered system does not improve on the performance obtained by the best replicated system. Indeed, the main advantage of a clustered system is the reduction of network traffic. However, the use of a switched network eliminates the bottleneck in the network, markedly improving the performance of the replicated systems. Moreover, we illustrate the negative performance effect of the changes over time in the query topics when a distributed clustered system is used. In contrast, the performance of a distributed replicated system is query independent.
Parallel computing in information retrieval - An updated review
The progress of parallel computing in Information Retrieval (IR) is reviewed. In particular we stress the importance of the motivation in using parallel computing for Text Retrieval. We analyse parallel IR systems using a classification due to Rasmussen [1] and describe some parallel IR systems. We give a description of the retrieval models used in parallel Information Processing. We describe areas of research which we believe are needed.
Models Performance Issues in Parallel Computing for Information Retrieval
A parallel framework for in-memory construction of term-partitioned inverted indexes
With the advances in cloud computing and huge RAMs provided by 64-bit architectures, it is possible to tackle large problems using memory-based solutions. Construction of term-based, partitioned, parallel inverted indexes is a communication intensive task and suitable for memory-based modeling. In this paper, we provide an efficient parallel framework for in-memory construction of term-based partitioned, inverted indexes. We show that, by utilizing an efficient bucketing scheme, we can eliminate the need for the generation of a global vocabulary. We propose and investigate assignment schemes that can reduce the communication overheads while minimizing the storage and final query processing imbalance. We also present a study on how communication among processors should be carried out with limited communication memory in order to reduce the total inversion time. We present several different communication-memory organizations and discuss their advantages and shortcomings. The conducted experiments indicate promising results. © 2012 The Author. Published by Oxford University Press on behalf of The British Computer Society
Toward Entity-Aware Search
As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. In my Ph.D. study, we focus on a novel type of Web search that is aware of data entities inside pages, a significant departure from traditional document retrieval. We study the various essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities, by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking. We also report a prototype system built to show the initial promise of the proposal. Then, we aim at distilling and abstracting the essential computation requirements of entity search. From the dual views of reasoning (entity as input and entity as output), we propose a dual-inversion framework, with two indexing and partition schemes, towards efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query log data. The results we have obtained so far show the clear promise of entity-aware search in its usefulness, effectiveness, efficiency and scalability.
SparkIR: a Scalable Distributed Information Retrieval Engine over Spark
Search engines have to deal with a huge amount of data (e.g., billions of documents in the case of the Web) and find scalable and efficient ways to produce effective search results. In this thesis, we propose to use the Spark framework, an in-memory distributed big data processing framework, and leverage its powerful capabilities for handling large amounts of data to build an efficient and scalable experimental search engine over textual documents. The proposed system, SparkIR, can serve as a research framework for conducting information retrieval (IR) experiments. SparkIR supports two indexing schemes, document-based partitioning and term-based partitioning, to accommodate document-at-a-time (DAAT) and term-at-a-time (TAAT) query evaluation methods. Moreover, it offers static and dynamic pruning to improve retrieval efficiency. For static pruning, it employs champion lists and tiering, while for dynamic pruning, it uses MaxScore top-k retrieval. We evaluated the performance of SparkIR using the ClueWeb12-B13 collection, which contains about 50M English Web pages. Experiments over different subsets of the collection, compared against an Elasticsearch baseline, show that SparkIR exhibits reasonable efficiency and scalability overall for both indexing and retrieval. Implemented as an open-source library over Spark, SparkIR also lets its users benefit from other Spark libraries (e.g., MLlib and GraphX), which, therefore, eliminates the need of usin
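The static pruning mentioned in this abstract can be illustrated with champion lists: for each term, only the r highest-weighted documents are kept, and a query is scored against the union of those short lists. The sketch below is plain Python, not SparkIR or Spark code; using raw term frequency as the weight, the tie-breaking rule, and the names `build_champion_lists` and `candidate_docs` are assumptions made for illustration.

```python
from collections import Counter, defaultdict
import heapq

def build_champion_lists(docs, r):
    """For each term, keep the ids of the r documents with the highest weight.
    Assumption: weight is raw term frequency; ties go to the lower doc id."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            postings[term].append((tf, doc_id))
    champions = {}
    for term, plist in postings.items():
        top = heapq.nlargest(r, plist, key=lambda p: (p[0], -p[1]))
        champions[term] = [doc_id for _, doc_id in top]
    return champions

def candidate_docs(champions, query_terms):
    """Only documents in some query term's champion list are scored."""
    found = set()
    for term in query_terms:
        found.update(champions.get(term.lower(), []))
    return sorted(found)

# Hypothetical toy collection.
docs = {
    0: "spark spark spark index",
    1: "spark index index",
    2: "spark spark retrieval",
    3: "index",
}
champions = build_champion_lists(docs, r=2)
```

The trade-off is that a document outside every champion list can never be returned, so r has to be chosen well above the number of results requested.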
Signature Files: An Integrated Access Method for Formatted and Unformatted Databases
The signature file approach is one of the most powerful information storage and retrieval techniques which is used for finding the data objects that are relevant to the user queries. The main idea of all signature based schemes is to reflect the essence of the data items into bit patterns (descriptors or signatures) and store them in a separate file which acts as a filter to eliminate the non-qualifying data items for an information request. It provides an integrated access method for both formatted and unformatted databases. A comparative overview and discussion of the proposed signature generation methods and the major signature file organization schemes are presented. Applications of the signature techniques to formatted and unformatted databases, single and multiterm query cases, serial and parallel architectures, static and dynamic environments are provided with a special emphasis on the multimedia databases where the pioneering prototype systems using signatures yield highly encouraging results.
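The filtering idea described in this abstract can be sketched with superimposed coding, one common way of generating signatures. This is an illustrative sketch only, not a method taken from the paper: the signature width, the number of bits set per word, the SHA-256-based hashing, and the names `word_signature`, `record_signature`, and `search` are all assumptions.

```python
import hashlib

SIG_BITS = 64        # assumed signature width in bits
BITS_PER_WORD = 3    # assumed number of bit positions set per word

def word_signature(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set."""
    sig = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{i}:{word.lower()}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)
    return sig

def record_signature(text):
    """Superimpose (bitwise OR) the signatures of all words in a record."""
    sig = 0
    for word in text.split():
        sig |= word_signature(word)
    return sig

def search(records, signatures, query_words):
    """Filter by signature first, then verify the surviving candidates."""
    query_sig = 0
    for word in query_words:
        query_sig |= word_signature(word)
    # A record can only qualify if every query bit is set in its signature.
    candidates = [i for i, s in enumerate(signatures)
                  if (s & query_sig) == query_sig]
    # Signatures can produce false positives, so check the actual text.
    wanted = [w.lower() for w in query_words]
    return [i for i in candidates
            if all(w in records[i].lower().split() for w in wanted)]

# Hypothetical records; the signatures would normally live in a separate file.
records = [
    "signature file access method",
    "inverted index update",
    "parallel signature file",
]
signatures = [record_signature(r) for r in records]
```

The signature test never rejects a record that truly contains all query words, which is why it is safe to use as a filter; the verification pass only removes the occasional false match.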
Parallel methods for the update of partitioned inverted files
Purpose – An issue which tends to be ignored in information retrieval is the issue of updating inverted files. This is largely because inverted files were devised to provide fast query service, and much work has been done with the emphasis strongly on queries. In this paper we study the effect of using parallel methods for the update of inverted files in order to reduce costs, by looking at two types of partitioning for inverted files: document identifier and term identifier.
Design/methodology/approach – Raw update service and update with query service are studied with these partitioning schemes using an incremental update strategy. We use standard measures used in parallel computing such as speedup to examine the computing results and also the costs of reorganising indexes while servicing transactions.
Findings – Empirical results show that for both transaction processing and index reorganisation the document identifier method is superior. However, there is evidence that the term identifier partitioning method could be useful in a concurrent transaction processing context.
Practical implications – Servicing updates is increasingly becoming a requirement of inverted files (for dynamic collections such as the Web), which shows that the requirements of inverted file maintenance have shifted from those of the past.
Originality/value – The paper is of value to database administrators who manage large-scale and dynamic text collections, and who need to use parallel computing to implement their text retrieval services
- …