565 research outputs found

    Comprehensive characterization of an open source document search engine

    This work performs a thorough characterization and analysis of the open source Lucene search library. The article describes in detail the architecture, functionality, and micro-architectural behavior of the search engine, and investigates prominent online document search research issues. In particular, we study how intra-server index partitioning affects the response time and throughput, explore the potential use of low power servers for document search, and examine the sources of performance degradation and the causes of tail latencies. Some of our main conclusions are the following: (a) intra-server index partitioning can reduce tail latencies, but with diminishing benefits as incoming query traffic increases, (b) low power servers, given enough partitioning, can provide the same average and tail response times as conventional high performance servers, (c) index search is a CPU-intensive, cache-friendly application, and (d) C-states are the main culprits for performance degradation in document search.
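    Conclusion (a) can be illustrated with a toy latency model (an assumption for illustration, not the paper's methodology): a query against N partitions finishes when the slowest partition does, so partitioning shrinks per-partition work but takes a max over more noisy terms, which is where the diminishing returns come from. A minimal sketch:

        import random
        import statistics

        def query_latency(num_partitions, index_size=1.0, noise=0.3):
            # Partitions are searched in parallel, so query latency is the max
            # over partitions, each scanning roughly index_size / N.
            per_part = index_size / num_partitions
            return max(per_part * (1.0 + random.expovariate(1.0 / noise))
                       for _ in range(num_partitions))

        random.seed(42)
        for n in (1, 2, 4, 8):
            lats = sorted(query_latency(n) for _ in range(10_000))
            p99 = lats[int(0.99 * len(lats))]
            print(f"partitions={n}: mean={statistics.mean(lats):.3f}  p99={p99:.3f}")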

    Second chance: A hybrid approach for dynamic result caching and prefetching in search engines

    Web search engines are known to cache the results of previously issued queries. The stored results typically contain the document summaries and some data that is used to construct the final search result page returned to the user. An alternative strategy is to store in the cache only the result document IDs, which take much less space, allowing the results of more queries to be cached. These two strategies lead to an interesting trade-off between the hit rate and the average query response latency. In this work, in order to exploit this trade-off, we propose a hybrid result caching strategy where a dynamic result cache is split into two sections: an HTML cache and a docID cache. Moreover, using a realistic cost model, we evaluate the performance of different result prefetching strategies for the proposed hybrid cache and the baseline HTML-only cache. Finally, we propose a machine learning approach to predict singleton queries, which occur only once in the query stream. We show that when the proposed hybrid result caching strategy is coupled with the singleton query predictor, the hit rate is further improved. © 2013 ACM
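    A minimal sketch of the two-section cache described above, assuming a hypothetical backend object with search(query) returning document IDs and render(ids) producing a result page; the slot counts and the plain per-section LRU eviction are illustrative, not the paper's tuned configuration:

        from collections import OrderedDict

        class HybridResultCache:
            def __init__(self, html_slots, docid_slots):
                self.html = OrderedDict()   # query -> full result page (large entries)
                self.docid = OrderedDict()  # query -> list of document IDs (small entries)
                self.html_slots = html_slots
                self.docid_slots = docid_slots

            def _put(self, cache, slots, key, value):
                cache[key] = value
                cache.move_to_end(key)
                if len(cache) > slots:
                    cache.popitem(last=False)  # evict the least recently used entry

            def lookup(self, query, backend):
                if query in self.html:         # full hit: page is ready to serve
                    self.html.move_to_end(query)
                    return self.html[query], "html_hit"
                if query in self.docid:        # partial hit: skip query evaluation,
                    self.docid.move_to_end(query)  # only fetch summaries and render
                    page = backend.render(self.docid[query])
                    self._put(self.html, self.html_slots, query, page)
                    return page, "docid_hit"
                ids = backend.search(query)    # miss: full query evaluation
                page = backend.render(ids)
                self._put(self.docid, self.docid_slots, query, ids)
                self._put(self.html, self.html_slots, query, page)
                return page, "miss"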

    Predictive caching and prefetching of query results in search engines

    We study the caching of query result pages in Web search engines. Popular search engines receive millions of queries per day, and efficient policies for caching query results may enable them to lower their response time and reduce their hardware requirements. We present PDC (probability driven cache), a novel scheme tailored for caching search results that is based on a probabilistic model of search engine users. We then use a trace of over seven million queries submitted to the search engine AltaVista to evaluate PDC, as well as traditional LRU- and SLRU-based caching schemes. The trace-driven simulations show that PDC outperforms the other policies. We also examine the prefetching of search results, and demonstrate that prefetching can increase cache hit ratios by 50% for large caches and can double the hit ratios of small caches. When integrating prefetching into PDC, we attain hit ratios of over 0.53.
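    PDC's probabilistic user model is richer than this, but the core idea of evicting the entry judged least likely to be requested again can be sketched with a crude stand-in that uses observed query frequency as the re-ask probability (an assumption for illustration):

        class ProbabilityDrivenCache:
            def __init__(self, capacity):
                self.capacity = capacity
                self.freq = {}   # query -> times seen (crude re-ask probability)
                self.store = {}  # query -> cached results

            def get(self, query):
                self.freq[query] = self.freq.get(query, 0) + 1
                return self.store.get(query)   # None on a miss

            def put(self, query, results):
                if query not in self.store and len(self.store) >= self.capacity:
                    # Evict the cached query judged least likely to recur.
                    victim = min(self.store, key=lambda q: self.freq.get(q, 0))
                    del self.store[victim]
                self.store[query] = results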

    Index ordering by query-independent measures

    Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find those documents with the highest scores. For particularly large collections this may be extremely time consuming. A solution to this problem is to search only a limited portion of the collection at query time, in order to speed up the retrieval process while also limiting the loss in retrieval efficacy (in terms of accuracy of results). The way we achieve this is to first identify the most “important” documents within the collection, and sort documents within inverted file lists in order of this “importance”. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient but also limits the loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of the number of postings examined, without significant loss of effectiveness, based on several measures of importance used in isolation and in combination. Our results point to several ways in which the computational cost of searching large collections of documents can be significantly reduced.
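    A hedged sketch of the ordering idea: sort each inverted list by a query-independent importance score, then scan only a prefix of the list at query time. The importance scores, the scoring function, and the cutoff are placeholder parameters, not the paper's measures:

        def build_ordered_index(postings, importance):
            # postings: term -> iterable of doc IDs; importance: doc ID -> static score.
            return {term: sorted(docs, key=lambda d: importance[d], reverse=True)
                    for term, docs in postings.items()}

        def early_terminated_search(index, term, score_fn, max_postings=1000):
            # Examine at most max_postings entries of the importance-ordered list;
            # documents of lesser importance beyond the cutoff are never scored.
            scored = [(score_fn(doc), doc) for doc in index.get(term, [])[:max_postings]]
            return sorted(scored, reverse=True)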

    High-Performance In-Memory OLTP via Coroutine-to-Transaction

    Data stalls are a major overhead in main-memory database engines due to the use of pointer-rich data structures. Lightweight coroutines ease the implementation of software prefetching to hide data stalls by overlapping computation and asynchronous data prefetching. Prior solutions, however, mainly focused on (1) individual components and operations and (2) intra-transaction batching that requires interface changes, breaking backward compatibility. It was not clear how they apply to a full database engine and how much end-to-end benefit they bring under various workloads. This thesis presents CoroBase, a main-memory database engine that tackles these challenges with a new coroutine-to-transaction paradigm. Coroutine-to-transaction models transactions as coroutines and thus enables inter-transaction batching, avoiding application changes but retaining the benefits of prefetching. We show that on a 48-core server, CoroBase can perform close to 2× better for read-intensive workloads and remain competitive for workloads that inherently do not benefit from software prefetching.
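    The coroutine-to-transaction pattern can be mimicked in miniature with Python generators (CoroBase itself is a C++ engine; this toy only shows the scheduling shape): each transaction suspends at points where a real engine would issue a prefetch, and a round-robin scheduler overlaps the stalls of a batch of transactions.

        from collections import deque

        def transaction(keys, table):
            # Sums a few values; yields wherever a real engine would prefetch.
            total = 0
            for k in keys:
                yield                       # suspend: let another transaction run
                total += table.get(k, 0)    # while the "prefetched" data arrives
            return total

        def run_batch(txns):
            ready, results = deque(enumerate(txns)), {}
            while ready:
                i, t = ready.popleft()
                try:
                    next(t)                 # resume until the next prefetch point
                    ready.append((i, t))
                except StopIteration as done:
                    results[i] = done.value # transaction committed
            return results

        table = {k: k * 10 for k in range(100)}
        print(run_batch([transaction(range(i, i + 5), table) for i in range(4)]))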

    Improving the Performance of SQL Join Operation in the Distributed Enterprise Information System by Caching

    The enterprise information system (EIS) contains databases and other data sources in multiple data centers. Users query the EIS via clients, and each client has a working space in the cloud. Caching data in the client space reduces the total execution time of a query; however, the client space has limited resources for storing data. There are two options for caching data at the client space: caching the final results of query operations, or caching the source data tables. The problem is that some query operations, such as joining multiple big tables, can produce a result too big to fit in the cache. In such cases, caching the source data tables may be the better choice. This paper presents an algorithm that combines active caching and passive caching to improve the cache hit rate, thus improving the performance of SQL join queries in the cloud computing environment.
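    A small sketch of the decision the paper describes, with a placeholder size estimator and cache API (both assumptions, not the paper's algorithm): cache the join's final result when it fits the budget, otherwise fall back to caching the source tables so later joins can be evaluated locally.

        def cache_join(cache, query_key, tables, estimate_result_rows, run_join,
                       max_cached_rows=100_000):
            # Choose between caching the join result and caching its inputs.
            if estimate_result_rows(tables) <= max_cached_rows:
                cache[query_key] = ("result", run_join(tables))  # result fits: cache it
            else:
                for name, rows in tables.items():                # too big: cache the
                    cache[("table", name)] = rows                # source tables instead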

    SECF: Improving SPARQL Querying Performance with Proactive Fetching and Caching

    Querying SPARQL endpoints may be unsatisfactory due to the high latency of connections to the endpoints. Caching is an important way to accelerate query response speed. In this paper, we propose the SPARQL Endpoint Caching Framework (SECF), a client-side caching framework for this purpose. In particular, we prefetch and cache the results of queries similar to recently cached queries, aiming to improve the overall querying performance. The similarity between queries is calculated via an improved Graph Edit Distance (GED) function. We also adapt a smoothing method to implement cache replacement. Empirical evaluations on real-world queries show that our approach has great potential to enhance the cache hit rate and accelerate querying speed on SPARQL endpoints.
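    A very rough sketch of the prefetching step, with a token-set Jaccard similarity standing in for SECF's graph edit distance and a hypothetical endpoint.execute API: when a query is answered, the cache is warmed with the results of its nearest neighbours among previously seen queries.

        def jaccard(q1, q2):
            # Token-set similarity; a crude stand-in for GED over SPARQL graphs.
            a, b = set(q1.split()), set(q2.split())
            return len(a & b) / len(a | b) if a | b else 0.0

        def prefetch_similar(query, history, cache, endpoint, k=3):
            # Warm the cache with the results of the k most similar past queries.
            ranked = sorted(history, key=lambda q: jaccard(query, q), reverse=True)
            for q in ranked[:k]:
                if q not in cache:
                    cache[q] = endpoint.execute(q)  # hypothetical endpoint API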