92 research outputs found

    Performance of query processing implementations in ranking-based text retrieval systems using inverted indices

    Get PDF
    Cataloged from PDF version of article.Similarity calculations and document ranking form the computationally expensive parts of query processing in ranking-based text retrieval. In this work, for these calculations, 11 alternative implementation techniques are presented under four different categories, and their asymptotic time and space complexities are investigated. To our knowledge, six of these techniques are not discussed in any other publication before. Furthermore, analytical experiments are carried out on a 30 GB document collection to evaluate the practical performance of different implementations in terms of query processing time and space consumption. Advantages and disadvantages of each technique are illustrated under different querying scenarios, and several experiments that investigate the scalability of the implementations are presented. (C) 2005 Elsevier Ltd. All rights reserved

    Document replication strategies for geographically distributed web search engines

    Get PDF
    Cataloged from PDF version of article.Large-scale web search engines are composed of multiple data centers that are geographically distant to each other. Typically, a user query is processed in a data center that is geographically close to the origin of the query, over a replica of the entire web index. Compared to a centralized, single-center search engine, this architecture offers lower query response times as the network latencies between the users and data centers are reduced. However, it does not scale well with increasing index sizes and query traffic volumes because queries are evaluated on the entire web index, which has to be replicated and maintained in all data centers. As a remedy to this scalability problem, we propose a document replication framework in which documents are selectively replicated on data centers based on regional user interests. Within this framework, we propose three different document replication strategies, each optimizing a different objective: reducing the potential search quality loss, the average query response time, or the total query workload of the search system. For all three strategies, we consider two alternative types of capacity constraints on index sizes of data centers. Moreover, we investigate the performance impact of query forwarding and result caching. We evaluate our strategies via detailed simulations, using a large query log and a document collection obtained from the Yahoo! web search engine. (C) 2012 Elsevier Ltd. All rights reserved

    Improving the performance of independent task assignment heuristics minmin, maxmin and mufferage

    Get PDF
    Cataloged from PDF version of article.MinMin, MaxMin, and Sufferage are constructive heuristics that are widely and successfully used in assigning independent tasks to processors in heterogeneous computing systems. All three heuristics are known to run in O(KN2) time in assigning N tasks to K processors. In this paper, we propose an algorithmic improvement that asymptotically decreases the running time complexity of MinMin to O(KN log N) without affecting its solution quality. Furthermore, we combine the newly proposed MinMin algorithm with MaxMin as well as Sufferage, obtaining two hybrid algorithms. The motivation behind the former hybrid algorithm is to address the drawback of MaxMin in solving problem instances with highly skewed cost distributions while also improving the running time performance of MaxMin. The latter hybrid algorithm improves the running time performance of Sufferage without degrading its solution quality. The proposed algorithms are easy to implement and we illustrate them through detailed pseudocodes. The experimental results over a large number of real-life data sets show that the proposed fast MinMin algorithm and the proposed hybrid algorithms perform significantly better than their traditional counterparts as well as more recent state-of-the-art assignment heuristics. For the large data sets used in the experiments, MinMin, MaxMin, and Sufferage, as well as recent state-of-the-art heuristics, require days, weeks, or even months to produce a solution, whereas all of the proposed algorithms produce solutions within only two or three minutes

    Site-Based Partitioning and Repartitioning Techniques for Parallel PageRank Computation

    Get PDF
    Cataloged from PDF version of article.The PageRank algorithm is an important component in effective web search. At the core of this algorithm are repeated sparse matrix-vector multiplications where the involved web matrices grow in parallel with the growth of the web and are stored in a distributed manner due to space limitations. Hence, the PageRank computation, which is frequently repeated, must be performed in parallel with high-efficiency and low-preprocessing overhead while considering the initial distributed nature of the web matrices. Our contributions in this work are twofold. We first investigate the application of state-of-the-art sparse matrix partitioning models in order to attain high efficiency in parallel PageRank computations with a particular focus on reducing the preprocessing overhead they introduce. For this purpose, we evaluate two different compression schemes on the web matrix using the site information inherently available in links. Second, we consider the more realistic scenario of starting with an initially distributed data and extend our algorithms to cover the repartitioning of such data for efficient PageRank computation. We report performance results using our parallelization of a state-of-the-art PageRank algorithm on two different PC clusters with 40 and 64 processors. Experiments show that the proposed techniques achieve considerably high speedups while incurring a preprocessing overhead of several iterations (for some instances even less than a single iteration) of the underlying sequential PageRank algorithm. © 2011 IEEE

    Second chance: A hybrid approach for dynamic result caching and prefetching in search engines

    Get PDF
    Cataloged from PDF version of article.Web search engines are known to cache the results of previously issued queries. The stored results typically contain the document summaries and some data that is used to construct the final search result page returned to the user. An alternative strategy is to store in the cache only the result document IDs, which take much less space, allowing results of more queries to be cached. These two strategies lead to an interesting trade-off between the hit rate and the average query response latency. In this work, in order to exploit this trade-off, we propose a hybrid result caching strategy where a dynamic result cache is split into two sections: an HTML cache and a docID cache. Moreover, using a realistic cost model, we evaluate the performance of different result prefetching strategies for the proposed hybrid cache and the baseline HTML-only cache. Finally, we propose a machine learning approach to predict singleton queries, which occur only once in the query stream. We show that when the proposed hybrid result caching strategy is coupled with the singleton query predictor, the hit rate is further improved. © 2013 ACM

    A link-based storage scheme for efficient aggregate query processing on clustered road networks

    Get PDF
    Cataloged from PDF version of article.The need to have efficient storage schemes for spatial networks is apparent when the volume of query processing in some road networks (e.g., the navigation systems) is considered. Specifically, under the assumption that the road network is stored in a central server, the adjacent data elements in the network must be clustered on the disk in such a way that the number of disk page accesses is kept minimal during the processing of network queries. In this work, we introduce the link-based storage scheme for clustered road networks and compare it with the previously proposed junction-based storage scheme. in order to investigate the performance of aggregate network queries in clustered road networks, we extend our recently proposed clustering hypergraph model from junction-based storage to link-based storage. We propose techniques for additional storage savings in bidirectional networks that make the link-based storage scheme even more preferable in terms of the storage efficiency. We evaluate the performance of our link-based storage scheme against the junction-based storage scheme both theoretically and empirically. The results of the experiments conducted on a wide range of road network datasets show that the link-based storage scheme is preferable in terms of both storage and query processing efficiency. (C) 2009 Elsevier B.V. All rights reserved

    Cache-based query processing for search engines

    Get PDF
    Cataloged from PDF version of article.In practice, a search engine may fail to serve a query due to various reasons such as hardware/network failures, excessive query load, lack of matching documents, or service contract limitations (e.g., the query rate limits for third-party users of a search service). In this kind of scenarios, where the backend search system is unable to generate answers to queries, approximate answers can be generated by exploiting the previously computed query results available in the result cache of the search engine.In this work, we propose two alternative strategies to implement this cache-based query processing idea. The first strategy aggregates the results of similar queries that are previously cached in order to create synthetic results for new queries. The second strategy forms an inverted index over the textual information (i.e., query terms and result snippets) present in the result cache and uses this index to answer new queries. Both approaches achieve reasonable result qualities compared to processing queries with an inverted index built on the collection. © 2012 ACM

    A Model for Task Repartioning under Data Replication

    Get PDF
    We propose a two-phase model for solving the problem of task repartitioning under data replication with memory constraints. The hypergraph-partitioning-based model proposed for the first phase aims to minimize the total message volume that will be incurred due to the replication/migration of input data while maintaining balance on computational and receive-volume loads of processors. The network-flow-based model proposed for the second phase aims to minimize the maximum message volume handled by processors via utilizing the flexibility in assigning send-communication tasks to processors, which is introduced by data replication. The validity of our proposed model is verified on parallelization of a direct volume rendering algorithm

    Anti-feminist Misogynist Shitposting: The Challenges of Feminist Academics Navigating Toxic Twitter

    Get PDF
    What does it mean to be pedagogical in a post-truth landscape? How might feminist thought and action work to intervene in this environment? Gender in an Era of Post-truth Populism draws together leading feminist scholars of gender and education to explore the current significance of the rise of populist policies and discourses and the challenges it poses to the hard-won battles regarding the rights of women, immigrants, and minorities. Offering the first detailed feminist intervention in this space, the collection explores the significance of populism for feminist pedagogies and practices in relation to gender and education. This exploration has significance for broader and urgent questions of our times regarding knowledge, authority, truth, power and harm and considers the potential for feminist interventions in relation to pedagogies and activisms to speak back and disrupt populist agendas

    Multi-level direct K-way hypergraph partitioning with multiple constraints and fixed vertices

    Get PDF
    K-way hypergraph partitioning has an ever-growing use in parallelization of scientific computing applications. We claim that hypergraph partitioning with multiple constraints and fixed vertices should be implemented using direct K-way refinement, instead of the widely adopted recursive bisection paradigm. Our arguments are based on the fact that recursive-bisection-based partitioning algorithms perform considerably worse when used in the multiple constraint and fixed vertex formulations. We discuss possible reasons for this performance degradation. We describe a careful implementation of a multi-level direct K-way hypergraph partitioning algorithm, which performs better than a well-known recursive-bisection-based partitioning algorithm in hypergraph partitioning with multiple constraints and fixed vertices. We also experimentally show that the proposed algorithm is effective in standard hypergraph partitioning. © 2007 Elsevier Inc. All rights reserved
    corecore