4,373 research outputs found

    Comprehensive characterization of an open source document search engine

    This work performs a thorough characterization and analysis of the open source Lucene search library. The article describes in detail the architecture, functionality, and micro-architectural behavior of the search engine, and investigates prominent research issues in online document search. In particular, we study how intra-server index partitioning affects response time and throughput, explore the potential use of low power servers for document search, and examine the sources of performance degradation and the causes of tail latencies. Some of our main conclusions are the following: (a) intra-server index partitioning can reduce tail latencies, but with diminishing benefits as incoming query traffic increases; (b) low power servers, given enough partitioning, can provide the same average and tail response times as conventional high performance servers; (c) index search is a CPU-intensive, cache-friendly application; and (d) C-states are the main culprits for performance degradation in document search.
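    As a rough sketch of how intra-server index partitioning can be exercised with Lucene, the snippet below opens several independent index partitions, merges them into one logical view with MultiReader, and hands an executor to IndexSearcher so the partitions can be searched concurrently. The partition count, directory names, and field name are hypothetical; this is a minimal illustration of the mechanism using standard Lucene APIs, not the paper's experimental setup.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class PartitionedSearch {
    public static void main(String[] args) throws Exception {
        int partitions = 4; // hypothetical partition count
        List<IndexReader> readers = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            // Each partition lives in its own on-disk index, e.g. index-0 .. index-3.
            readers.add(DirectoryReader.open(FSDirectory.open(Paths.get("index-" + i))));
        }
        // One logical view over all partitions.
        MultiReader reader = new MultiReader(readers.toArray(new IndexReader[0]));

        // Passing an executor lets Lucene search index slices concurrently:
        // the intra-server parallelism the abstract refers to.
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        IndexSearcher searcher = new IndexSearcher(reader, pool);

        Query q = new QueryParser("body", new StandardAnalyzer()).parse("open source search");
        TopDocs top = searcher.search(q, 10);
        System.out.println("hits: " + top.totalHits);

        pool.shutdown();
        reader.close(); // also closes the sub-readers
    }
}
```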

    Cache-based query processing for search engines

    In practice, a search engine may fail to serve a query for various reasons, such as hardware/network failures, excessive query load, lack of matching documents, or service contract limitations (e.g., the query rate limits for third-party users of a search service). In such scenarios, where the backend search system is unable to generate answers to queries, approximate answers can be generated by exploiting the previously computed query results available in the result cache of the search engine. In this work, we propose two alternative strategies to implement this cache-based query processing idea. The first strategy aggregates the results of similar queries that were previously cached in order to create synthetic results for new queries. The second strategy forms an inverted index over the textual information (i.e., query terms and result snippets) present in the result cache and uses this index to answer new queries. Both approaches achieve reasonable result quality compared to processing queries with an inverted index built on the collection.
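    As a toy illustration of the second strategy, the sketch below builds an in-memory inverted index over cached query terms and snippets, then scores cached entries by query-term overlap to produce approximate answers. All class and method names are illustrative assumptions; the paper's actual indexing and scoring are more elaborate.

```java
import java.util.*;
import java.util.stream.Collectors;

// Toy version of the second strategy: index the textual content of the
// result cache (query terms and snippets) and use it to answer queries
// the backend cannot serve. Names are illustrative, not from the paper.
public class ResultCacheIndex {
    private final List<String> cachedSnippets = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void addCachedResult(String queryTerms, String snippet) {
        int id = cachedSnippets.size();
        cachedSnippets.add(snippet);
        for (String term : (queryTerms + " " + snippet).toLowerCase().split("\\W+")) {
            postings.computeIfAbsent(term, t -> new HashSet<>()).add(id);
        }
    }

    // Score cached entries by how many query terms they contain and
    // return the best ones as an approximate answer.
    public List<String> approximateAnswer(String query, int k) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String term : query.toLowerCase().split("\\W+")) {
            for (int id : postings.getOrDefault(term, Collections.emptySet())) {
                scores.merge(id, 1, Integer::sum);
            }
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<Integer, Integer>comparingByValue().reversed())
                .limit(k)
                .map(e -> cachedSnippets.get(e.getKey()))
                .collect(Collectors.toList());
    }
}
```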

    Document replication strategies for geographically distributed web search engines

    Large-scale web search engines are composed of multiple data centers that are geographically distant from each other. Typically, a user query is processed in a data center that is geographically close to the origin of the query, over a replica of the entire web index. Compared to a centralized, single-center search engine, this architecture offers lower query response times, as the network latencies between the users and data centers are reduced. However, it does not scale well with increasing index sizes and query traffic volumes, because queries are evaluated on the entire web index, which has to be replicated and maintained in all data centers. As a remedy to this scalability problem, we propose a document replication framework in which documents are selectively replicated on data centers based on regional user interests. Within this framework, we propose three different document replication strategies, each optimizing a different objective: reducing the potential search quality loss, the average query response time, or the total query workload of the search system. For all three strategies, we consider two alternative types of capacity constraints on the index sizes of data centers. Moreover, we investigate the performance impact of query forwarding and result caching. We evaluate our strategies via detailed simulations, using a large query log and a document collection obtained from the Yahoo! web search engine.
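    A minimal sketch of the capacity-constrained flavor of this idea: greedily replicate, at a given data center, the documents with the highest regional access counts until an index-size budget is exhausted. This is an assumed illustrative baseline, not one of the paper's three strategies.

```java
import java.util.*;

// Illustrative greedy selector: pick the most regionally popular documents
// for replication at one data center, subject to an index capacity budget.
public class RegionalReplication {
    public static Set<String> selectReplicas(Map<String, Long> regionalAccessCount,
                                             Map<String, Long> docSizeBytes,
                                             long capacityBytes) {
        List<String> docs = new ArrayList<>(regionalAccessCount.keySet());
        // Most regionally popular documents first.
        docs.sort(Comparator.comparingLong(regionalAccessCount::get).reversed());

        Set<String> replicas = new LinkedHashSet<>();
        long used = 0;
        for (String doc : docs) {
            long size = docSizeBytes.getOrDefault(doc, 0L);
            if (used + size <= capacityBytes) {
                replicas.add(doc);
                used += size;
            }
        }
        return replicas;
    }
}
```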

    Distributed search based on self-indexed compressed text

    Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated across a set of distributed-memory processors. This paper proposes an alternative query processing method for this setting, based on a combination of self-indexed compressed text and posting list caching. We show that a text self-index (i.e., an index that compresses the text and is able to extract arbitrary parts of it) can be competitive with an inverted index if we consider the whole query process, which includes index decompression, ranking, and snippet extraction time. The advantage is that, within the space of the compressed document collection, one can carry out posting list generation, document ranking, and snippet extraction. This significantly reduces the total number of processors involved in the solution of queries. Alternatively, for the same amount of hardware, the performance of the proposed strategy is better than that of the classical approach, which treats inverted indexes and the corresponding documents as two separate entities in terms of processors and memory space.
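    The sketch below captures the self-index abstraction the paper builds on: a single compressed structure that both answers pattern-occurrence queries (posting generation) and extracts arbitrary substrings (snippet generation), so no separate document store is needed. The interface and method names are illustrative assumptions, standing in for real self-indexes such as FM-indexes or compressed suffix arrays.

```java
// Sketch of a text self-index: search and text reproduction come from
// the same compressed in-memory object. Names are illustrative.
public interface TextSelfIndex {
    // Positions in the text where the pattern occurs (posting generation).
    long[] locate(String pattern);

    // Arbitrary substring extraction from the compressed text
    // (used here for snippets, with no separate document store).
    String extract(long from, long to);
}

class SnippetingExample {
    // Build a snippet around the first occurrence of a query term.
    static String firstSnippet(TextSelfIndex index, String term, int window) {
        long[] occ = index.locate(term);
        if (occ.length == 0) return "";
        long start = Math.max(0, occ[0] - window);
        return index.extract(start, occ[0] + term.length() + window);
    }
}
```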

    Is Google the next Microsoft? Competition, Welfare and Regulation in Internet Search

    Internet search (or perhaps more accurately 'web search') has grown exponentially over the last decade, at an even more rapid rate than the Internet itself. Starting from nothing in the 1990s, today search is a multi-billion dollar business. Search engine providers such as Google and Yahoo! have become household names, and the use of a search engine, like use of the Web, is now a part of everyday life. The rapid growth of online search and its growing centrality to the ecology of the Internet raise a variety of questions for economists to answer. Why is the search engine market so concentrated, and will it evolve towards monopoly? What are the implications of this concentration for different 'participants' (consumers, search engines, advertisers)? Does the fact that search engines act as 'information gatekeepers', determining, in effect, what can be found on the web, mean that search deserves particularly close attention from policy-makers? This paper supplies empirical and theoretical material with which to examine many of these questions. In particular, we (a) show that the already large levels of concentration are likely to continue; (b) identify the consequences, negative and positive, of this outcome; and (c) discuss the possible regulatory interventions that policy-makers could utilize to address these concerns.

    Algorithms and Speech

    One of the central questions in free speech jurisprudence is what activities the First Amendment encompasses. This Article considers that question in the context of an area of increasing importance: algorithm-based decisions. I begin by looking to broadly accepted legal sources, which for the First Amendment means primarily Supreme Court jurisprudence. That jurisprudence provides for very broad First Amendment coverage, and the Court has reinforced that breadth in recent cases. Under the Court's jurisprudence, the First Amendment (and the heightened scrutiny it entails) would apply to many algorithm-based decisions, specifically those entailing substantive communications. We could of course adopt a limiting conception of the First Amendment, but any nonarbitrary exclusion of algorithm-based decisions would require major changes in the Court's jurisprudence. I believe that First Amendment coverage of algorithm-based decisions is too small a step to justify such changes. But insofar as we are concerned about the expansiveness of First Amendment coverage, we may want to limit it in two areas of genuine uncertainty: editorial decisions that are neither obvious nor communicated to the reader, and laws that single out speakers but do not regulate their speech. Even with those limitations, however, an enormous and growing amount of activity will be subject to heightened scrutiny absent a fundamental reorientation of First Amendment jurisprudence.
