28 research outputs found

    Predictive caching and prefetching of query results in search engines

    We study the caching of query result pages in Web search engines. Popular search engines receive millions of queries per day, and efficient policies for caching query results may enable them to lower their response time and reduce their hardware requirements. We present PDC (probability driven cache), a novel scheme tailored for caching search results, that is based on a probabilistic model of search engine users. We then use a trace of over seven million queries submitted to the search engine AltaVista to evaluate PDC, as well as traditional LRU and SLRU based caching schemes. The trace-driven simulations show that PDC outperforms the other policies. We also examine the prefetching of search results, and demonstrate that prefetching can increase cache hit ratios by 50% for large caches, and can double the hit ratios of small caches. When integrating prefetching into PDC, we attain hit ratios of over 0.53.
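The trace-driven evaluation described in this abstract can be illustrated with a minimal sketch of the LRU baseline only. This is not PDC, whose probabilistic user model is not specified in the abstract. The class name `LRUResultCache`, the function `hit_ratio`, and the toy trace are hypothetical and chosen purely for illustration.

```python
from collections import OrderedDict


class LRUResultCache:
    """Fixed-capacity cache of query result pages with LRU eviction (illustrative)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._pages = OrderedDict()  # query -> cached result page

    def get(self, query):
        """Return the cached page, or None on a miss; a hit refreshes recency."""
        if query not in self._pages:
            return None
        self._pages.move_to_end(query)
        return self._pages[query]

    def put(self, query, page):
        """Store a page and evict the least recently used entry if over capacity."""
        self._pages[query] = page
        self._pages.move_to_end(query)
        if len(self._pages) > self.capacity:
            self._pages.popitem(last=False)


def hit_ratio(trace, capacity):
    """Replay a query trace through an LRU cache and return hits / queries."""
    cache = LRUResultCache(capacity)
    hits = 0
    for query in trace:
        if cache.get(query) is not None:
            hits += 1
        else:
            # Assumption: a miss is answered by the back end and then cached.
            cache.put(query, f"results for {query}")
    return hits / len(trace) if trace else 0.0


if __name__ == "__main__":
    toy_trace = ["a", "b", "a", "c", "b", "a"]
    for size in (2, 3):
        print(size, hit_ratio(toy_trace, size))
```

Replaying the same trace at several capacities is the simplest way to compare policies on equal footing; a different policy would only need to replace the eviction logic in `put`.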

    Procesamiento de consultas en motores de búsqueda: diseño y evaluación en términos de consumo de energía

    Data centers accessed by Web search engines, together with personal computers, currently consume 10% of the world's energy, and approximately 2% of that percentage is consumed by the search engines and their data centers alone. These percentages are expected to increase by 30% or 40% in the coming years, because the size of the Web tends to double every eight months, the number of users connecting to it keeps growing, and search engines meet the growing demand by adding hardware. This work presents the objectives and challenges of a line of research covering the energy consumption problems that large computing and data centers, and Web search engines in particular, must currently solve. Track: Distributed and parallel processing. Red de Universidades con Carreras en Informática (RedUNCI)

    The egalitarian effect of search engines

    Search engines have become key media for our scientific, economic, and social activities by enabling people to access information on the Web in spite of its size and complexity. On the downside, search engines bias the traffic of users according to their page-ranking strategies, and some have argued that they create a vicious cycle that amplifies the dominance of established and already popular sites. We show that, contrary to these prior claims and our own intuition, the use of search engines actually has an egalitarian effect. We reconcile theoretical arguments with empirical evidence showing that the combination of retrieval by search engines and search behavior by users mitigates the attraction of popular pages, directing more traffic toward less popular sites, even in comparison to what would be expected from users randomly surfing the Web. Comment: 9 pages, 8 figures, 2 appendices. The final version of this e-print has been published in Proc. Natl. Acad. Sci. USA 103(34), 12684-12689 (2006), http://www.pnas.org/cgi/content/abstract/103/34/1268

    Distribution and Use of Knowledge under the “Laws of the Web”

    Empirical evidence shows that the perception of information is strongly concentrated in those environments in which a mass of producers and users of knowledge interact through a distribution medium. This paper considers the consequences of this fact for economic equilibrium analysis. In particular, it examines how the ranking schemes applied by the distribution technology affect the use of knowledge, and it then describes the characteristics of an optimal ranking scheme. The analysis is carried out using a model in which agents’ productivity is based on the stock of knowledge used. The value of a piece of information is assessed in terms of its contribution to productivity. Keywords: global rankings, information and internet services, limited attention, diversity, knowledge society

    Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results

    In-degree, PageRank, number of visits and other measures of Web page popularity significantly influence the ranking of search results by modern search engines. The assumption is that popularity is closely correlated with quality, a more elusive concept that is difficult to measure directly. Unfortunately, the correlation between popularity and quality is very weak for newly-created pages that have yet to receive many visits and/or in-links. Worse, since discovery of new content is largely done by querying search engines, and because users usually focus their attention on the top few results, newly-created but high-quality pages are effectively "shut out," and it can take a very long time before they become popular. We propose a simple and elegant solution to this problem: the introduction of a controlled amount of randomness into search result ranking methods. Doing so offers new pages a chance to prove their worth, although clearly using too much randomness will degrade result quality and annul any benefits achieved. Hence there is a tradeoff between exploration to estimate the quality of new pages and exploitation of pages already known to be of high quality. We study this tradeoff both analytically and via simulation, in the context of an economic objective function based on aggregate result quality amortized over time. We show that a modest amount of randomness leads to improved search results.
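The exploration/exploitation idea in this abstract can be sketched as follows. This is one simple, hypothetical way to inject a controlled amount of randomness into a ranking and is not claimed to be the authors' scheme. The function name `partially_randomized_ranking` and the parameter `promote_prob` are assumptions introduced for illustration.

```python
import random


def partially_randomized_ranking(ranked, candidates, promote_prob, rng):
    """Merge a popularity-ordered list with a pool of new, unexplored pages.

    At each output position, with probability promote_prob a page is drawn
    uniformly at random from the pool of new pages; otherwise the next page
    from the deterministic ranking is emitted. The relative order of the
    already-ranked pages is always preserved.
    """
    pool = list(candidates)
    result = []
    i = 0  # index of the next page in the deterministic ranking
    while i < len(ranked) or pool:
        # Once the ranked list is exhausted, the rest of the pool is emitted.
        take_new = bool(pool) and (i >= len(ranked) or rng.random() < promote_prob)
        if take_new:
            result.append(pool.pop(rng.randrange(len(pool))))
        else:
            result.append(ranked[i])
            i += 1
    return result


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    established = ["p1", "p2", "p3", "p4"]
    new_pages = ["n1", "n2"]
    print(partially_randomized_ranking(established, new_pages, 0.2, rng))
```

Setting `promote_prob` to 0 reproduces the purely popularity-based order for the established pages, while larger values trade result quality for more exposure of new pages, which is the tradeoff the abstract describes.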

    Query-driven document partitioning and collection selection

    We present a novel strategy to partition a document collection onto several servers and to perform effective collection selection. The method is based on the analysis of query logs. We propose a novel document representation called the query-vectors model. Each document is represented as a list recording the queries for which the document itself is a match, along with their ranks. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results, and are used for collection selection. We show that this document partition strategy greatly boosts the performance of standard collection selection algorithms, including CORI, compared with a round-robin assignment. Secondly, we show that by performing collection selection by matching the query to the existing query clusters and successively choosing only one server, we reach an average precision-at-5 up to 1.74 and we consistently improve CORI precision by a factor between 11% and 15%. As a side result we show a way to select rarely asked-for documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52% of the documents (3,128,366) are not returned among the first 100 top-ranked results of any query.
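The query-vectors representation described in this abstract can be sketched as below, assuming the query log has already been reduced to a mapping from each query to its ranked result list. The co-clustering step is omitted. The names `build_query_vectors` and `never_returned`, and the toy data, are hypothetical.

```python
def build_query_vectors(query_results, all_doc_ids, top_k=100):
    """Represent each document by the queries it matches and its rank in each.

    query_results: dict mapping a query string to its ranked list of doc ids.
    all_doc_ids:   every document in the collection, including unmatched ones.
    top_k:         only the first top_k results of each query are considered.
    Returns a dict: doc id -> {query: rank}, with ranks starting at 1.
    """
    vectors = {doc: {} for doc in all_doc_ids}
    for query, ranked_docs in query_results.items():
        for rank, doc in enumerate(ranked_docs[:top_k], start=1):
            vectors.setdefault(doc, {})[query] = rank
    return vectors


def never_returned(vectors):
    """Documents with an empty query vector, i.e. not returned by any query."""
    return sorted(doc for doc, qv in vectors.items() if not qv)


if __name__ == "__main__":
    log = {"q1": ["d1", "d2"], "q2": ["d2", "d3"]}
    docs = ["d1", "d2", "d3", "d4"]
    vecs = build_query_vectors(log, docs)
    print(vecs["d2"])
    print(never_returned(vecs))
```

Documents whose vector is empty correspond to the rarely asked-for documents mentioned above, which is why the sketch keeps unmatched documents in the output instead of dropping them.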

    Multi-Faceted Search and Navigation of Biological Databases

    Diversity of Online Community Activities

    Web sites where users create and rate content as well as form networks with other users display long-tailed distributions in many aspects of behavior. Using behavior on one such community site, Essembly, we propose and evaluate plausible mechanisms to explain these behaviors. Unlike purely descriptive models, these mechanisms rely on user behaviors based on information available locally to each user. For Essembly, we find the long tails arise from large differences among user activity rates and qualities of the rated content, as well as the extensive variability in the time users devote to the site. We show that the models not only explain overall behavior but also allow estimating the quality of content from their early behaviors. Comment: 14 page