Predictive caching and prefetching of query results in search engines
We study the caching of query result pages in Web search engines. Popular search engines receive millions of queries per day, and efficient policies for caching query results may enable them to lower their response time and reduce their hardware requirements. We present PDC (probability driven cache), a novel scheme tailored for caching search results, that is based on a probabilistic model of search engine users. We then use a trace of over seven million queries submitted to the search engine AltaVista to evaluate PDC, as well as traditional LRU and SLRU based caching schemes. The trace driven simulations show that PDC outperforms the other policies. We also examine the prefetching of search results, and demonstrate that prefetching can increase cache hit ratios by 50% for large caches, and can double the hit ratios of small caches. When integrating prefetching into PDC, we attain hit ratios of over 0.53.
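The baseline policies this abstract compares against can be sketched briefly. The code below is a minimal illustration of an LRU query-result cache combined with next-page prefetching; it is not the PDC algorithm itself, and all class, function, and parameter names here are illustrative assumptions.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used entry on overflow."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # key -> cached result page

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)  # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

def fetch(query, page):
    """Stand-in for the backend that computes a result page."""
    return f"results:{query}:{page}"

def lookup(cache, query, page, prefetch=1):
    """Serve (query, page) from cache; on a miss, fetch it and also
    prefetch the next `prefetch` result pages, so a user paging forward
    hits the cache -- the effect the abstract measures."""
    hit = cache.get((query, page)) is not None
    if not hit:
        for p in range(page, page + 1 + prefetch):
            cache.put((query, p), fetch(query, p))
    return cache.get((query, page)), hit
```

In a trace-driven simulation, the hit ratio is simply the fraction of `lookup` calls that return `hit=True` over the query log.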
Query processing in search engines: design and evaluation in terms of energy consumption
Today, the data centers accessed by Web search engines, together with personal computers, consume 10% of the world's energy, and of that percentage approximately 2% is consumed by search engines and their data centers alone. However, these percentages are expected to increase by 30% to 40% in the coming years, because the size of the Web tends to double every eight months, the number of users connecting to it keeps growing, and search engines meet this growing demand by adding hardware.
This work presents the goals and challenges of a line of research covering the energy-consumption problems that large computing and data centers, and Web search engines in particular, must solve today. Track: Distributed and Parallel Processing. Red de Universidades con Carreras en Informática (RedUNCI).
The egalitarian effect of search engines
Search engines have become key media for our scientific, economic, and social
activities by enabling people to access information on the Web in spite of its
size and complexity. On the down side, search engines bias the traffic of users
according to their page-ranking strategies, and some have argued that they
create a vicious cycle that amplifies the dominance of established and already
popular sites. We show that, contrary to these prior claims and our own
intuition, the use of search engines actually has an egalitarian effect. We
reconcile theoretical arguments with empirical evidence showing that the
combination of retrieval by search engines and search behavior by users
mitigates the attraction of popular pages, directing more traffic toward less
popular sites, even in comparison to what would be expected from users randomly
surfing the Web.
Comment: 9 pages, 8 figures, 2 appendices. The final version of this e-print has been published in Proc. Natl. Acad. Sci. USA 103(34), 12684-12689 (2006), http://www.pnas.org/cgi/content/abstract/103/34/1268
Distribution and Use of Knowledge under the “Laws of the Web”
Empirical evidence shows that the perception of information is strongly concentrated in those environments in which a mass of producers and users of knowledge interact through a distribution medium. This paper considers the consequences of this fact for economic equilibrium analysis. In particular, it examines how the ranking schemes applied by the distribution technology affect the use of knowledge, and it then describes the characteristics of an optimal ranking scheme. The analysis is carried out using a model in which agents' productivity is based on the stock of knowledge used. The value of a piece of information is assessed in terms of its contribution to productivity.
Keywords: global rankings, information and internet services, limited attention, diversity, knowledge society
Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results
In-degree, PageRank, number of visits and other measures of Web page
popularity significantly influence the ranking of search results by modern
search engines. The assumption is that popularity is closely correlated with
quality, a more elusive concept that is difficult to measure directly.
Unfortunately, the correlation between popularity and quality is very weak for
newly-created pages that have yet to receive many visits and/or in-links.
Worse, since discovery of new content is largely done by querying search
engines, and because users usually focus their attention on the top few
results, newly-created but high-quality pages are effectively "shut out," and
it can take a very long time before they become popular.
We propose a simple and elegant solution to this problem: the introduction of
a controlled amount of randomness into search result ranking methods. Doing so
offers new pages a chance to prove their worth, although clearly using too much
randomness will degrade result quality and annul any benefits achieved. Hence
there is a tradeoff between exploration to estimate the quality of new pages
and exploitation of pages already known to be of high quality. We study this
tradeoff both analytically and via simulation, in the context of an economic
objective function based on aggregate result quality amortized over time. We
show that a modest amount of randomness leads to improved search results.
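The exploration/exploitation tradeoff described above can be illustrated with a toy partially randomized ranker: with a small probability, a result slot is filled by a random candidate rather than the top-scored one, giving new pages a chance to be seen. This is a hedged sketch under assumed names and parameters, not the paper's actual mechanism or analysis.

```python
import random

def randomized_rank(pages, score, k, eps=0.1, rng=random):
    """Return k pages: mostly in descending `score` order (exploitation),
    but each slot is filled uniformly at random from the remaining
    candidates with probability `eps` (exploration)."""
    remaining = sorted(pages, key=score, reverse=True)
    ranking = []
    while remaining and len(ranking) < k:
        if rng.random() < eps:
            pick = rng.randrange(len(remaining))  # explore: random candidate
        else:
            pick = 0  # exploit: current best-scored candidate
        ranking.append(remaining.pop(pick))
    return ranking
```

Setting `eps=0` recovers purely score-based ranking; raising `eps` trades short-term result quality for faster discovery of new pages, which is the tradeoff the paper studies analytically.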
Query-driven document partitioning and collection selection
Abstract — We present a novel strategy to partition a document collection onto several servers and to perform effective collection selection. The method is based on the analysis of query logs. We propose a novel document representation called the query-vectors model: each document is represented as a list recording the queries for which the document is a match, along with their ranks. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results and are used for collection selection. We show that this document partitioning strategy greatly boosts the performance of standard collection selection algorithms, including CORI, w.r.t. a round-robin assignment. Second, we show that by performing collection selection through matching the query to the existing query clusters and then choosing only one server, we reach an average precision-at-5 of up to 1.74 and consistently improve CORI precision by between 11% and 15%. As a side result, we show a way to identify rarely asked-for documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52% of the documents (3,128,366) are not returned among the first 100 top-ranked results of any query.
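The query-vectors representation can be sketched as follows. The toy log format and the reciprocal-rank weighting below are illustrative assumptions, not the paper's exact definitions; the point is that each document's vector lives in query space rather than term space.

```python
def build_query_vectors(query_log):
    """Build query-vector representations from a query log.

    query_log: iterable of (query, ranked_doc_ids) pairs, where
    ranked_doc_ids lists the matching documents in rank order.
    Returns {doc_id: {query: weight}}, with weight = 1/(rank+1) so a
    document ranked higher for a query gets a larger component.
    Documents matched by similar queries end up with similar vectors,
    which is what makes co-clustering queries and documents possible."""
    vectors = {}
    for query, ranked_docs in query_log:
        for rank, doc in enumerate(ranked_docs):
            vectors.setdefault(doc, {})[query] = 1.0 / (rank + 1)
    return vectors
```

Documents that appear in no query's result list get no vector at all, which mirrors the abstract's side result: rarely asked-for documents can be split off from the main index.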
Diversity of Online Community Activities
Web sites where users create and rate content as well as form networks with
other users display long-tailed distributions in many aspects of behavior.
Using behavior on one such community site, Essembly, we propose and evaluate
plausible mechanisms to explain these behaviors. Unlike purely descriptive
models, these mechanisms rely on user behaviors based on information available
locally to each user. For Essembly, we find the long-tails arise from large
differences among user activity rates and qualities of the rated content, as
well as the extensive variability in the time users devote to the site. We show
that the models not only explain overall behavior but also allow estimating the
quality of content from its early behavior.
Comment: 14 pages