
    A self-adapting latency/power tradeoff model for replicated search engines

    For many search settings, distributed/replicated search engines deploy a large number of machines to ensure efficient retrieval. This paper investigates how the power consumption of a replicated search engine can be automatically reduced when the system has low contention, without compromising its efficiency. We propose a novel self-adapting model to analyse the trade-off between latency and power consumption for distributed search engines. When query volumes are high and there is contention for resources, the model automatically increases the number of active machines in the system to maintain acceptable query response times. Conversely, when the load on the system is low and queries can be served easily, the model reduces the number of active machines, leading to power savings. The model bases its decisions on the current and historical query loads of the search engine. Our proposal is formulated as a general dynamic decision problem, which can be quickly solved by dynamic programming in response to changing query loads. Thorough experiments are conducted to validate the usefulness of the proposed adaptive model using historical Web search traffic submitted to a commercial search engine. Our results show that the proposed self-adapting model can achieve an energy saving of 33% while degrading mean query completion time by only 10 ms, compared to a baseline that provisions replicas based on the previous day's traffic.
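
    The paper formulates machine provisioning as a dynamic decision problem solved by dynamic programming in response to changing query loads. A minimal Python sketch of that general idea follows; the power cost, latency penalty and switching cost below are illustrative assumptions, not the paper's actual model.

    # Hedged sketch: choose how many machines to keep active in each time
    # interval, trading power against a latency penalty. All cost
    # functions here are illustrative assumptions.
    def plan_active_machines(loads, max_machines, power_cost=1.0,
                             latency_weight=50.0, switch_cost=5.0):
        n = len(loads)
        INF = float("inf")
        best = [0.0] * (max_machines + 1)   # best cost ending with m machines
        choice = [[0] * (max_machines + 1) for _ in range(n)]
        for t, load in enumerate(loads):
            new_best = [INF] * (max_machines + 1)
            for m in range(1, max_machines + 1):
                util = min(load / m, 0.999)          # crude utilisation
                # power grows with machine count; the latency penalty
                # explodes as utilisation approaches 1
                step = m * power_cost + latency_weight * util / (1.0 - util)
                for prev in range(1, max_machines + 1):
                    cost = best[prev] + step + switch_cost * abs(m - prev)
                    if cost < new_best[m]:
                        new_best[m] = cost
                        choice[t][m] = prev
            best = new_best
        # backtrack the cheapest provisioning plan
        m = min(range(1, max_machines + 1), key=lambda k: best[k])
        plan = [m]
        for t in range(n - 1, 0, -1):
            m = choice[t][m]
            plan.append(m)
        return list(reversed(plan))

    # e.g. four intervals with rising then falling query load
    print(plan_active_machines([2.0, 6.0, 9.0, 3.0], max_machines=12))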

    A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing

    Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication, and resource allocation and scheduling. We map the proposed taxonomy to various Data Grid systems, not only to validate the taxonomy but also to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems in order to better understand their goals and methodology, which helps evaluate their applicability to similar problems. The taxonomy also provides a "gap analysis" of the area, through which researchers can identify new issues for investigation. Finally, we hope that the proposed taxonomy and mapping provide an easy way for new practitioners to understand this complex area of research.

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that are based on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some ideas of the MapReduce framework but serve different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
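
    The programming model the survey builds on can be shown in a few lines. Below is a minimal single-process word-count sketch of the map and reduce phases; real frameworks distribute these phases across a cluster and handle scheduling and fault tolerance transparently.

    from collections import defaultdict

    def map_phase(record):
        # map: emit (key, value) pairs from one input record
        for word in record.split():
            yield word.lower(), 1

    def reduce_phase(key, values):
        # reduce: combine all values that share the same key
        return key, sum(values)

    def map_reduce(records):
        groups = defaultdict(list)
        for record in records:                  # map + shuffle
            for key, value in map_phase(record):
                groups[key].append(value)
        return dict(reduce_phase(k, vs) for k, vs in groups.items())

    print(map_reduce(["the quick fox", "the lazy dog"]))
    # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}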

    Shai: Enforcing Data-Specific Policies with Near-Zero Runtime Overhead

    Data retrieval systems such as online search engines and online social networks must comply with the privacy policies of personal and selectively shared data items, regulatory policies regarding data retention and censorship, and the provider's own policies regarding data use. Enforcing these policies is difficult and error-prone. Systematic techniques to enforce policies are either limited to type-based policies that apply uniformly to all data of the same type, or incur significant runtime overhead. This paper presents Shai, the first system that systematically enforces data-specific policies with near-zero overhead in the common case. Shai's key idea is to push as many policy checks as possible to an offline, ahead-of-time analysis phase, often relying on predicted values of runtime parameters such as the state of access control lists or connected users' attributes. Runtime interception is used sparingly, only to verify these predictions and to perform any remaining policy checks. Our prototype implementation relies on efficient, modern OS primitives for sandboxing and isolation. We present the design of Shai and quantify its overheads on an experimental data indexing and search pipeline based on the popular search engine Apache Lucene.
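
    Shai's core idea of checking policies ahead of time against predicted runtime state, and then only verifying those predictions at query time, can be sketched as follows. The policy model and all names here are illustrative assumptions, not Shai's actual interface.

    def offline_analysis(items, predicted_acls, user):
        """Ahead of time, compute which items the user may see,
        assuming the predicted ACL state still holds at query time."""
        return {item_id for item_id, acl in predicted_acls.items()
                if user in acl and item_id in items}

    def serve_query(results, allowed, predicted_acls, live_acls, user):
        """At runtime, only verify the prediction; fall back to a full
        policy check for items whose ACL changed since the analysis."""
        visible = []
        for item_id in results:
            if predicted_acls.get(item_id) == live_acls.get(item_id):
                if item_id in allowed:       # prediction held: no re-check
                    visible.append(item_id)
            elif user in live_acls.get(item_id, set()):  # residual check
                visible.append(item_id)
        return visible

    items = {"d1", "d2"}
    predicted = {"d1": {"alice"}, "d2": {"bob"}}
    live = {"d1": {"alice"}, "d2": {"alice", "bob"}}   # d2 shared later
    allowed = offline_analysis(items, predicted, "alice")
    print(serve_query(["d1", "d2"], allowed, predicted, live, "alice"))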

    Query scheduling techniques and power/latency trade-off model for large-scale search engines

    Web search engines have to cope with a rapid increase of information and high incoming query traffic. This situation has driven companies to build geographically distributed data centres housing thousands of computers, which consume enormous amounts of electricity and require a huge supporting infrastructure. At this scale, even minor efficiency improvements result in large financial savings. This thesis advances the state of the art in query scheduling and power consumption, helping large-scale data centres to build more efficient search engines. On the one hand, it proposes new scheduling techniques that decrease query response times by estimating which server will be idle soonest. On the other hand, it defines a simple mathematical model that establishes a trade-off between the power consumption and latency of a search engine. Using historical and current data, the model estimates the incoming query traffic and automatically increases/decreases the necessary number of active machines in the system. We achieve high energy savings throughout the day without degrading latency. Our experiments attest to the effectiveness of both the scheduling methods and the power/latency trade-off model in improving efficiency and achieving high energy savings.
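
    The scheduling contribution, routing each query to the replica expected to become idle soonest, can be sketched with a priority queue of predicted idle times. The per-query cost estimator below is a placeholder assumption; the thesis develops its own predictors.

    import heapq

    def schedule(queries, n_servers, estimate_cost):
        """Assign each query to the server with the earliest predicted
        idle time; returns (server_id, start_time) per query."""
        servers = [(0.0, s) for s in range(n_servers)]  # (idle_time, id)
        heapq.heapify(servers)
        assignment = []
        for q in queries:
            idle_at, sid = heapq.heappop(servers)       # soonest-idle server
            assignment.append((sid, idle_at))
            heapq.heappush(servers, (idle_at + estimate_cost(q), sid))
        return assignment

    # placeholder cost: proportional to the number of query terms
    print(schedule(["web search", "energy", "power latency trade off"],
                   n_servers=2, estimate_cost=lambda q: len(q.split())))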

    A Taxonomy of Workflow Management Systems for Grid Computing

    With the advent of Grid and application technologies, scientists and engineers are building increasingly complex applications to manage and process large data sets and execute scientific experiments on distributed resources. Such application scenarios require means for composing and executing complex workflows. Therefore, many efforts have been made towards the development of workflow management systems for Grid computing. In this paper, we propose a taxonomy that characterises and classifies the various approaches to building and executing workflows on Grids. We also survey several representative Grid workflow systems developed by projects world-wide to demonstrate the comprehensiveness of the taxonomy. The taxonomy not only highlights the design and engineering similarities and differences of state-of-the-art Grid workflow systems, but also identifies areas that need further research.