861 research outputs found

    The Family of MapReduce and Large Scale Data Processing Systems

    Full text link
    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author

    Matrix Factorization at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

    Full text link
    We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely-used and important matrix factorizations: NMF (for physical plausability), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall-and-skinny which enable the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance

    Teadusarvutuse algoritmide taandamine hajusarvutuse raamistikele

    Get PDF
    Teadusarvutuses kasutatakse arvuteid ja algoritme selleks, et lahendada probleeme erinevates reaalteadustes nagu geneetika, bioloogia ja keemia. Tihti on eesmärgiks selliste loodusnähtuste modelleerimine ja simuleerimine, mida päris keskkonnas oleks väga raske uurida. Näiteks on võimalik luua päikesetormi või meteoriiditabamuse mudel ning arvutisimulatsioonide abil hinnata katastroofi mõju keskkonnale. Mida keerulisemad ja täpsemad on sellised simulatsioonid, seda rohkem arvutusvõimsust on vaja. Tihti kasutatakse selleks suurt hulka arvuteid, mis kõik samaaegselt töötavad ühe probleemi kallal. Selliseid arvutusi nimetatakse paralleel- või hajusarvutusteks. Hajusarvutuse programmide loomine on aga keeruline ning nõuab palju rohkem aega ja ressursse, kuna vaja on sünkroniseerida erinevates arvutites samaaegselt tehtavat tööd. On loodud mitmeid tarkvararaamistikke, mis lihtsustavad seda tööd automatiseerides osa hajusprogrammeerimisest. Selle teadustöö eesmärk oli uurida selliste hajusarvutusraamistike sobivust keerulisemate teadusarvutuse algoritmide jaoks. Tulemused näitasid, et olemasolevad raamistikud on üksteisest väga erinevad ning neist ükski ei ole sobiv kõigi erinevat tüüpi algoritmide jaoks. Mõni raamistik on sobiv ainult lihtsamate algoritmide jaoks; mõni ei sobi olukorras, kus andmed ei mahu arvutite mällu. Algoritmi jaoks kõige sobivama hajusarvutisraamistiku valimine võib olla väga keeruline ülesanne, kuna see nõuab olemasolevate raamistike uurimist ja rakendamist. Sellele probleemile lahendust otsides otsustati luua dünaamiline algoritmide modelleerimise rakendus (DAMR), mis oskab simuleerida algoritmi implementatsioone erinevates hajusarvutusraamistikes. DAMR aitab hinnata milline hajusraamistik on kõige sobivam ette antud algoritmi jaoks, ilma algoritmi reaalselt ühegi hajusraamistiku peale implementeerimata. Selle uurimustöö peamine panus on hajusarvutusraamistike kasutuselevõtu lihtsamaks tegemine teadlastele, kes ei ole varem nende kasutamisega kokku puutunud. See peaks märkimisväärselt aega ja ressursse kokku hoidma, kuna ei pea ükshaaval kõiki olemasolevaid hajusraamistikke tundma õppima ja rakendama.Scientific computing uses computers and algorithms to solve problems in various sciences such as genetics, biology and chemistry. Often the goal is to model and simulate different natural phenomena which would otherwise be very difficult to study in real environments. For example, it is possible to create a model of a solar storm or a meteor hit and run computer simulations to assess the impact of the disaster on the environment. The more sophisticated and accurate the simulations are the more computing power is required. It is often necessary to use a large number of computers, all working simultaneously on a single problem. These kind of computations are called parallel or distributed computing. However, creating distributed computing programs is complicated and requires a lot more time and resources, because it is necessary to synchronize different computers working at the same time. A number of software frameworks have been created to simplify this process by automating part of a distributed programming. The goal of this research was to assess the suitability of such distributed computing frameworks for complex scientific computing algorithms. The results showed that existing frameworks are very different from each other and none of them are suitable for all different types of algorithms. Some frameworks are only suitable for simple algorithms; others are not suitable when data does not fit into the computer memory. Choosing the most appropriate distributed computing framework for an algorithm can be a very complex task, because it requires studying and applying the existing frameworks. While searching for a solution to this problem, it was decided to create a Dynamic Algorithms Modelling Application (DAMA), which is able to simulate the implementation of the algorithm in different distributed computing frameworks. DAMA helps to estimate which distributed framework is the most appropriate for a given algorithm, without actually implementing it in any of the available frameworks. This main contribution of this study is simplifying the adoption of distributed computing frameworks for researchers who are not yet familiar with using them. It should save significant time and resources as it is not necessary to study each of the available distributed computing frameworks in detail

    Implementing Parallel Differential Evolution on Spark

    Get PDF
    [Abstract] Metaheuristics are gaining increased attention as an efficient way of solving hard global optimization problems. Differential Evolution (DE) is one of the most popular algorithms in that class. However, its application to realistic problems results in excessive computation times. Therefore, several parallel DE schemes have been proposed, most of them focused on traditional parallel programming interfaces and infrastruc- tures. However, with the emergence of Cloud Computing, new program- ming models, like Spark, have appeared to suit with large-scale data processing on clouds. In this paper we investigate the applicability of Spark to develop parallel DE schemes to be executed in a distributed environment. Both the master-slave and the island-based DE schemes usually found in the literature have been implemented using Spark. The speedup and efficiency of all the implementations were evaluated on the Amazon Web Services (AWS) public cloud, concluding that the island- based solution is the best suited to the distributed nature of Spark. It achieves a good speedup versus the serial implementation, and shows a decent scalability when the number of nodes grows.[Resumen] Las metaheurísticas están recibiendo una atención creciente como técnica eficiente en la resolución de problemas difíciles de optimización global. Differential Evolution (DE) es una de las metaheurísticas más populares, sin embargo su aplicación en problemas reales deriva en tiempos de cómputo excesivos. Por ello se han realizado diferentes propuestas para la paralelización del DE, en su mayoría utilizando infraestructuras e interfaces de programación paralela tradicionales. Con la aparición de la computación en la nube también se han propuesto nuevos modelos de programación, como Spark, que permiten manejar el procesamiento de datos a gran escala en la nube. En este artículo investigamos la aplicabilidad de Spark en el desarrollo de implementaciones paralelas del DE para su ejecución en entornos distribuidos. Se han implementado tanto la aproximación master-slave como la basada en islas, que son las más comunes. También se han evaluado la aceleración y la eficiencia de todas las implementaciones usando el cloud público de Amazon (AWS, Amazon Web Services), concluyéndose que la implementación basada en islas es la más adecuada para el esquema de distribución usado por Spark. Esta implementación obtiene una buena aceleración en relación a la implementación serie y muestra una escalabilidad bastante buena cuando el número de nodos aumenta.[Resume] As metaheurísticas están recibindo unha atención a cada vez maior como técnica eficiente na resolución de problemas difíciles de optimización global. Differential Evolution (DE) é unha das metaheurísticas mais populares, ainda que a sua aplicación a problemas reais deriva en tempos de cómputo excesivos. É por iso que se propuxeron diferentes esquemas para a paralelización do DE, na sua maioría utilizando infraestruturas e interfaces de programación paralela tradicionais. Coa aparición da computación na nube tamén se propuxeron novos modelos de programación, como Spark, que permiten manexar o procesamento de datos a grande escala na nube. Neste artigo investigamos a aplicabilidade de Spark no desenvolvimento de implementacións paralelas do DE para a sua execución en contornas distribuidas. Implementáronse tanto a aproximación master-slave como a baseada en illas, que son as mais comúns. Tamén se avaliaron a aceleración e a eficiencia de todas as implementacións usando o cloud público de Amazon (AWS, Amazon Web Services), tirando como conclusión que a implementación baseada en illas é a mais acaida para o esquema de distribución usado por Spark. Esta implementación obtén unha boa aceleración en relación á implementación serie e amosa unha escalabilidade bastante boa cando o número de nos aumenta.Ministerio de Economía y Competitividad; DPI2014-55276-C5-2-RXunta de Galicia; GRC2013/055Xunta de Galicia; R2014/04

    Resource Management and Scheduling for Big Data Applications in Cloud Computing Environments

    Get PDF
    This chapter presents software architectures of the big data processing platforms. It will provide an in-depth knowledge on resource management techniques involved while deploying big data processing systems on cloud environment. It starts from the very basics and gradually introduce the core components of resource management which we have divided in multiple layers. It covers the state-of-art practices and researches done in SLA-based resource management with a specific focus on the job scheduling mechanisms.Comment: 27 pages, 9 figure
    corecore