9,078 research outputs found

    Modeling the Performance of MapReduce Applications for the Cloud

    In recent years, cloud computing has become a key technology that makes it possible to run applications without deploying physical infrastructure. The challenge in deploying distributed applications in cloud computing environments is that the virtual machine infrastructure must be planned in a time- and cost-effective way. This work summarizes an earlier work presented by the authors as a Master's thesis, whose goal was to show that the execution time of a distributed MapReduce application running in a cloud computing environment can be predicted using a mathematical model based on theoretical specifications. The prediction is intended to help users of the cloud computing environment plan their deployments, i.e., quantify the number of virtual machines and their characteristics. The application execution time was measured while varying the parameters stated in the mathematical model; a linear regression technique was then used to obtain a model of the execution time, which was applied to predict the execution time of MapReduce applications. Experiments conducted in several configurations showed a clear relation with the theoretical model, indicating that the model is able to predict the execution time of MapReduce applications. The developed model is generic, meaning that it uses theoretical abstractions for the computing capacity of the environment and the computing cost of the MapReduce application.
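
The regression step described in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the single predictor (input size divided by number of virtual machines), the function names `fit_line` and `predict_time`, and the measurements are all hypothetical, chosen only to show how a fitted line can turn measured runs into a prediction for an untested configuration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_time(intercept, slope, input_gb, num_vms):
    """Predict execution time (seconds) from the assumed load-per-VM predictor."""
    return intercept + slope * (input_gb / num_vms)


# Hypothetical measurements: (input size in GB, number of VMs, seconds).
runs = [(10, 2, 130.0), (10, 4, 80.0), (20, 4, 130.0), (20, 8, 80.0), (40, 8, 130.0)]

# Assumed predictor: amount of input data each VM has to process.
load_per_vm = [gb / vms for gb, vms, _ in runs]
seconds = [t for _, _, t in runs]

intercept, slope = fit_line(load_per_vm, seconds)

# Estimate a configuration that was not measured: 40 GB on 4 VMs.
estimate = predict_time(intercept, slope, 40, 4)
print(f"fixed overhead: {intercept:.1f} s, cost per GB per VM: {slope:.1f} s")
print(f"estimated time for 40 GB on 4 VMs: {estimate:.1f} s")
```

In this sketch the intercept plays the role of a fixed job overhead and the slope the processing cost per unit of load, so a user could compare estimates for several VM counts before deploying.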

    Performance Modeling and Resource Management for Mapreduce Applications

    Big Data analytics is increasingly performed using the MapReduce paradigm, with its open-source implementation Hadoop as the platform of choice. Many applications associated with live business intelligence are written as complex data analysis programs defined by directed acyclic graphs of MapReduce jobs. An increasing number of these applications have additional requirements for completion time guarantees. The advent of cloud computing offers a competitive alternative for data analytics problems, but it also introduces new challenges in provisioning clusters that provide the best cost-performance trade-offs. In this dissertation, we aim to develop a performance evaluation framework that enables automatic resource management for MapReduce applications in pursuit of different optimization goals. It consists of the following components: (1) a performance modeling framework that estimates the completion time of a given MapReduce application executed on a Hadoop cluster according to its input data sets, the job settings, and the amount of resources allocated to it; (2) a resource allocation strategy for deadline-driven MapReduce applications that automatically tailors and controls the allocation of resources on a shared Hadoop cluster to different applications so that they meet their (soft) deadlines; (3) a simulator-based solution to the resource provisioning problem in public cloud environments that guides users in determining the types and amount of resources they should lease from the service provider to achieve different goals; (4) an optimization strategy that automatically determines the optimal job settings within a MapReduce application for efficient execution and resource usage. We validate the accuracy, efficiency, and performance benefits of the proposed framework using a set of realistic MapReduce applications in both private cluster and public cloud environments.

    Allocating MapReduce workflows with deadlines to heterogeneous servers in a cloud data center

    Total profit is one of the most important factors to be considered from the perspective of resource providers. In this paper, an original MapReduce workflow scheduling method with deadlines and data locality is proposed to maximize the total profit of resource providers. A new workflow conversion based on dynamic programming and ChainMap/ChainReduce is designed to decrease transmission times among the MapReduce jobs of workflows. A new deadline division considering execution time, float time, and job level is proposed to obtain better deadlines for the MapReduce jobs in workflows. With the adapted replica strategy in MapReduce workflows, a new task scheduling method is proposed to improve data locality; it assigns tasks to the servers with the earliest completion time so that resource providers obtain more profit. Experimental results show that the proposed heuristic yields a larger total profit than the other adopted algorithms.
    This work is supported by the National Key Research and Development Program of China (No. 2017YFB1400801), the National Natural Science Foundation of China (Nos. 61872077, 61832004) and the Collaborative Innovation Center of Wireless Communications Technology. Rubén Ruiz is partly supported by the Spanish Ministry of Science, Innovation, and Universities, under the project "OPTEP-Port Terminal Operations Optimization" (No. RTI2018-094940-B-I00), financed with FEDER funds.
    Wang, J.; Li, X.; Ruiz García, R.; Xu, H.; Chu, D. (2020). Allocating MapReduce workflows with deadlines to heterogeneous servers in a cloud data center. Service Oriented Computing and Applications. 14(2):101-118. https://doi.org/10.1007/s11761-020-00290-1

    Cloud Computing Using Python

    This work presents cloud computing using the Python programming language, the environment in which the computing takes place, and the advantages and disadvantages that using this technology can bring. The first chapter defines the concept of cloud computing, its structure, and its characteristics. It lists the models, types, and participants of cloud computing, as well as the reasons for choosing cloud computing. The second chapter presents MapReduce jobs and their connection with cloud computing. It describes in detail the powerful tool Mrjob, which is the link between the cloud and the process itself. It also explains the procedure for starting a process in the cloud (Amazon), with examples and screenshots from the terminal and the cloud. PiCloud is then described, together with an example of a process running in it. The third chapter covers the economic structure and security problems, and proposes solutions to those problems. The work closes with a description of the historical development of cloud computing and predictions about the direction of its future development.
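
To make the MapReduce jobs mentioned in this abstract concrete, here is a minimal plain-Python sketch of a word count, the usual introductory example. It does not use Mrjob or any cloud service; the names `mapper`, `reducer`, and `run_job` and the sample input are assumptions made for illustration, and the grouping step that a real framework performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: combine all counts emitted for the same word."""
    return word, sum(counts)


def run_job(lines):
    """Run map, group values by key (the 'shuffle'), then reduce, all locally."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Hypothetical input standing in for lines read from a file in the cloud.
result = run_job(["cloud computing with Python", "Python in the cloud"])
print(result)
```

Because the mapper and reducer only see one line or one key at a time, the same two functions could in principle be run on many machines at once, which is the property a cloud MapReduce tool relies on.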

    Performance Analysis of Hadoop MapReduce And Apache Spark for Big Data

    In recent years, the volume of information has grown at an exponential rate. To yield new insights, this information must be carefully interpreted and analyzed, so there is a need for a system that can process data efficiently at all times. Distributed cloud data processing platforms are important tools for large-scale data analytics. In this area, Apache Hadoop (High-Availability Distributed Object-Oriented Platform) MapReduce has evolved as the standard. A MapReduce job reads its input data, processes it, and then writes the result back to the Hadoop Distributed File System (HDFS). Its programming interface is limited, however, which has led to the development of modern data-flow-oriented frameworks such as Apache Spark, which uses Resilient Distributed Datasets (RDDs) to keep data structures in memory. Since RDDs can be stored in memory, algorithms can iterate over their data many times very efficiently. Cluster computing is a major investment for any organization that chooses to perform Big Data analysis. MapReduce and Spark are two well-known open-source cluster-computing frameworks for Big Data analysis. Cluster computing hides task complexity behind simple, user-friendly programming and offers low latency. It improves throughput and provides backup uptime should the main system fail. Its features include flexibility, task scheduling, higher availability, and faster processing speed. Big Data analytics has become more compute-intensive as data management becomes a major issue for scientific computation, and High-Performance Computing (HPC) is of great importance for Big Data processing. The main application of this research work is towards the realization of HPC for Big Data analysis. This thesis investigates the processing capability and efficiency of Hadoop MapReduce and Apache Spark using Cloudera Manager (CM). Cloudera Manager provides end-to-end cluster management for the Cloudera Distribution for Apache Hadoop (CDH). The implementation was carried out with Amazon Web Services (AWS), which was used to configure a Windows Virtual Machine (VM). Four Linux instances of the free-tier-eligible type t2.micro were launched using Amazon Elastic Compute Cloud (EC2) and configured into four cluster nodes using Secure Shell (SSH). A Big Data application was generated and injected, and both MapReduce and Spark jobs were run with different queries such as scan, aggregation, two-way join, and three-way join. The time taken for each task to complete was recorded, observed, and thoroughly analyzed. It was observed that Spark executes jobs faster than MapReduce.