41 research outputs found
Heuristics for periodical batch job scheduling in a MapReduce computing framework
Task scheduling has a significant impact on the performance of the MapReduce computing
framework. In this paper, a scheduling problem of periodical batch jobs with makespan minimization
is considered. The problem is modeled as a general two-stage hybrid flow shop
scheduling problem with schedule-dependent setup times. The new model incorporates the
data locality of tasks and is formulated as an integer program. Three heuristics are developed
to solve the problem and an improvement policy based on data locality is presented to enhance
the methods. A lower bound of the makespan is derived. 150 instances are randomly
generated from data distributions drawn from a real cluster. The parameters involved in the
methods are set according to different cluster setups. The proposed heuristics are compared
over different numbers of jobs and cluster setups. Computational results show that the performance
of the methods is highly dependent on both the number of jobs and the cluster setups.
The proposed improvement policy is effective and the impact of the input data distribution on
the policy is analyzed and tested. This work is supported by the National Natural Science Foundation of China (No. 61272377) and the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20120092110027). Ruben Ruiz is partially supported by the Spanish Ministry of Economy and Competitiveness, under the project "RESULT - Realistic Extended Scheduling Using Light Techniques" (No. DPI2012-36243-C02-01) partially financed with FEDER funds. Xiaoping Li; Tianze Jiang; Ruiz García, R. (2016). Heuristics for periodical batch job scheduling in a MapReduce computing framework. Information Sciences. 326:119-133. https://doi.org/10.1016/j.ins.2015.07.040
Allocating MapReduce workflows with deadlines to heterogeneous servers in a cloud data center
[EN] Total profit is one of the most important factors to be considered from the perspective of resource providers. In this paper, an original MapReduce workflow scheduling with deadline and data locality is proposed to maximize total profit of resource providers. A new workflow conversion based on dynamic programming and ChainMap/ChainReduce is designed to decrease transmission times among MapReduce jobs of workflows. A new deadline division considering execution time, float time and job level is proposed to obtain better deadlines of MapReduce jobs in workflows. With the adapted replica strategy in MapReduce workflow, a new task scheduling is proposed to improve data locality which assigns tasks to servers with the earliest completion time in order to ensure resource providers obtain more profit. Experimental results show that the proposed heuristic results in larger total profit than other adopted algorithms. This work is supported by the National Key Research and Development Program of China (No. 2017YFB1400801), the National Natural Science Foundation of China (Nos. 61872077, 61832004) and Collaborative Innovation Center of Wireless Communications Technology. Rubén Ruiz is partly supported by the Spanish Ministry of Science, Innovation, and Universities, under
the project "OPTEP-Port Terminal Operations Optimization" (No. RTI2018-094940-B-I00) financed with FEDER funds. Wang, J.; Li, X.; Ruiz García, R.; Xu, H.; Chu, D. (2020). Allocating MapReduce workflows with deadlines to heterogeneous servers in a cloud data center. Service Oriented Computing and Applications. 14(2):101-118. https://doi.org/10.1007/s11761-020-00290-1
Processing Data-Intensive Workflows in the Cloud
In recent years, large-scale data analysis has become critical to the success of modern enterprises. Meanwhile, with the emergence of cloud computing, companies are attracted to move their data analytics tasks to the cloud due to its flexible, on-demand resource usage and pay-as-you-go pricing model. MapReduce has been widely recognized as an important tool for performing large-scale data analysis in the cloud. It provides a simple and fault-tolerant framework for users to process data-intensive analytics tasks in parallel across different physical machines. In this report, we survey alternative implementations of MapReduce, contrasting batch-oriented and pipelined execution models, and study how these models impact response times, completion time and robustness. Next, we present three optimization strategies for MapReduce-style workflows, including (1) scan sharing across MapReduce programs, (2) workflow optimizations aimed at reducing intermediate data, and (3) scheduling policies that map workflow tasks to different machines in order to minimize completion times and monetary costs. We conclude with a brief comparison across these optimization strategies, and discuss their pros and cons as well as the performance implications of using more than one optimization strategy at a time. University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-12-07
Scheduling Stochastic Multi-Stage Jobs to Elastic Hybrid Cloud Resources
[EN] We consider a special workflow scheduling problem in a hybrid-cloud-based workflow management system in which tasks are linearly dependent, compute-intensive, stochastic, deadline-constrained and executed on elastic and distributed cloud resources. This kind of problem closely resembles many real-time and workflow-based applications. Three optimization objectives are explored: number, usage time and utilization of rented VMs. An iterated heuristic framework is presented to schedule jobs event by event, which mainly consists of job collecting and event scheduling. Two job collecting strategies are proposed and two timetabling methods are developed. The proposed methods are calibrated through detailed designs of experiments and sound statistical techniques. With the calibrated components and parameters, the proposed algorithm is compared to existing methods for related problems. Experimental results show that the proposal is robust and effective for the problems under study. This work is sponsored by the National Natural Science Foundations of China (Nos. 71401079, 61572127, 61472192), the National Key Research and Development Program of China (No. 2017YFB1400801) and the Collaborative Innovation Center of Wireless Communications Technology. Ruben Ruiz is partially supported by the Spanish Ministry of Economy and Competitiveness, under the project "SCHEYARD-Optimization of Scheduling Problems in Container Yards" (No. DPI2015-65895-R) financed by FEDER funds. Zhu, J.; Li, X.; Ruiz García, R.; Xu, X. (2018). Scheduling Stochastic Multi-Stage Jobs to Elastic Hybrid Cloud Resources. IEEE Transactions on Parallel and Distributed Systems. 29(6):1401-1415. https://doi.org/10.1109/TPDS.2018.2793254
MapReduce based RDF assisted distributed SVM for high throughput spam filtering
This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel University. Electronic mail has become cast and embedded in our everyday lives. Billions of legitimate emails are sent on a daily basis. The widely established underlying infrastructure, its widespread availability as well as its ease of use have all acted as catalysts to such pervasive proliferation. Unfortunately, the same can be alleged about unsolicited bulk email, or rather spam. Various methods, as well as enabling architectures, are available to try to mitigate spam permeation. In this respect, this dissertation complements existing survey work in this area by contributing an extensive literature review of traditional and emerging spam filtering approaches. Techniques, approaches and architectures employed for spam filtering are appraised, critically assessing respective strengths and weaknesses.
Velocity, volume and variety are key characteristics of the spam challenge. MapReduce (M/R) has become increasingly popular as an Internet scale, data intensive processing platform. In the context of machine learning based spam filter training, support vector machine (SVM) based techniques have been proven effective. SVM training is however a computationally intensive process. In this dissertation, a M/R based distributed SVM algorithm for scalable spam filter training, designated MRSMO, is presented. By distributing and processing subsets of the training data across multiple participating computing nodes, the distributed SVM reduces spam filter training time significantly. To mitigate the accuracy degradation introduced by the adopted approach, a Resource Description Framework (RDF) based feedback loop is evaluated. Experimental results demonstrate that this improves the accuracy levels of the distributed SVM beyond the original sequential counterpart.
Effectively exploiting large scale, ‘Cloud’ based, heterogeneous processing capabilities for M/R in what can be considered a non-deterministic environment requires the consideration of a number of perspectives. In this work, gSched, a Hadoop M/R based, heterogeneous aware task to node matching and allocation scheme is designed. Using MRSMO as a baseline, experimental evaluation indicates that gSched improves on the performance of the out-of-the box Hadoop counterpart in a typical Cloud based infrastructure.
The focal contribution to knowledge is a scalable, heterogeneous infrastructure and machine learning based spam filtering scheme, able to capitalize on collaborative accuracy improvements through RDF based, end user feedback.
Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters
Cloud computing is becoming a fundamental facility of society today. Large-scale public or private cloud datacenters spanning millions of servers, operating as warehouse-scale computers, support most of the business of Fortune 500 companies and serve billions of users around the world. Unfortunately, the modern industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only negatively impacts the operational and capital components of cost efficiency, but also becomes a scaling bottleneck due to the limits of electricity delivered by nearby utilities. Improving multi-resource efficiency for global datacenters is both critical and challenging.
Additionally, with the great commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous computation workloads including online web services, batch processing, machine learning, streaming computing, interactive query and graph computation on shared clusters. Most of them are long-running workloads that leverage long-lived containers to execute tasks.
We summarized work on datacenter resource scheduling over the last 15 years. Most previous work is designed to maximize cluster efficiency for short-lived tasks in batch processing systems like Hadoop. It is not suitable for modern long-running workloads of microservices or of systems like Spark, Flink, Pregel, Storm or TensorFlow. New, effective scheduling and resource allocation approaches are urgently needed to improve efficiency in large-scale enterprise datacenters.
In this dissertation, we are the first to define and identify the problems, challenges and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenters. These workloads rely on predictive scheduling techniques to perform reservation, auto-scaling, migration or rescheduling. This forces us to pursue and explore more intelligent scheduling techniques backed by adequate predictive knowledge. We specify what intelligent scheduling is, what abilities are necessary to achieve it, and how to leverage it to transform NP-hard online scheduling problems into solvable offline scheduling problems.
We designed and implemented an intelligent cloud datacenter scheduler, which automatically performs resource-to-performance modeling, predictive optimal reservation estimation, QoS (interference)-aware predictive scheduling to maximize resource efficiency of multi-dimensions (CPU, Memory, Network, Disk I/O), and strictly guarantee service level agreements (SLA) for long-running workloads.
Finally, we introduce a large-scale co-location technique for executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group. It effectively improves cluster utilization from 10% to 50% on average. The effort goes far beyond scheduling and involves the evolution of IDC, network, physical datacenter topology, storage, server hardware, operating systems and containerization. We demonstrate its effectiveness by analyzing the newest Alibaba public cluster trace from 2017. We are the first to reveal a global view of the scenarios, challenges and status of Alibaba's large-scale global datacenters through data, including big promotion events like Double 11.
Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical and necessary to maximize multi-resource efficiency in modern large-scale datacenters, especially for long-running workloads.
Scheduling Periodical Multi-Stage Jobs With Fuzziness to Elastic Cloud Resources
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. [EN] We investigate a workflow scheduling problem with stochastic task arrival times and fuzzy task processing times and due dates. The problem is common in many real-time and workflow-based applications, where tasks with a fixed number of stages and linear dependency are executed on scalable cloud resources with multiple price options. The challenges lie in proposing effective, stable, and robust algorithms under stochastic and fuzzy tasks. A triangular fuzzy number-based model is formulated. Two metrics are explored: the cost and the degree of satisfaction. An iterated heuristic framework is proposed to periodically schedule tasks, which consists of a task collection phase and a fuzzy task scheduling phase. Two task collection strategies are presented and two task prioritization strategies are employed. In order to achieve a high satisfaction degree, deadline constraints are defined at both job and task levels. By designing delicate experiments and applying sophisticated statistical techniques, experimental results show that the proposed algorithm is more effective and robust than the two existing methods. This work was supported by the National Key Research and Development Program of China (No. 2017YFB1400800), the National Natural Science Foundation of China (Nos. 61672297, 61872077, and 61832004), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant No. 18KJB520039) and the National Science Foundation for Post-doctoral Scientists of China (Grant No. 2018M640510). Ruben Ruiz was partially supported by the Spanish Ministry of Science, Innovation, and Universities, under the project "OPTEP-Port Terminal Operations Optimization" (No. RTI2018-094940-B-I00) financed with FEDER funds. The authors would like to thank the anonymous reviewers for their valuable feedback on this work. Zhu, J.; Li, X.; Ruiz García, R.; Li, W.; Huang, H.; Zomaya, AY. (2020). Scheduling Periodical Multi-Stage Jobs With Fuzziness to Elastic Cloud Resources. IEEE Transactions on Parallel and Distributed Systems. 31(12):2819-2833. https://doi.org/10.1109/TPDS.2020.3004134
Platform as a service integration for scientific computing using DIRAC
The demand for computing resources required by researchers grows every day,
and this computing capacity coexists with the growing volume of data currently being generated. These
researchers are demanding a High Performance Computing (HPC) service that
lets them run their simulations with resources delocalized, so that they can
access as many as possible in the way that is most convenient and secure for them.
Universities, for their part, are connected to research centers through networks whose
speed and reliability make it possible to run scientific computing
jobs. The computing capacity available at universities ranges from computer classrooms
used for teaching, laboratories, etc., to computer clusters belonging to
research groups. Using grid and cloud technologies, these heterogeneous computational resources
could be reused by researchers to run simulations, adding
computing capacity to what already exists and distributing the resources across different locations
around the planet. The objective of this thesis is to adapt the DIRAC distributed computing
environment, developed for the LHCb project at CERN, for use by several user
communities, based on cloud and big data technologies. This environment would have centralized
software repositories that provide the necessary software so that, through
cloud environments, researchers' applications can be run anywhere on the
planet in a scalable way, making it possible to take advantage of both dedicated and non-dedicated resources.
The execution of this platform for scientific computations is evaluated in this way.
This work will begin with requirements gathering and then move on to the basic
integration process. Subsequently, the use of the scientific software employed will be optimized for
cloud environments, seeking to adapt it to virtualized environments. To this end, it will be necessary
to carry out a statistical study that is as close as possible to production environments in order
to determine and create the adapted infrastructures, thus avoiding performance loss
within resources. The next case would be to use virtualized technologies, adapting the
architectures created, to build systems that allow the submission of jobs that
require large amounts of data, in the field of big data, in a distributed way.