
    Data-intensive Scheduling

    In many modern data management scenarios, we encounter tasks, operations, or computational phases that are data-intensive: the sheer volume of data is overwhelming to handle and becomes a performance bottleneck. For data-intensive tasks, the bottleneck is data loading, where the cost of loading data into memory outweighs the cost of the actual computation. For data-intensive shuffling, the bottleneck is data transfer, where intermediate data are scattered and shuffled for further processing. This thesis addresses two data-intensive scheduling problems: (1) multi-processor scheduling for data-intensive tasks to reduce redundant data loading; (2) reducer scheduling for data-intensive shuffling to reduce redundant data communication. For data-intensive tasks, we focus on workloads with precedence constraints arising from data dependencies, which are common in applications such as data analytics and ETL processing. These workloads are often known in advance, are presented as directed acyclic graphs (DAGs), and are data-intensive and sensitive to cache misses. We solve the problem of scheduling DAGs of data-intensive tasks on multiple processors or machines so as to minimize execution time. To do so, we propose scheduling algorithms that take cache misses into account. Simulations and an experimental evaluation on a Spark cluster demonstrate the advantages of our solutions in terms of workload completion time. For data-intensive shuffling, we focus on MapReduce-style processing. Communication overhead is incurred in the Shuffle stage, which sends intermediate results from mappers to reducers. We solve the following problem: given a collection of mapper outputs (intermediate key-value pairs) and a partitioning of this collection among the reducers, on which node should each reducer run to minimize data transfer? We reduce two natural formulations of this problem to optimization problems for which polynomial-time solutions exist. We show that our techniques can cut communication costs by 50 percent or more compared to Hadoop's default reducer placement, leading to lower network utilization and faster MapReduce job runtimes.
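    One natural formulation of the reducer-placement problem above, placing each reducer on a node so that total cross-node traffic is minimized, can be cast as a minimum-cost assignment problem, which is solvable in polynomial time by the Hungarian algorithm. The sketch below is illustrative rather than the thesis' actual implementation; the transfer matrix values and the use of SciPy's solver are our assumptions.

        # Illustrative sketch: choose a node for each reducer so that total
        # cross-node shuffle traffic is minimized, cast as a minimum-cost
        # assignment problem and solved with the Hungarian algorithm.
        import numpy as np
        from scipy.optimize import linear_sum_assignment

        # transfer[r][n]: hypothetical bytes reducer r would have to pull over
        # the network if placed on node n (data already local to n excluded).
        transfer = np.array([
            [10, 40, 30],   # reducer 0
            [50,  5, 25],   # reducer 1
            [20, 35, 15],   # reducer 2
        ])

        reducers, nodes = linear_sum_assignment(transfer)
        for r, n in zip(reducers, nodes):
            print(f"reducer {r} -> node {n}")
        print("total bytes shuffled:", transfer[reducers, nodes].sum())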

    Deep Data Locality on Apache Hadoop

    The amount of data being collected in areas such as social media, networks, scientific instruments, mobile devices, and sensors is growing continuously, and the technology to process it is advancing rapidly. One of the fundamental technologies for processing big data is Apache Hadoop, which has been adopted by many commercial products, such as InfoSphere by IBM or Spark as distributed by Cloudera. MapReduce on Hadoop has been widely used in many data science applications. As a dominant big data processing platform, MapReduce on Hadoop has a significant impact on big data processing capability across multiple industries. Most research on improving the speed of big data analysis has targeted Hadoop modules such as Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop Yet Another Resource Negotiator (YARN), and Hadoop MapReduce. In this research, we focus on data locality in HDFS to improve the performance of MapReduce. To reduce the amount of data transfer, MapReduce has long exploited data locality. However, even though the majority of the processing cost occurs in the later stages, data locality has been exploited only in the early stages, which we call Shallow Data Locality (SDL). As a result, the benefit of data locality has not been fully realized. We explore a new concept called Deep Data Locality (DDL), in which the data is pre-arranged to maximize locality in the later stages. Specifically, we introduce two implementation methods of DDL: block-based DDL and key-based DDL. In block-based DDL, the data blocks are pre-arranged to reduce block copying time in two ways. First, Rack-Local Map (RLM) blocks are eliminated: under the conventional default block placement policy (DBPP), data blocks are randomly placed on any available slave node, which makes copying RLM blocks necessary; block-based DDL places blocks so as to avoid RLMs and thereby reduces block copy time. Second, block-based DDL concentrates the blocks on a smaller number of nodes, reducing the data transfer time among them. We analyzed the block distribution with customer review data from TripAdvisor and measured performance with the TeraSort benchmark. Our test results show that the execution times of Map and Shuffle improved by up to 25% and 31%, respectively. In key-based DDL, the input data is divided into several blocks and stored in HDFS before entering the Map stage. In contrast to conventional blocks, which contain random keys, each of our blocks holds a unique key. This requires pre-sorting the key-value pairs, which can be done during the ETL process. It eliminates some data movement in the Map, Shuffle, and Reduce stages and thereby improves performance. In our experiments, MapReduce with key-based DDL ran 21.9% faster than default MapReduce and 13.3% faster than MapReduce with block-based DDL. Additionally, key-based DDL can be combined with other methods to further improve performance: when key-based DDL and block-based DDL are combined, Hadoop performance improves by 34.4%. In this research, we also developed MapReduce workflow models based on a novel computational model, together with a numerical simulator that integrates these models and faithfully predicts Hadoop performance under various conditions.
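    As a rough illustration of the ETL-time pre-sorting that key-based DDL relies on, the sketch below groups key-value pairs so that each output block carries a single key; the record format and the in-memory grouping are our assumptions, not the dissertation's implementation.

        # Illustrative sketch of key-based DDL's ETL-time pre-partitioning:
        # group key-value pairs so that each output block holds one key,
        # letting the later Map, Shuffle, and Reduce stages skip some sorting
        # and data movement. The record format is an assumption.
        from collections import defaultdict

        def partition_by_key(records):
            """records: iterable of (key, value) pairs -> {key: [values]}."""
            blocks = defaultdict(list)
            for key, value in records:
                blocks[key].append(value)
            return blocks

        records = [("hotel_a", 4.5), ("hotel_b", 3.0), ("hotel_a", 4.0)]
        for key, values in sorted(partition_by_key(records).items()):
            # In a real pipeline each group would be written out as its own
            # HDFS block before the Map stage begins.
            print(key, values)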

    Designing, Building, and Modeling Maneuverable Applications within Shared Computing Resources

    Extending the military principle of maneuver into the war-fighting domain of cyberspace, academic and military researchers have produced many theoretical and strategic works, though few have focused on actual applications and systems that apply this principle. We present our research in designing, building, and modeling maneuverable applications in order to gain the system advantages of resource provisioning, application optimization, and cybersecurity improvement. We have coined the phrase "maneuverable applications" for distributed and parallel applications that take advantage of the modification, relocation, addition, or removal of computing resources, giving the perception of movement. Our work with maneuverable applications has been within shared computing resources, such as the Clemson University Palmetto cluster, where multiple users share access and time on a collection of inter-networked computers and servers. In this dissertation, we describe our implementation and analytic modeling of environments and systems that maneuver computational nodes, network capabilities, and security enhancements to overcome challenges to a cyberspace platform. Specifically, we describe our work to create a system that provisions a big data computational resource within academic environments. We also present a computing testbed built to allow researchers to study network optimizations of data centers. We discuss our Petri net model of an adaptable system, which increases its cybersecurity posture in the face of varying levels of threat from malicious actors. Lastly, we present our work on integrating these technologies into a prototype resource manager for maneuverable applications and on validating our model using this implementation.

    Dynamic Workload Balancing and Scheduling in Hadoop MapReduce with Software Defined Networking

    Hadoop offers a platform to process big data. It comprises two main components: the Hadoop Distributed File System (HDFS), a distributed file system for storing data, and MapReduce, which processes that data for users. Hadoop places data according to the space utilization of datanodes, without considering each datanode's processing capability or busy level at run time. Furthermore, datanodes may not be homogeneous, as Hadoop may run in a heterogeneous environment. For these reasons, workload imbalances appear and result in poor performance. We propose a dynamic algorithm that considers the space availability, processing capability, and busy level of datanodes to ensure workload balance between different racks. Our results show that the execution time of moved map tasks is reduced by more than 50%. Furthermore, we propose a method in which Hadoop runs on a software-defined network, further improving performance by allowing fast and adaptable data transfers between racks. By installing OpenFlow switches in place of classical switches in a Hadoop cluster, we can modify the network topology between racks to enlarge the bandwidth when large amounts of data need to be transferred from one rack to another. Our results show that the execution time of moved map tasks is reduced by about 50% when employing our proposed Hadoop cluster Bandwidth Routing algorithm. Apache YARN is the second generation of MapReduce. YARN has three built-in schedulers: the FIFO, Fair, and Capacity Schedulers. Though these schedulers give users different ways to allocate cluster resources to their MapReduce jobs, they do not guarantee that jobs will be executed within a specific deadline. We propose a deadline-constrained scheduler algorithm for Hadoop. This algorithm uses a statistical approach to measure the performance of datanodes, and based on this information it creates several checkpoints to monitor the progress of a job. Based on job progress at every checkpoint, the proposed scheduler assigns jobs to different queues. These queues have different priorities, and the proportion of resources used by each queue depends on its priority. Our experiments show that the proposed scheduler ensures that jobs complete within a given deadline, whereas the native schedulers cannot guarantee this. Moreover, the average job execution time under the proposed scheduler is 56% and 15% lower than under the Fair and EDF schedulers, respectively.
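    As a toy illustration of combining the three datanode signals named above (space availability, processing capability, and busy level), the sketch below ranks datanodes by a weighted score; the weights, normalization, and field names are invented for illustration and are not from the paper.

        # Toy sketch: rank datanodes by a weighted blend of free space,
        # processing capability, and (inverted) busy level. All weights and
        # normalizations are invented for illustration.
        from dataclasses import dataclass

        @dataclass
        class DataNode:
            name: str
            free_space: float   # fraction of disk space free, 0..1
            capability: float   # normalized processing capability, 0..1
            busy_level: float   # fraction of task slots in use, 0..1

        def score(node, w_space=0.3, w_cap=0.4, w_busy=0.3):
            # Higher is better: ample space, strong hardware, light load.
            return (w_space * node.free_space
                    + w_cap * node.capability
                    + w_busy * (1.0 - node.busy_level))

        nodes = [DataNode("dn1", 0.7, 0.5, 0.9), DataNode("dn2", 0.4, 0.9, 0.2)]
        print("best target datanode:", max(nodes, key=score).name)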

    Allocating MapReduce workflows with deadlines to heterogeneous servers in a cloud data center

    Total profit is one of the most important factors considered by resource providers. In this paper, an original MapReduce workflow scheduling method with deadlines and data locality is proposed to maximize the total profit of resource providers. A new workflow conversion based on dynamic programming and ChainMap/ChainReduce is designed to decrease transmission times among the MapReduce jobs of a workflow. A new deadline division that considers execution time, float time, and job level is proposed to obtain better deadlines for the MapReduce jobs in a workflow. With an adapted replica strategy for MapReduce workflows, a new task scheduling approach is proposed to improve data locality; it assigns tasks to the servers with the earliest completion time so that resource providers obtain more profit. Experimental results show that the proposed heuristic yields larger total profit than the other algorithms considered. This work is supported by the National Key Research and Development Program of China (No. 2017YFB1400801), the National Natural Science Foundation of China (Nos. 61872077, 61832004), and the Collaborative Innovation Center of Wireless Communications Technology. Rubén Ruiz is partly supported by the Spanish Ministry of Science, Innovation, and Universities under the project "OPTEP-Port Terminal Operations Optimization" (No. RTI2018-094940-B-I00), financed with FEDER funds.
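    The task-scheduling step above assigns each task to the server with the earliest completion time. A minimal greedy sketch of that rule on heterogeneous servers follows; the server speeds and task sizes are invented for illustration, and this is not the authors' implementation.

        # Greedy earliest-completion-time (ECT) sketch: each task goes to the
        # heterogeneous server that would finish it soonest. Task sizes and
        # server speeds are invented for illustration.
        def ect_schedule(task_sizes, server_speeds):
            finish = [0.0] * len(server_speeds)   # running finish time per server
            placement = []
            for size in task_sizes:
                # Completion time if this task were appended to each server.
                end, best = min((finish[s] + size / server_speeds[s], s)
                                for s in range(len(server_speeds)))
                finish[best] = end
                placement.append(best)
            return placement, finish

        placement, finish = ect_schedule([4, 2, 6, 3], [1.0, 2.0])
        print("task -> server:", placement)
        print("server finish times:", finish)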