5,340 research outputs found

    On Improving Efficiency of Data-Intensive Applications in Geo-Distributed Environments

    Get PDF
    Distributed systems are pervasively demanded and adopted in nowadays for processing data-intensive workloads since they greatly accelerate large-scale data processing with scalable parallelism and improved data locality. Traditional distributed systems initially targeted computing clusters but have since evolved to data centers with multiple clusters. These systems are mostly built on top of homogeneous, tightly integrated resources connected in high-speed local-area networks (LANs), and typically require data to be ingested to a central data center for processing. Today, with enormous volumes of data continuously generated from geographically distributed locations, direct adoption of such systems is prohibitively inefficient due to the limited system scalability and high cost for centralizing the geo-distributed data over the wide-area networks (WANs). More commonly, it becomes a trend to build geo-distributed systems wherein data processing jobs are performed on top of geo-distributed, heterogeneous resources in proximity to the data at vastly distributed geo-locations. However, critical challenges and mechanisms for efficient execution of data-intensive applications in such geo-distributed environments are unclear by far. The goal of this dissertation is to identify such challenges and mechanisms, by extensively using the research principles and methodology of conventional distributed systems to investigate the geo-distributed environment, and by developing new techniques to tackle these challenges and run data-intensive applications with efficiency at scale. The contributions of this dissertation are threefold. Firstly, the dissertation shows that the high level of resource heterogeneity exhibited in the geo-distributed environment undermines the scalability of geo-distributed systems. Virtualization-based resource abstraction mechanisms have been introduced to abstract the hardware, network, and OS resources throughout the system, to mitigate the underlying resource heterogeneity and enhance the system scalability. Secondly, the dissertation reveals the overwhelming performance and monetary cost incurred by indulgent data sharing over the WANs in geo-distributed systems. Network optimization approaches, including linear- programming-based global optimization, greedy bin-packing heuristics, and TCP enhancement, are developed to optimize the network resource utilization and circumvent unnecessary expenses imposed on data sharing in WANs. Lastly, the dissertation highlights the importance of data locality for data-intensive applications running in the geo-distributed environment. Novel data caching and locality-aware scheduling techniques are devised to improve the data locality.Doctor of Philosoph

    Allocating MapReduce workflows with deadlines to heterogeneous servers in a cloud data center

    Full text link
    [EN] Total profit is one of the most important factors to be considered from the perspective of resource providers. In this paper, an original MapReduce workflow scheduling with deadline and data locality is proposed to maximize total profit of resource providers. A new workflow conversion based on dynamic programming and ChainMap/ChainReduce is designed to decrease transmission times among MapReduce jobs of workflows. A new deadline division considering execution time, float time and job level is proposed to obtain better deadlines of MapReduce jobs in workflows. With the adapted replica strategy in MapReduce workflow, a new task scheduling is proposed to improve data locality which assigns tasks to servers with the earliest completion time in order to ensure resource providers obtain more profit. Experimental results show that the proposed heuristic results in larger total profit than other adopted algorithms.This work is supported by the National Key Research and Development Program of China (No. 2017YFB1400801), the National Natural Science Foundation of China (Nos. 61872077, 61832004) and Collaborative Innovation Center of Wireless Communications Technology. Rubén Ruiz is partly supported by the Spanish Ministry of Science, Innovation, and Universities, under the project ¿OPTEP-Port Terminal Operations Optimization¿ (No. RTI2018-094940-B-I00) financed with FEDER funds¿.Wang, J.; Li, X.; Ruiz García, R.; Xu, H.; Chu, D. (2020). Allocating MapReduce workflows with deadlines to heterogeneous servers in a cloud data center. Service Oriented Computing and Applications. 14(2):101-118. https://doi.org/10.1007/s11761-020-00290-1S101118142Zaharia M, Chowdhury M, Franklin M et al (2010) Spark: cluster computing with working sets. In: Usenix conference on hot topics in cloud computing, pp 1765–1773Li L, Ma Z, Liu L et al (2013) Hadoop-based ARIMA algorithm and its application in weather forecast. Int J Database Theory Appl 6(5):119–132Xun Y, Zhang J, Qin X (2017) FiDoop: parallel mining of frequent itemsets using MapReduce. IEEE Trans Syst Man Cybern Syst 46(3):313–325Wang Y, Shi W (2014) Budget-driven scheduling algorithms for batches of MapReduce jobs in heterogeneous clouds. IEEE Trans Cloud Comput 2(3):306–319Tiwari N, Sarkar S, Bellur U et al (2015) Classification framework of MapReduce scheduling algorithms. ACM Comput Surv 47(3):1–49Bu Y, Howe B, Balazinska M et al (2012) The HaLoop approach to large-scale iterative data analysis. VLDB J 21(2):169–190Gunarathne T, Zhang B, Wu T et al (2013) Scalable parallel computing on clouds using Twister4Azure iterative MapReduce. Future Gener Comput Syst 29(4):1035–1048Zhang Y, Gao Q, Gao L et al (2012) iMapReduce: a distributed computing framework for iterative computation. J Grid Comput 10(1):47–68Dong X, Wang Y, Liao H (2011) Scheduling mixed real-time and non-real-time applications in MapReduce environment. In: International conference on parallel and distributed systems, pp 9–16Tang Z, Zhou J, Li K et al (2013) A MapReduce task scheduling algorithm for deadline constraints. Clust Comput 16(4):651–662Zhang W, Rajasekaran S, Wood T et al (2014) MIMP: deadline and interference aware scheduling of Hadoop virtual machines. In: International symposium on cluster, cloud and grid computing, pp 394–403Teng F, Magoulès F, Yu L et al (2014) A novel real-time scheduling algorithm and performance analysis of a MapReduce-based cloud. J Supercomput 69(2):739–765Palanisamy B, Singh A, Liu L (2015) Cost-effective resource provisioning for MapReduce in a cloud. IEEE Trans Parallel Distrib Syst 26(5):1265–1279Hashem I, Anuar N, Marjani M et al (2018) Multi-objective scheduling of MapReduce jobs in big data processing. Multimed Tools Appl 77(8):9979–9994Xu X, Tang M, Tian Y (2017) QoS-guaranteed resource provisioning for cloud-based MapReduce in dynamical environments. Future Gener Comput Syst 78(1):18–30Li H, Wei X, Fu Q et al (2014) MapReduce delay scheduling with deadline constraint. Concurr Comput Pract Exp 26(3):766–778Polo J, Becerra Y, Carrera D et al (2013) Deadline-based MapReduce workload management. IEEE Trans Netw Serv Manag 10(2):231–244Chen C, Lin J, Kuo S (2018) MapReduce scheduling for deadline-constrained jobs in heterogeneous cloud computing systems. IEEE Trans Cloud Comput 6(1):127–140Kao Y, Chen Y (2016) Data-locality-aware MapReduce real-time scheduling framework. J Syst Softw 112:65–77Bok K, Hwang J, Lim J et al (2017) An efficient MapReduce scheduling scheme for processing large multimedia data. Multimed Tools Appl 76(16):1–24Chen Y, Borthakur D, Borthakur D et al (2012) Energy efficiency for large-scale MapReduce workloads with significant interactive analysis. In: ACM european conference on computer systems, pp 43–56Mashayekhy L, Nejad M, Grosu D et al (2015) Energy-aware scheduling of MapReduce jobs for big data applications. IEEE Trans Parallel Distrib Syst 26(10):2720–2733Lei H, Zhang T, Liu Y et al (2015) SGEESS: smart green energy-efficient scheduling strategy with dynamic electricity price for data center. J Syst Softw 108:23–38Oliveira D, Ocana K, Baiao F et al (2012) A provenance-based adaptive scheduling heuristic for parallel scientific workflows in clouds. J Grid Comput 10(3):521–552Li S, Hu S, Abdelzaher T (2015) The packing server for real-time scheduling of MapReduce workflows. In: IEEE real-time and embedded technology and applications symposium, pp 51–62Cai Z, Li X, Ruiz R et al (2017) A delay-based dynamic scheduling algorithm for bag-of-task workflows with stochastic task execution times in clouds. Future Gener Comput Syst 71:57–72Cai Z, Li X, Ruiz R (2017) Resource provisioning for task-batch based workflows with deadlines in public clouds. IEEE Trans Cloud Comput. https://doi.org/10.1109/TCC.2017.2663426Cai Z, Li X, Gupta J (2016) Heuristics for provisioning services to workflows in XaaS clouds. IEEE Trans Serv Comput 9(2):250–263Li X, Cai Z (2017) Elastic resource provisioning for cloud workflow applications. IEEE Trans Autom Sci Eng 14(2):1195–1210Tang Z, Liu M, Ammar A et al (2014) An optimized MapReduce workflow scheduling algorithm for heterogeneous computing. J Supercomput 72(6):1–21Xu C, Yang J, Yin K et al (2017) Optimal construction of virtual networks for cloud-based MapReduce workflows. Comput Netw 112:194–207Chiara S, Danilo A, Gianpaolo C et al (2013) Optimizing service selection and allocation in situational computing applications. IEEE Trans Serv Comput 6(3):414–428Baresi L, Elisabetta D, Carlo G et al (2007) A framework for the deployment of adaptable web service compositions. Serv Oriented Comput Appl 1(1):75–91Lim H, Herodotou H, Babu S (2012) Stubby: a transformation-based optimizer for MapReduce workflows. VLDB Endow 5(11):1196–1207Ke H, Li P, Guo S et al (2016) On traffic-aware partition and aggregation in MapReduce for big data applications. IEEE Trans Parallel Distrib Syst 27(3):818–828Yu W, Wang Y, Que X et al (2015) Virtual shuffling for efficient data movement in MapReduce. IEEE Trans Comput 64(2):556–568Chowdhury M, Zaharia M, Ma J et al (2011) Managing data transfers in computer clusters with orchestra. ACM SIGCOMM Comput Commun 41(4):98–109Guo D, Xie J, Zhou X et al (2015) Exploiting efficient and scalable shuffle transfers in future data center network. IEEE Trans Parallel Distrib Syst 26(4):997–1009Li D, Yu Y, He W et al (2015) Willow: saving data center network energy for network-limited flows. IEEE Trans Parallel Distrib Syst 26(9):2610–2620Tan J, Meng X, Zhang L (2013) Coupling task progress for MapReduce resource-aware scheduling. In: IEEE INFOCOM, pp 1618–1626Hammoud M, Rehman M, Sakr M (2012) Center-of-gravity reduce task scheduling to lower MapReduce network traffic. In: International conference on cloud computing, pp 49–58Guo Z, Fox G, Zhou M et al (2012) Improving resource utilization in MapReduce. In: International conference on cluster computing, pp 402–410Fischer M, Su X, Yin Y (2010) Assigning tasks for efficiency in Hadoop. In: Proceedings of the 22nd ACM symposium on parallelism in algorithms and architectures, pp 30–39Zhu Y, Jiang Y, Wu W et al (2014) Minimizing makespan and total completion time in MapReduce-like systems. In: IEEE INFOCOM, pp 2166–2174Kavulya S, Tan J, Gandhi R et al (2010) An analysis of traces from a production MapReduce cluster. In: IEEE/ACM international conference on cluster, cloud and grid computing, pp 94–103Abrishami S, Naghibzadeh M, Epema D (2013) Deadline-constrained workflow scheduling algorithms for Infrastructure as a Service clouds. Future Gener Comput Syst 29(1):158–169Fernando B, Edmundo R (2010) Towards the scheduling of multiple workflows on computational grids. J Grid Comput 8(3):419–441Tiwari N, Sarkar S, Bellur U et al (2015) Classification framework of MapReduce scheduling algorithms. ACM Comput Surv 47(3):1–38Verma A, Cherkasova L, Campbell R (2013) Orchestrating an ensemble of MapReduce jobs for minimizing their makespan. IEEE Trans Dependable Secur Comput 10(5):314–327Heintz B, Chandra A, Sitaraman R et al (2017) End-to-end optimization for geo-distributed MapReduce. IEEE Trans Cloud Comput 4(3):293–306Chen L, Li X (2018) Cloud workflow scheduling with hybrid resource provisioning. J Supercomput 74(12):6529–6553Li X, Jiang T, Ruiz R (2016) Heuristics for periodical batch job scheduling in a MapReduce computing framework. Inf Sci 326:119–133Vanhoucheabcd M, Maenhout B, Tavares L (2008) An evaluation of the adequacy of project network generators with systematically sampled networks. Eur J Oper Res 187(2):511–52

    Scalable and Distributed Resource Management Protocols for Cloud and Big Data Clusters

    Get PDF
    Cloud data centers require an operating system to manage resources and satisfy operational requirements and management objectives. The growth of popularity in cloud services causes the appearance of a new spectrum of services with sophisticated workload and resource management requirements. Also, data centers are growing by addition of various type of hardware to accommodate the ever-increasing requests of users. Nowadays a large percentage of cloud resources are executing data-intensive applications which need continuously changing workload fluctuations and specific resource management. To this end, cluster computing frameworks are shifting towards distributed resource management for better scalability and faster decision making. Such systems benefit from the parallelization of control and are resilient to failures. Throughout this thesis we investigate algorithms, protocols and techniques to address these challenges in large-scale data centers. We introduce a distributed resource management framework which consolidates virtual machine to as few servers as possible to reduce the energy consumption of data center and hence decrease the cost of cloud providers. This framework can characterize the workload of virtual machines and hence handle trade-off energy consumption and Service Level Agreement (SLA) of customers efficiently. The algorithm is highly scalable and requires low maintenance cost with dynamic workloads and it tries to minimize virtual machines migration costs. We also introduce a scalable and distributed probe-based scheduling algorithm for Big data analytics frameworks. This algorithm can efficiently address the problem job heterogeneity in workloads that has appeared after increasing the level of parallelism in jobs. The algorithm is massively scalable and can reduce significantly average job completion times in comparison with the-state of-the-art. Finally, we propose a probabilistic fault-tolerance technique as part of the scheduling algorithm

    Network flow optimization for distributed clouds

    Get PDF
    Internet applications, which rely on large-scale networked environments such as data centers for their back-end support, are often geo-distributed and typically have stringent performance constraints. The interconnecting networks, within and across data centers, are critical in determining these applications' performance. Data centers can be viewed as composed of three layers: physical infrastructure consisting of servers, switches, and links, control platforms that manage the underlying resources, and applications that run on the infrastructure. This dissertation shows that network flow optimization can improve performance of distributed applications in the cloud by designing high-throughput schemes spanning all three layers. At the physical infrastructure layer, we devise a framework for measuring and understanding throughput of network topologies. We develop a heuristic for estimating the worst-case performance of any topology and propose a systematic methodology for comparing performance of networks built with different equipment. At the control layer, we put forward a source-routed data center fabric which can achieve near-optimal throughput performance by leveraging a large number of available paths while using limited memory in switches. At the application layer, we show that current Application Network Interfaces (ANIs), abstractions that translate an application's performance goals to actionable network objectives, fail to capture the requirements of many emerging applications. We put forward a novel ANI that can capture application intent more effectively and quantify performance gains achievable with it. We also tackle resource optimization in the inter-data center context of cellular providers. In this emerging environment, a large amount of resources are geographically fragmented across thousands of micro data centers, each with a limited share of resources, necessitating cross-application optimization to satisfy diverse performance requirements and improve network and server utilization. Our solution, Patronus, employs hierarchical optimization for handling multiple performance requirements and temporally partitioned scheduling for scalability

    Task Scheduling in Big Data Platforms: A Systematic Literature Review

    Get PDF
    Context: Hadoop, Spark, Storm, and Mesos are very well known frameworks in both research and industrial communities that allow expressing and processing distributed computations on massive amounts of data. Multiple scheduling algorithms have been proposed to ensure that short interactive jobs, large batch jobs, and guaranteed-capacity production jobs running on these frameworks can deliver results quickly while maintaining a high throughput. However, only a few works have examined the effectiveness of these algorithms. Objective: The Evidence-based Software Engineering (EBSE) paradigm and its core tool, i.e., the Systematic Literature Review (SLR), have been introduced to the Software Engineering community in 2004 to help researchers systematically and objectively gather and aggregate research evidences about different topics. In this paper, we conduct a SLR of task scheduling algorithms that have been proposed for big data platforms. Method: We analyse the design decisions of different scheduling models proposed in the literature for Hadoop, Spark, Storm, and Mesos over the period between 2005 and 2016. We provide a research taxonomy for succinct classification of these scheduling models. We also compare the algorithms in terms of performance, resources utilization, and failure recovery mechanisms. Results: Our searches identifies 586 studies from journals, conferences and workshops having the highest quality in this field. This SLR reports about different types of scheduling models (dynamic, constrained, and adaptive) and the main motivations behind them (including data locality, workload balancing, resources utilization, and energy efficiency). A discussion of some open issues and future challenges pertaining to improving the current studies is provided

    Optimizing Resource Management in Cloud Analytics Services

    Get PDF
    The fundamental challenge in the cloud today is how to build and optimize machine learning and data analytical services. Machine learning and data analytical platforms are changing computing infrastructure from expensive private data centers to easily accessible online services. These services pack user requests as jobs and run them on thousands of machines in parallel in geo-distributed clusters. The scale and the complexity of emerging jobs lead to increasing challenges for the clusters at all levels, from power infrastructure to system architecture and corresponding software framework design. These challenges come in many forms. Today's clusters are built on commodity hardware and hardware failures are unavoidable. Resource competition, network congestion, and mixed generations of hardware make the hardware environment complex and hard to model and predict. Such heterogeneity becomes a crucial roadblock for efficient parallelization on both the task level and job level. Another challenge comes from the increasing complexity of the applications. For example, machine learning services run jobs made up of multiple tasks with complex dependency structures. This complexity leads to difficulties in framework designs. The scale, especially when services span geo-distributed clusters, leads to another important hurdle for cluster design. Challenges also come from the power infrastructure. Power infrastructure is very expensive and accounts for more than 20% of the total costs to build a cluster. Power sharing optimization to maximize the facility utilization and smooth peak hour usages is another roadblock for cluster design. In this thesis, we focus on solutions for these challenges at the task level, on the job level, with respect to the geo-distributed data cloud design and for power management in colocation data centers. At the task level, a crucial hurdle to achieving predictable performance is stragglers, i.e., tasks that take significantly longer than expected to run. At this point, speculative execution has been widely adopted to mitigate the impact of stragglers in simple workloads. We apply straggler mitigation for approximation jobs for the first time. We present GRASS, which carefully uses speculation to mitigate the impact of stragglers in approximation jobs. GRASS's design is based on the analysis of a model we develop to capture the optimal speculation levels for approximation jobs. Evaluations with production workloads from Facebook and Microsoft Bing in an EC2 cluster of 200 nodes show that GRASS increases accuracy of deadline-bound jobs by 47% and speeds up error-bound jobs by 38%. Moving from task level to job level, task level speculation mechanisms are designed and operated independently of job scheduling when, in fact, scheduling a speculative copy of a task has a direct impact on the resources available for other jobs. Thus, we present Hopper, a job-level speculation-aware scheduler that integrates the tradeoffs associated with speculation into job scheduling decisions based on a model generalized from the task-level speculation model. We implement both centralized and decentralized prototypes of the Hopper scheduler and show that 50% (66%) improvements over state-of-the-art centralized (decentralized) schedulers and speculation strategies can be achieved through the coordination of scheduling and speculation. As computing resources move from local clusters to geo-distributed cloud services, we are expecting the same transformation for data storage. We study two crucial pieces of a geo-distributed data cloud system: data acquisition and data placement. Starting from developing the optimal algorithm for the case of a data cloud made up of a single data center, we propose a near-optimal, polynomial-time algorithm for a geo-distributed data cloud in general. We show, via a case study, that the resulting design, Datum, is near-optimal (within 1.6%) in practical settings. Efficient power management is a fundamental challenge for data centers when providing reliable services. Power oversubscription in data centers is very common and may occasionally trigger an emergency when the aggregate power demand exceeds the capacity. We study power capping solutions for handling such emergencies in a colocation data center, where the operator supplies power to multiple tenants. We propose a novel market mechanism based on supply function bidding, called COOP, to financially incentivize and coordinate tenants' power reduction for minimizing total performance loss while satisfying multiple power capping constraints. We demonstrate that COOP is "win-win", increasing the operator's profit (through oversubscription) and reducing tenants' costs (through financial compensation for their power reduction during emergencies).</p

    Geo-distributed Edge and Cloud Resource Management for Low-latency Stream Processing

    Get PDF
    The proliferation of Internet-of-Things (IoT) devices is rapidly increasing the demands for efficient processing of low latency stream data generated close to the edge of the network. Edge Computing provides a layer of infrastructure to fill latency gaps between the IoT devices and the back-end cloud computing infrastructure. A large number of IoT applications require continuous processing of data streams in real-time. Edge computing-based stream processing techniques that carefully consider the heterogeneity of the computing and network resources available in the geo-distributed infrastructure provide significant benefits in optimizing the throughput and end-to-end latency of the data streams. Managing geo-distributed resources operated by individual service providers raises new challenges in terms of effective global resource sharing and achieving global efficiency in the resource allocation process. In this dissertation, we present a distributed stream processing framework that optimizes the performance of stream processing applications through a careful allocation of computing and network resources available at the edge of the network. The proposed approach differentiates itself from the state-of-the-art through its careful consideration of data locality and resource constraints during physical plan generation and operator placement for the stream queries. Additionally, it considers co-flow dependencies that exist between the data streams to optimize the network resource allocation through an application-level rate control mechanism. The proposed framework incorporates resilience through a cost-aware partial active replication strategy that minimizes the recovery cost when applications incur failures. The framework employs a reinforcement learning-based online learning model for dynamically determining the level of parallelism to adapt to changing workload conditions. The second dimension of this dissertation proposes a novel model for allocating computing resources in edge and cloud computing environments. In edge computing environments, it allows service providers to establish resource sharing contracts with infrastructure providers apriori in a latency-aware manner. In geo-distributed cloud environments, it allows cloud service providers to establish resource sharing contracts with individual datacenters apriori for defined time intervals in a cost-aware manner. Based on these mechanisms, we develop a decentralized implementation of the contract-based resource allocation model for geo-distributed resources using Smart Contracts in Ethereum
    • …
    corecore