122 research outputs found

    Modeling and simulation of data-driven applications in SDN-aware environments

    PhD Thesis. The popularity of Software-Defined Networking (SDN) is rising as it promises a window of opportunity and new features in terms of network performance, configuration, and management. As such, SDN is exploited by several emerging applications and environments, such as cloud computing, edge computing, IoT, and data-driven applications. Although SDN has demonstrated significant improvements in industry, little research has explored embracing SDN for cross-layer optimization in different SDN-aware environments. Each application and computing environment requires different functionalities and different Quality of Service (QoS) requirements. For example, a typical MapReduce application requires data transmission at three different times, whereas the data transmission of stream-based applications is unknown in advance due to uncertainty about the number of required tasks and the dependencies among stream tasks. As such, deployments of SDN with different applications are not identical, and they require different deployment strategies and algorithms to meet different QoS requirements (e.g., high bandwidth, deadlines). Further, each application and environment has a unique architecture, which imposes different forms of complexity in terms of computing, storage, and networking. Due to such complexities, finding optimal solutions for SDN-aware applications and environments becomes very challenging. Therefore, this thesis presents multilateral research towards optimization, modeling, and simulation of cross-layer optimization of SDN-aware applications and environments. Several tools and algorithms have been proposed, implemented, and evaluated, considering various environments and applications [1–4]. The main contributions of this thesis are as follows: • Proposing and modeling a new holistic framework that simulates MapReduce applications, big data management systems (BDMS), and SDN-aware networks in cloud-based environments; theoretical and mathematical models of MapReduce in SDN-aware cloud datacenters are also proposed. Funded by the government of Saudi Arabia, represented by Saudi Electronic University (SEU) and the Royal Embassy of Saudi Arabia Cultural Bureau.
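The modeling claim above centers on the observation that a MapReduce job transmits data at three points (input load, shuffle, and output write). As a purely illustrative sketch of that idea, and not the thesis's actual mathematical model, the Python snippet below estimates a job's aggregate network time from the data volume of each stage and the available bandwidth; the function names and the simple bottleneck-bandwidth assumption are mine.

```python
# Illustrative sketch only: the three MapReduce data-transmission stages mentioned
# above (input load, shuffle, output write) modeled as transfers over a shared
# aggregate bandwidth. Names and assumptions are illustrative, not the thesis's model.

def transfer_time(data_gb: float, bandwidth_gbps: float) -> float:
    """Seconds to move data_gb of data over an aggregate bandwidth of bandwidth_gbps."""
    return data_gb * 8.0 / bandwidth_gbps  # GB -> Gb, then divide by Gb/s

def mapreduce_network_time(input_gb, shuffle_gb, output_gb, bandwidth_gbps):
    # Stage 1: load input splits to mappers.
    # Stage 2: shuffle intermediate map output to reducers.
    # Stage 3: write reduce output back to distributed storage.
    return sum(transfer_time(d, bandwidth_gbps)
               for d in (input_gb, shuffle_gb, output_gb))

if __name__ == "__main__":
    # 100 GB input, 40 GB shuffle, 10 GB output on an aggregate 10 Gbps fabric.
    print(mapreduce_network_time(100, 40, 10, bandwidth_gbps=10), "seconds")  # 120.0
```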

    Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters

    Cloud computing is becoming a fundamental facility of modern society. Large-scale public and private cloud datacenters comprising millions of servers, operating as warehouse-scale computers, support most of the business of Fortune 500 companies and serve billions of users around the world. Unfortunately, the industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only hurts the operational and capital components of cost efficiency, but also becomes a scaling bottleneck due to the limits on electricity delivered by nearby utilities. It is therefore critical, and challenging, to improve multi-resource efficiency in global datacenters. Additionally, with the great commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous computation workloads, including online web services, batch processing, machine learning, stream computing, interactive query, and graph computation, on shared clusters. Most of these are long-running workloads that use long-lived containers to execute tasks. We survey datacenter resource scheduling work from the last 15 years. Most previous systems are designed to maximize cluster efficiency for short-lived tasks in batch-processing systems such as Hadoop; they are not suitable for modern long-running workloads of microservices, Spark, Flink, Pregel, Storm, or TensorFlow-like systems. New, effective scheduling and resource allocation approaches are urgently needed to improve efficiency in large-scale enterprise datacenters. In this dissertation, we are the first to define and identify the problems, challenges, and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenters. These workloads rely on predictive scheduling techniques to perform reservation, auto-scaling, migration, or rescheduling, which pushes us to pursue more intelligent scheduling techniques backed by adequate predictive knowledge. We specify what intelligent scheduling is, which capabilities are necessary to achieve it, and how to leverage it to transform NP-hard online scheduling problems into solvable offline scheduling problems. We designed and implemented an intelligent cloud datacenter scheduler that automatically performs resource-to-performance modeling, predictive optimal reservation estimation, and QoS (interference)-aware predictive scheduling to maximize resource efficiency across multiple dimensions (CPU, memory, network, disk I/O) while strictly guaranteeing service level agreements (SLAs) for long-running workloads. Finally, we introduce a large-scale co-location technique for executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group, which effectively improves cluster utilization from 10% to an average of 50%. This goes far beyond scheduling alone and involves technique evolution across IDC, network, physical datacenter topology, storage, server hardware, operating systems, and containerization. We demonstrate its effectiveness by analyzing the latest Alibaba public cluster trace, from 2017, and we are the first to reveal a global view of the scenarios, challenges, and status of Alibaba's large-scale global datacenters through data, including big promotion events such as Double 11. Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical and necessary to pursue maximized multi-resource efficiency in modern large-scale datacenters, especially for long-running workloads.
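One concrete way to read the "predictive optimal reservation estimation" mentioned above is to size a long-running container's reservation from its observed usage rather than its static request. The sketch below is a minimal, hypothetical illustration of that idea; the percentile, headroom factor, and function name are my assumptions, not the dissertation's scheduler.

```python
# Minimal sketch (my own simplification, not the dissertation's scheduler): reserve
# resources for a long-running container at a high percentile of its recent usage,
# plus headroom, instead of at its static request.
from statistics import quantiles

def predictive_reservation(usage_history, percentile=0.95, headroom=1.1, request=None):
    """Return a reservation based on observed usage; never exceed the original request."""
    if len(usage_history) < 10:          # too little history: fall back to the request
        return request
    p = quantiles(usage_history, n=100)[int(percentile * 100) - 1]
    reservation = p * headroom
    return min(reservation, request) if request is not None else reservation

# Example: a service requested 16 cores but rarely uses more than ~6.
history = [4.2, 5.1, 3.8, 5.9, 6.0, 4.7, 5.5, 4.9, 5.2, 5.8, 6.1, 4.4]
print(predictive_reservation(history, request=16.0))   # roughly 6.7 cores reserved
```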

    Intelligent Straggler Mitigation in Massive-Scale Computing Systems

    In order to satisfy increasing demands for Cloud services, modern computing systems are often massive in scale, typically consisting of hundreds to thousands of heterogeneous machine nodes. Parallel computing frameworks such as MapReduce are widely deployed over such cluster infrastructure to provide reliable yet prompt services to customers. However, complex characteristics of Cloud workloads, including multi-dimensional resource requirements and highly changeable system environments (e.g. dynamic node performance), are introducing new challenges to service providers in terms of both customer experience and system efficiency. One primary challenge is the straggler problem, whereby a small subset of the parallelized tasks takes an abnormally long execution time compared with its siblings, leading to extended job response times and potential late-timing failures. The state-of-the-art approach to straggler mitigation is speculative execution. Although it has been deployed in several real-world systems with a variety of implementation optimizations, the analysis in this thesis shows that speculative execution is often inefficient: according to various production tracelogs of data centers, its failure rate can be as high as 71%. Straggler mitigation is a complicated problem in its own right: 1) stragglers may lead to different consequences for parallel job execution, possibly with different degrees of severity; 2) whether a task should be regarded as a straggler is highly subjective, depending upon different application and system conditions; 3) the efficiency of speculative execution would be improved if dynamic node performance could be modelled and predicted appropriately; and 4) there are other types of stragglers, e.g. those caused by data skew, that are beyond the capability of speculative execution. This thesis starts with a quantitative and rigorous analysis of stragglers, including their root causes and impacts, the execution environments running them, and the limitations of their mitigation. Scientific principles of straggler mitigation are investigated and new algorithms are developed. An intelligent system for straggler mitigation is then designed and developed, compatible with the majority of current parallel computing frameworks. Combined with historical data analysis and online adaptation, the system is capable of mitigating stragglers intelligently: dynamically judging whether a task is a straggler and handling it, avoiding currently weak nodes, and dealing with data skew, a special type of straggler, with a dedicated method. Comprehensive analysis and evaluation show that the system is able to reduce job response time by up to 55% compared with the speculator used in the default YARN system, while the optimal improvement a speculation-based method may achieve is around 66% in theory. The system also achieves a much higher speculation success rate than other production systems, up to 89%.
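The abstract notes that judging a task to be a straggler is highly subjective and should adapt to application and system conditions. The snippet below is a toy illustration of one common style of adaptive judgement, comparing each task's progress rate against the median of its siblings; the 1.5x slow factor, the 30-second warm-up, and the function name are illustrative assumptions, not the thesis's algorithm.

```python
# Toy sketch of adaptive straggler detection (thresholds and names are illustrative
# assumptions, not the thesis's method): flag a task when its progress rate falls
# well below the median progress rate of its sibling tasks.
from statistics import median

def find_stragglers(progress, elapsed, slow_factor=1.5, min_elapsed=30.0):
    """progress/elapsed: dicts task_id -> fraction done / seconds running."""
    rates = {t: progress[t] / elapsed[t] for t in progress if elapsed[t] > 0}
    med = median(rates.values())
    return [t for t, r in rates.items()
            if elapsed[t] >= min_elapsed          # ignore tasks that just started
            and r < med / slow_factor]            # markedly slower than siblings

tasks_progress = {"t1": 0.90, "t2": 0.85, "t3": 0.20, "t4": 0.88}
tasks_elapsed  = {"t1": 100,  "t2": 100,  "t3": 100,  "t4": 100}
print(find_stragglers(tasks_progress, tasks_elapsed))   # ['t3']
```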

    Task Scheduling in Big Data Platforms: A Systematic Literature Review

    Context: Hadoop, Spark, Storm, and Mesos are well-known frameworks in both the research and industrial communities that allow expressing and processing distributed computations on massive amounts of data. Multiple scheduling algorithms have been proposed to ensure that short interactive jobs, large batch jobs, and guaranteed-capacity production jobs running on these frameworks can deliver results quickly while maintaining high throughput. However, only a few works have examined the effectiveness of these algorithms. Objective: The Evidence-based Software Engineering (EBSE) paradigm and its core tool, the Systematic Literature Review (SLR), were introduced to the Software Engineering community in 2004 to help researchers systematically and objectively gather and aggregate research evidence about different topics. In this paper, we conduct an SLR of task scheduling algorithms that have been proposed for big data platforms. Method: We analyse the design decisions of different scheduling models proposed in the literature for Hadoop, Spark, Storm, and Mesos over the period 2005 to 2016. We provide a research taxonomy for succinct classification of these scheduling models. We also compare the algorithms in terms of performance, resource utilization, and failure-recovery mechanisms. Results: Our search identified 586 studies of the highest quality in this field from journals, conferences, and workshops. This SLR reports on the different types of scheduling models (dynamic, constrained, and adaptive) and the main motivations behind them (including data locality, workload balancing, resource utilization, and energy efficiency). A discussion of open issues and future challenges pertaining to improving the current studies is also provided.

    Service level agreement specification for IoT application workflow activity deployment, configuration and monitoring

    PhD Thesis. Currently, we see the Internet of Things (IoT) used within various domains such as healthcare, smart homes, smart cars, smart-x applications, and smart cities. The number of applications based on IoT and cloud computing is projected to increase rapidly over the next few years. IoT-based services must meet guaranteed levels of quality of service (QoS) to match users’ expectations, so ensuring QoS by specifying QoS constraints in service level agreements (SLAs) is crucial. Also, because of the potentially highly complex nature of multi-layered IoT applications, lifecycle management (deployment, dynamic reconfiguration, and monitoring) needs to be automated. To achieve this, it is essential to be able to specify SLAs in a machine-readable format; however, currently available SLA specification languages are unable to accommodate the unique characteristics of the IoT domain (the interdependency of its multiple layers). Therefore, in this research we propose a grammar for the syntactical structure of an SLA specification for IoT. The grammar is based on a proposed conceptual model covering the main concepts that can be used to express the requirements of the most common hardware and software components of an IoT application on an end-to-end basis. We follow the Goal Question Metric (GQM) approach to evaluate the generality and expressiveness of the proposed grammar by reviewing its concepts and their predefined vocabularies against two use-cases with a number of participants whose research interests are mainly related to IoT. The results of the analysis show that the proposed grammar achieved 91.70% of its generality goal and 93.43% of its expressiveness goal. To enhance the process of specifying SLA terms, we then developed a toolkit for creating SLA specifications for IoT applications, which simplifies the process of capturing the requirements of IoT applications. We demonstrate the effectiveness of the toolkit using a remote health monitoring service (RHMS) use-case and evaluate its user experience through a questionnaire-oriented approach. We discuss the applicability of the toolkit by including it as a core component of two different applications: 1) a context-aware recommender system for IoT configuration across layers; and 2) a tool for automatically translating an SLA from JSON to a smart contract and deploying it on different peer nodes that represent the contractual parties; the smart contract is able to monitor the created SLA using Blockchain technology. These two applications are utilized within our proposed SLA management framework for IoT. Furthermore, we propose a greedy heuristic algorithm to decentralize the workflow activities of an IoT application across Edge and Cloud resources in order to improve response time, cost, energy consumption, and network usage. We evaluated the efficiency of the proposed approach using the iFogSim simulator. The performance analysis shows that the proposed algorithm minimizes cost, execution time, network usage, and Cloud energy consumption compared to Cloud-only and edge-ward placement approaches.
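As a rough illustration of the kind of greedy Edge/Cloud placement heuristic described above, the sketch below places each workflow activity on the feasible node with the lowest weighted cost over latency, monetary cost, and energy. The node model, weights, and capacity check are assumptions made for the example and do not reproduce the algorithm actually evaluated in iFogSim.

```python
# Minimal greedy placement sketch in the spirit of the Edge/Cloud decentralisation
# step above; cost weights, node model and capacity check are illustrative only.

def greedy_place(activities, nodes, w_latency=0.5, w_cost=0.3, w_energy=0.2):
    """activities: dicts with 'name' and 'cpu'; nodes: dicts with 'name', 'free_cpu',
    'latency', 'cost', 'energy'. Returns {activity_name: node_name}."""
    placement = {}
    for act in sorted(activities, key=lambda a: a["cpu"], reverse=True):  # biggest first
        feasible = [n for n in nodes if n["free_cpu"] >= act["cpu"]]
        if not feasible:
            raise RuntimeError(f"no node can host {act['name']}")
        best = min(feasible, key=lambda n: (w_latency * n["latency"]
                                            + w_cost * n["cost"]
                                            + w_energy * n["energy"]))
        best["free_cpu"] -= act["cpu"]          # consume capacity on the chosen node
        placement[act["name"]] = best["name"]
    return placement

edge  = {"name": "edge-1",  "free_cpu": 4,  "latency": 5,  "cost": 0.2, "energy": 1.0}
cloud = {"name": "cloud-1", "free_cpu": 64, "latency": 80, "cost": 1.0, "energy": 3.0}
acts  = [{"name": "filter", "cpu": 2}, {"name": "analytics", "cpu": 16}]
print(greedy_place(acts, [edge, cloud]))  # analytics -> cloud-1, filter -> edge-1
```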

    Energy-Performance Optimization for the Cloud


    Scalable and responsive real time event processing using cloud computing

    PhD Thesis. Cloud computing provides the potential for scalability and adaptability in a cost-effective manner. However, when it comes to achieving scalability for real-time applications, response time cannot be allowed to grow. Many applications require good performance and low response time, which must be matched by dynamic resource allocation. Real-time processing requirements are also characterized by unpredictable rates of incoming data streams and dynamic bursts of data. This raises the issue of processing the data streams across multiple cloud computing nodes. This research analyses possible methodologies for processing real-time data in which applications are structured as multiple event-processing networks and partitioned over the set of available cloud nodes. The approach is based on queueing theory principles applied to cloud computing. The transformation of raw data into useful outputs occurs in various stages of processing networks which are distributed across multiple computing nodes in a cloud. A set of valid options is created to understand the response-time requirements of each application. Under a given valid set of conditions that meet the response-time criteria, multiple instances of event-processing networks are distributed across the cloud nodes. A generic methodology to scale the event-processing networks up and down in accordance with the response-time criteria is defined. Real-time applications that support sophisticated decision-support mechanisms need to comply with response-time criteria while consisting of interdependent data-flow paradigms, making it harder to improve performance. Consideration is given to ways of reducing latency and improving the response time and throughput of real-time applications by distributing the event-processing networks across multiple computing nodes.
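Since the methodology above is grounded in queueing theory, one simple way to picture the scale-up/scale-down decision is the textbook M/M/1 response-time formula W = 1/(mu - lambda): add instances of an event-processing network until each instance's estimated response time meets the target. The sketch below is my simplification under an M/M/1 assumption, not the thesis's actual model or scaling rule.

```python
# Illustrative M/M/1-based sizing, a simplification of a queueing-based scaling rule:
# with arrivals split evenly over n instances, each instance's mean response time is
# W = 1 / (mu - lambda/n); grow n until W meets the target.
import math

def instances_needed(arrival_rate, service_rate, target_response_s):
    """Smallest n such that every instance (load split evenly) meets the response target."""
    if 1.0 / service_rate > target_response_s:
        raise ValueError("target unreachable: even an unloaded instance is too slow")
    n = max(1, math.ceil(arrival_rate / service_rate))  # start from a stable split
    while True:
        lam = arrival_rate / n
        if lam < service_rate and 1.0 / (service_rate - lam) <= target_response_s:
            return n
        n += 1

# 900 events/s arriving, each instance sustains 200 events/s, 50 ms response target.
print(instances_needed(900, 200, 0.05))  # -> 5 instances
```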

    Essentials of Business Analytics
