Modeling and simulation of data-driven applications in SDN-aware environments
PhD Thesis
The popularity of Software-Defined Networking (SDN) is rising as it promises a window of opportunity and new features in terms of network performance, configuration, and management. As such, SDN is exploited by several emerging applications and environments, such as cloud computing, edge computing, IoT, and data-driven applications. Although SDN has demonstrated significant improvements in industry, little research has explored the adoption of SDN for cross-layer optimization in different SDN-aware environments.
Each application and computing environment requires different functionalities and Quality of Service (QoS) guarantees. For example, a typical MapReduce application requires data transmission at three distinct times, whereas the data transmission of stream-based applications is unknown in advance due to uncertainty about the number of required tasks and the dependencies among stream tasks. As such, the deployment of SDN differs from application to application, requiring different deployment strategies and algorithms to meet different QoS requirements (e.g., high bandwidth, deadlines). Further, each application and environment has a unique architecture, which imposes different forms of complexity in terms of computing, storage, and network. Due to such complexities, finding optimal solutions for SDN-aware applications and environments becomes very challenging.
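As a hedged illustration of why these transfer phases matter when modeling MapReduce over SDN, the sketch below estimates a job's makespan from its three bulk transfers under an SDN-allocated bandwidth; every name and number here is an illustrative assumption, not the thesis's actual model.

```python
# Illustrative sketch (not the thesis's actual model): how the three bulk
# data transfers of a MapReduce job contribute to its makespan when an SDN
# controller allocates a fixed bandwidth slice to each phase.

def transfer_time(data_gb: float, bandwidth_gbps: float) -> float:
    """Seconds to move `data_gb` gigabytes at `bandwidth_gbps` Gbit/s."""
    return (data_gb * 8) / bandwidth_gbps

def job_makespan(input_gb, shuffle_gb, output_gb, bw_gbps, map_s, reduce_s):
    # The three transmission times named in the text: loading input to
    # mappers, shuffling map output to reducers, writing final output.
    load = transfer_time(input_gb, bw_gbps)
    shuffle = transfer_time(shuffle_gb, bw_gbps)
    write = transfer_time(output_gb, bw_gbps)
    return load + map_s + shuffle + reduce_s + write

# Example: 100 GB input, 30 GB shuffle, 10 GB output over a 10 Gbit/s slice,
# with assumed compute times of 120 s (map) and 60 s (reduce).
print(job_makespan(100, 30, 10, bw_gbps=10, map_s=120, reduce_s=60))
```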
Therefore, this thesis presents multilateral research on the modeling, simulation, and cross-layer optimization of SDN-aware applications and environments. Several tools and algorithms have been proposed, implemented, and evaluated, considering various environments and applications [1–4]. The main contributions of this thesis are as follows:
• Proposing and modeling a new holistic framework that simulates MapReduce applications, big data management systems (BDMS), and SDN-aware networks in cloud-based environments. Theoretical and mathematical models of MapReduce in SDN-aware cloud datacenters are also proposed.
Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters
Cloud computing is becoming a fundamental facility of society today. Large-scale public and private cloud datacenters spanning millions of servers, operating as warehouse-scale computers, support most of the business of Fortune 500 companies and serve billions of users around the world. Unfortunately, the modern industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only negatively impacts the operational and capital components of cost efficiency, but also becomes a scaling bottleneck due to the limits on electricity delivered by nearby utilities. It is critical and challenging to improve multi-resource efficiency in global datacenters.
Additionally, with the great commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous computation workloads, including online web services, batch processing, machine learning, streaming computing, interactive queries, and graph computation, on shared clusters. Most of these are long-running workloads that leverage long-lived containers to execute tasks.
We surveyed datacenter resource scheduling work from the last 15 years. Most previous works are designed to maximize cluster efficiency for short-lived tasks in batch processing systems like Hadoop. They are not suitable for modern long-running workloads of microservices and Spark-, Flink-, Pregel-, Storm-, or TensorFlow-like systems. It is urgent to develop new, effective scheduling and resource allocation approaches to improve efficiency in large-scale enterprise datacenters.
This dissertation is the first work to define and identify the problems, challenges, and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenters. Such workloads rely on predictive scheduling techniques to perform reservation, auto-scaling, migration, or rescheduling, which drives us to pursue more intelligent scheduling techniques backed by adequate predictive knowledge. We specify what intelligent scheduling is, which capabilities are necessary for it, and how to leverage it to transform NP-hard online scheduling problems into resolvable offline scheduling problems.
We designed and implemented an intelligent cloud datacenter scheduler that automatically performs resource-to-performance modeling, predictive optimal reservation estimation, and QoS (interference)-aware predictive scheduling to maximize multi-dimensional resource efficiency (CPU, memory, network, disk I/O) while strictly guaranteeing service level agreements (SLAs) for long-running workloads.
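As a rough sketch of this style of predictive, interference-aware placement (all names, the peak-reservation rule, and the headroom objective are assumptions made for illustration, not the dissertation's actual models):

```python
# Illustrative sketch of QoS (interference)-aware predictive placement for
# long-running containers; the real system's models and inputs differ.
from dataclasses import dataclass, field

DIMS = ("cpu", "mem", "net", "disk")

@dataclass
class Node:
    capacity: dict  # per-dimension capacity
    used: dict = field(default_factory=lambda: {d: 0.0 for d in DIMS})

def predicted_reservation(history: list[dict]) -> dict:
    """Reserve the per-dimension peak observed in recent usage history,
    a crude stand-in for resource-to-performance modeling."""
    return {d: max(h[d] for h in history) for d in DIMS}

def place(nodes: list[Node], demand: dict) -> Node | None:
    # Pick the feasible node whose worst-dimension utilization after
    # placement is lowest, leaving headroom against interference.
    best, best_load = None, float("inf")
    for n in nodes:
        after = {d: (n.used[d] + demand[d]) / n.capacity[d] for d in DIMS}
        if all(v <= 1.0 for v in after) and max(after.values()) < best_load:
            best, best_load = n, max(after.values())
    if best:
        for d in DIMS:
            best.used[d] += demand[d]
    return best
```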
Finally, we introduced a large-scale co-location technique for executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group. It effectively improves cluster utilization from 10% to an average of 50%. This goes far beyond scheduling, involving the evolution of IDC, network, physical datacenter topology, storage, server hardware, operating systems, and containerization. We demonstrate its effectiveness by analyzing the newest Alibaba public cluster trace from 2017, and are the first to reveal, through data, the global view of scenarios, challenges, and status in Alibaba's large-scale global datacenters, including big promotion events like Double 11.
Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical and necessary for pursuing maximized multi-resource efficiency in modern large-scale datacenters, especially for long-running workloads.
Intelligent Straggler Mitigation in Massive-Scale Computing Systems
In order to satisfy increasing demands for Cloud services, modern computing systems are often massive in scale, typically consisting of hundreds to thousands of heterogeneous machine nodes. Parallel computing frameworks such as MapReduce are widely deployed over such cluster infrastructure to provide reliable yet prompt services to customers. However, complex characteristics of Cloud workloads, including multi-dimensional resource requirements and highly changeable system environments (e.g., dynamic node performance), are introducing new challenges to service providers in terms of both customer experience and system efficiency. One primary challenge is the straggler problem, whereby a small subset of the parallelized tasks take abnormally long execution times in comparison with their siblings, leading to extended job response times and potential late-timing failures.
The state-of-the-art approach to straggler mitigation is speculative execution. Although it has been deployed in several real-world systems with a variety of implementation optimizations, the analysis in this thesis shows that speculative execution is often inefficient. According to various production tracelogs of data centers, the failure rate of speculative execution can be as high as 71%. Straggler mitigation is a complicated problem by its very nature: 1) stragglers may lead to different consequences for parallel job execution, possibly with different degrees of severity; 2) whether a task should be regarded as a straggler is highly subjective, depending upon application and system conditions; 3) the efficiency of speculative execution would be improved if dynamic node performance could be modelled and predicted appropriately; and 4) there are other types of stragglers, e.g. those caused by data skew, that are beyond the capability of speculative execution.
This thesis starts with a quantitative and rigorous analysis of issues with stragglers, including their root causes and impacts, the execution environments running them, and the limitations of their mitigation. Scientific principles of straggler mitigation are investigated and new algorithms are developed. An intelligent system for straggler mitigation is then designed and developed, compatible with the majority of current parallel computing frameworks. Combined with historical data analysis and online adaptation, the system is capable of mitigating stragglers intelligently: dynamically judging a task as a straggler and handling it, avoiding currently weak nodes, and dealing with data skew, a special type of straggler, with a dedicated method. Comprehensive analysis and evaluation of the system show that it is able to reduce job response time by up to 55% compared with the speculator used in the default YARN system, while the optimal improvement a speculation-based method may achieve is around 66% in theory. The system also achieves a much higher success rate of speculation than other production systems, up to 89%.
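To make "dynamically judging a task as a straggler" concrete, the following is a minimal sketch of a common progress-rate heuristic; the threshold and the naive remaining-time model are illustrative assumptions, and the thesis's adaptive algorithms go well beyond this.

```python
# Illustrative progress-rate heuristic for flagging stragglers among the
# sibling tasks of one job; the thesis's adaptive method is more refined.
from statistics import median

def stragglers(progress: dict[str, float], elapsed: dict[str, float],
               threshold: float = 1.5) -> list[str]:
    """Flag tasks whose estimated completion time exceeds `threshold`
    times the median estimate of their siblings."""
    # Naive estimate: total time = elapsed time / fraction of work done.
    est = {t: elapsed[t] / p for t, p in progress.items() if p > 0}
    med = median(est.values())
    return [t for t, e in est.items() if e > threshold * med]

# Example: task "t3" has made little progress relative to its siblings,
# so it is flagged as a straggler.
print(stragglers({"t1": 0.9, "t2": 0.85, "t3": 0.2},
                 {"t1": 90, "t2": 88, "t3": 91}))
```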
Task Scheduling in Big Data Platforms: A Systematic Literature Review
Context: Hadoop, Spark, Storm, and Mesos are well-known frameworks in both the research and industrial communities that allow expressing and processing distributed computations on massive amounts of data. Multiple scheduling algorithms have been proposed to ensure that short interactive jobs, large batch jobs, and guaranteed-capacity production jobs running on these frameworks can deliver results quickly while maintaining a high throughput. However, only a few works have examined the effectiveness of these algorithms.
Objective: The Evidence-Based Software Engineering (EBSE) paradigm and its core tool, the Systematic Literature Review (SLR), were introduced to the Software Engineering community in 2004 to help researchers systematically and objectively gather and aggregate research evidence about different topics. In this paper, we conduct an SLR of task scheduling algorithms that have been proposed for big data platforms.
Method: We analyse the design decisions of different scheduling models proposed in the literature for Hadoop, Spark, Storm, and Mesos over the period between 2005 and 2016. We provide a research taxonomy for succinct classification of these scheduling models. We also compare the algorithms in terms of performance, resource utilization, and failure recovery mechanisms.
Results: Our search identified 586 studies from journals, conferences, and workshops having the highest quality in this field. This SLR reports on the different types of scheduling models (dynamic, constrained, and adaptive) and the main motivations behind them (including data locality, workload balancing, resource utilization, and energy efficiency). A discussion of open issues and future challenges pertaining to improving the current studies is also provided.
Hadoop performance modeling and job optimization for big data analytics
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.
Big data has gained momentum from both academia and industry. The MapReduce model has emerged as a major computing model in support of big data analytics. Hadoop, an open source implementation of the MapReduce model, has been widely taken up by the community, and cloud service providers such as Amazon EC2 now support Hadoop user applications. However, a key challenge is that cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the required amount of resources for their jobs running in a public cloud. This thesis presents a Hadoop performance model that accurately estimates the execution duration of a job and further provisions the required amount of resources for the job to be completed within a deadline. The proposed model employs Locally Weighted Linear Regression (LWLR) to estimate the execution time of a job and the Lagrange Multiplier technique for resource provisioning to satisfy user jobs with given deadlines. The performance of the proposed model is extensively evaluated on both an in-house Hadoop cluster and the Amazon EC2 cloud. Experimental results show that the proposed model is highly accurate in job execution estimation, and that jobs are completed within the required deadlines when following the model's resource provisioning scheme.
In addition, the Hadoop framework has over 190 configuration parameters, some of which have significant effects on the performance of a Hadoop job. Manually setting optimum values for these parameters is a challenging and time-consuming task. This thesis therefore presents optimization work that enhances the performance of Hadoop by automatically tuning its parameter values. It employs the Gene Expression Programming (GEP) technique to build an objective function that represents the performance of a job and the correlations among the configuration parameters. For the optimization itself, Particle Swarm Optimization (PSO) is employed to automatically find optimal or near-optimal configuration settings. The proposed work is intensively evaluated on a Hadoop cluster, and the experimental results show that it enhances the performance of Hadoop significantly compared with the default settings.
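As a hedged sketch of the estimation technique named above, the snippet below implements generic locally weighted linear regression; the features, bandwidth parameter, and toy data are illustrative assumptions rather than the thesis's actual model.

```python
# Generic locally weighted linear regression (LWLR), the estimation
# technique named in the abstract; features and data are illustrative.
import numpy as np

def lwlr_predict(X: np.ndarray, y: np.ndarray, x_query: np.ndarray,
                 tau: float = 1.0) -> float:
    """Predict y at x_query by fitting a linear model weighted toward
    training points near the query (Gaussian kernel, bandwidth tau)."""
    Xb = np.hstack([np.ones((len(X), 1)), X])  # add intercept column
    xq = np.concatenate([[1.0], x_query])
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * tau ** 2))           # local weights
    W = np.diag(w)
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return float(xq @ theta)

# Example: job execution time (s) vs. (input size GB, map slots), toy data.
X = np.array([[10, 4], [20, 4], [20, 8], [40, 8]], dtype=float)
y = np.array([120.0, 230.0, 130.0, 250.0])
print(lwlr_predict(X, y, np.array([30.0, 8.0]), tau=10.0))
```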
Service level agreement specification for IoT application workflow activity deployment, configuration and monitoring
PhD Thesis
Currently, we see the use of the Internet of Things (IoT) within various domains such as healthcare, smart homes, smart cars, smart-x applications, and smart cities. The number of applications based on IoT and cloud computing is projected to increase rapidly over the next few years. IoT-based services must meet guaranteed levels of quality of service (QoS) to match users' expectations, so ensuring QoS by specifying QoS constraints in service level agreements (SLAs) is crucial. Moreover, because of the potentially highly complex nature of multi-layered IoT applications, lifecycle management (deployment, dynamic reconfiguration, and monitoring) needs to be automated. To achieve this, it is essential to be able to specify SLAs in a machine-readable format.
However, currently available SLA specification languages are unable to accommodate the unique characteristics of the IoT domain, such as the interdependency of its multiple layers. Therefore, in this research, we propose a grammar for the syntactic structure of an SLA specification for IoT. The grammar is based on a proposed conceptual model covering the main concepts that can be used to express requirements for the most common hardware and software components of an IoT application on an end-to-end basis. We follow the Goal Question Metric (GQM) approach to evaluate the generality and expressiveness of the proposed grammar, reviewing its concepts and their predefined vocabularies against two use cases with a number of participants whose research interests are mainly related to IoT. The results of the analysis show that the proposed grammar achieved 91.70% of its generality goal and 93.43% of its expressiveness goal.
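To illustrate what a machine-readable, end-to-end SLA might look like in this spirit, here is an invented fragment for an RHMS-style workflow; the field names and structure are assumptions for illustration, not the proposed grammar itself.

```python
# Invented illustration of a machine-readable, multi-layer SLA fragment;
# these field names are not the grammar proposed in the thesis.
import json

sla = {
    "application": "remote-health-monitoring",
    "workflow_activities": [
        {"name": "ingest-vitals", "layer": "edge",
         "qos": {"max_latency_ms": 50, "min_availability": 0.999}},
        {"name": "analyse-vitals", "layer": "cloud",
         "qos": {"max_response_time_ms": 500, "min_throughput_rps": 100}},
    ],
    # Cross-layer dependency: analysis consumes the ingestion activity's
    # output, so its QoS depends on the edge layer meeting its own terms.
    "dependencies": [["ingest-vitals", "analyse-vitals"]],
}
print(json.dumps(sla, indent=2))
```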
To enhance the process of specifying SLA terms, we then developed a toolkit for creating SLA specifications for IoT applications. The toolkit simplifies the process of capturing the requirements of IoT applications. We demonstrate its effectiveness using a remote health monitoring service (RHMS) use case, and evaluate the tool's user experience through a questionnaire-based approach. We further discuss the applicability of our tool by including it as a core component of two different applications: 1) a context-aware recommender system for IoT configuration across layers; and 2) a tool for automatically translating an SLA from JSON to a smart contract and deploying it on peer nodes that represent the contractual parties, after which the smart contract monitors the created SLA using blockchain technology. These two applications are utilized within our proposed SLA management framework for IoT.
Furthermore, we propose a greedy heuristic algorithm to decentralize the workflow activities of an IoT application across Edge and Cloud resources in order to improve response time, cost, energy consumption, and network usage. We evaluated the efficiency of our proposed approach using the iFogSim simulator. The performance analysis shows that the proposed algorithm minimizes cost, execution time, network usage, and Cloud energy consumption compared with Cloud-only and edge-ward placement approaches.
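A minimal sketch of a greedy edge/cloud placement heuristic in the spirit described follows; the weighted objective, its coefficients, and all inputs are assumptions made for illustration, not the thesis's algorithm.

```python
# Illustrative greedy placement of workflow activities on edge/cloud
# resources; objective weights and inputs are invented for this sketch.
def greedy_place(activities, resources, cost_fn):
    """Assign each activity (in workflow order) to the resource that
    currently minimizes the weighted cost function."""
    placement = {}
    for act in activities:
        best = min(resources, key=lambda r: cost_fn(act, r))
        placement[act["name"]] = best["name"]
        best["load"] += act["demand"]
    return placement

def cost_fn(act, res):
    # Weighted sum of latency, monetary cost, and energy; the weights
    # 0.5/0.3/0.2 are illustrative assumptions.
    latency = res["base_latency_ms"] * (1 + res["load"] / res["capacity"])
    return (0.5 * latency + 0.3 * res["price"] * act["demand"]
            + 0.2 * res["energy_per_unit"] * act["demand"])

edge = {"name": "edge-1", "base_latency_ms": 5, "price": 2.0,
        "energy_per_unit": 1.5, "load": 0, "capacity": 10}
cloud = {"name": "cloud-1", "base_latency_ms": 60, "price": 1.0,
         "energy_per_unit": 1.0, "load": 0, "capacity": 1000}
acts = [{"name": "sense", "demand": 1}, {"name": "analyse", "demand": 8}]
print(greedy_place(acts, [edge, cloud], cost_fn))
```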
Scalable and responsive real time event processing using cloud computing
PhD Thesis
Cloud computing provides the potential for scalability and adaptability in a cost-effective manner. However, when scalability is required for real-time applications, response time cannot be allowed to grow: many applications require good performance and low response times, which must be matched with dynamic resource allocation. Real-time processing requirements are also characterized by unpredictable rates of incoming data streams and dynamic outbursts of data, which raises the issue of processing the data streams across multiple cloud computing nodes. This research analyzes methodologies for processing real-time data in which applications are structured as multiple event processing networks partitioned over the set of available cloud nodes. The approach is based on queuing theory principles applied to cloud computing. The transformation of raw data into useful outputs occurs in various stages of the processing networks, which are distributed across multiple computing nodes in a cloud. A set of valid options is created to understand the response time requirements of each application, and under a given valid set of conditions that meet the response time criteria, multiple instances of event processing networks are distributed over the cloud nodes. A generic methodology to scale the event processing networks up and down in accordance with the response time criteria is defined. Real-time applications that support sophisticated decision support mechanisms must comply with response time criteria while consisting of interdependent dataflow paradigms, which makes it harder to improve their performance. Consideration is given to ways to reduce latency and improve the response time and throughput of real-time applications by distributing the event processing networks over multiple computing nodes.
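As a hedged illustration of queuing-theory-driven scaling, the sketch below models each event processing network instance as an M/M/1 queue and finds the smallest instance count meeting a response time target; the queueing model choice and all numbers are assumptions for illustration.

```python
# Illustrative M/M/1-based sizing: find the smallest number of event
# processing network instances keeping mean response time under a target.
def instances_needed(arrival_rate, service_rate, target_response_s,
                     max_instances=1000):
    """Events are split evenly over n instances, each modelled as an
    M/M/1 queue with mean response time 1 / (mu - lambda/n)."""
    for n in range(1, max_instances + 1):
        per_node = arrival_rate / n
        if per_node < service_rate:                     # queue is stable
            response = 1.0 / (service_rate - per_node)  # M/M/1 mean time
            if response <= target_response_s:
                return n
    raise ValueError("target unreachable within max_instances")

# Example: 900 events/s arriving, each instance serves 100 events/s,
# and the response time target is 50 ms -> 12 instances are needed.
print(instances_needed(900, 100, 0.05))
```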