5 research outputs found

    Hybrid ant colony system algorithm for static and dynamic job scheduling in grid computing

    Grid computing is a distributed system with heterogeneous infrastructures. The resource management system (RMS) is one of its most important components and has great influence on grid computing performance. The main part of the RMS is the scheduler algorithm, which is responsible for mapping submitted tasks to available resources. The scheduling problem is nondeterministic polynomial complete (NP-complete), and therefore an intelligent algorithm is required to achieve a better scheduling solution. One of the prominent intelligent algorithms is the ant colony system (ACS), which is widely implemented to solve various types of scheduling problems. However, ACS suffers from a stagnation problem in medium- and large-size grid computing systems. ACS is based on exploitation and exploration mechanisms, where the exploitation is sufficient but the exploration is deficient: exploration in ACS is based on a random approach without any strategy. This study proposed four hybrid algorithms combining ACS, Genetic Algorithm (GA), and Tabu Search (TS) to enhance ACS performance. The algorithms are ACS(GA), ACS+GA, ACS(TS), and ACS+TS. These hybrid algorithms enhance ACS in terms of exploration mechanism and solution refinement by implementing low- and high-level hybridization of ACS, GA, and TS. The proposed algorithms were evaluated against twelve metaheuristic algorithms in static (expected time to compute model) and dynamic (distribution pattern) grid computing environments. A simulator called ExSim was developed to mimic the static and dynamic nature of grid computing. Experimental results show that the proposed algorithms outperform ACS in terms of best makespan values. In the static environment, ACS(GA), ACS+GA, ACS(TS), and ACS+TS perform better than ACS by 0.35%, 2.03%, 4.65%, and 6.99% respectively. In the dynamic environment, ACS(GA), ACS+GA, ACS+TS, and ACS(TS) perform better than ACS by 0.01%, 0.56%, 1.16%, and 1.26% respectively. The proposed algorithms can be used to schedule tasks in grid computing with better performance in terms of makespan.
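
The makespan metric used in this abstract can be illustrated with a minimal sketch, assuming the usual reading of an expected-time-to-compute (ETC) matrix: `etc[t][r]` is the estimated run time of task `t` on resource `r`, and the makespan is the finishing time of the most heavily loaded resource. The function name, the matrix values, and the assignment encoding are hypothetical and are not taken from the paper or from ExSim.

```python
from typing import Sequence


def makespan(etc: Sequence[Sequence[float]], assignment: Sequence[int]) -> float:
    """Return the makespan of a static schedule.

    etc[t][r]     -- assumed expected time to compute task t on resource r
    assignment[t] -- index of the resource that task t is mapped to
    """
    n_resources = len(etc[0])
    loads = [0.0] * n_resources  # accumulated busy time per resource
    for task, resource in enumerate(assignment):
        loads[resource] += etc[task][resource]
    # The schedule finishes when the busiest resource finishes.
    return max(loads)


# Hypothetical 3-task, 2-resource instance.
etc = [
    [3.0, 5.0],
    [2.0, 4.0],
    [6.0, 1.0],
]
schedule_a = [0, 0, 1]  # tasks 0 and 1 on resource 0, task 2 on resource 1
schedule_b = [1, 0, 1]  # tasks 0 and 2 on resource 1, task 1 on resource 0
```

A scheduler such as ACS or one of its hybrids would search over `assignment` vectors for one that minimizes this value; the search itself is not shown here.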

    CMS workflow execution using intelligent job scheduling and data access strategies

    Complex scientific workflows can process large amounts of data using thousands of tasks. The turnaround times of these workflows are often affected by various latencies such as the resource discovery, scheduling and data access latencies for the individual workflow processes or actors. Minimizing these latencies will improve the overall execution time of a workflow and thus lead to a more efficient and robust processing environment. In this paper, we propose a pilot job concept that has intelligent data reuse and job execution strategies to minimize the scheduling, queuing, execution and data access latencies. The results have shown that significant improvements in the overall turnaround time of a workflow can be achieved with this approach. The proposed approach has been evaluated, first using the CMS Tier0 data processing workflow, and then simulating the workflows to evaluate its effectiveness in a controlled environment. © 2011 IEEE

    Job Failure Analysis and Its Implications in a Large-Scale Production Grid

    In this paper we present an initial analysis of job failures in a large-scale data-intensive Grid. Based on three representative periods in production, we characterize the interarrival times and life spans of failed jobs. Different failure types are distinguished and the analysis is carried out further at the Virtual Organization (VO) level. The spatial behavior, namely where job failures occur in the Grid, is also examined. Cross-correlation structures, including how arrivals correlate with life spans of job failures, are analyzed and illustrated. We further investigate statistical models to fit the failure data and propose several failure-aware scheduling strategies at the Grid level. Our results show that the overall failure rates in the Grid are quite significant, ranging from 25% to 33% of all submitted jobs. However, only 5% to 8% of the jobs failed after running on a certain Computing Element (CE). The rest of the failed jobs are aborted or cancelled without running. A majority of failed jobs come from several large production VOs, and a large number of these failures are centered around several main CEs. The interarrival time processes of failed jobs are shown to be bursty, and the life spans exhibit strong autocorrelations. Based on the failure patterns we argue that it is important for the Grid resource brokers to track historical failures and take them into account in decision making. Some proactive measures and accountability issues are also discussed.
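
The idea of a resource broker that tracks historical failures can be sketched as follows. This is only an illustration under stated assumptions, not one of the strategies proposed in the paper: the class name, the CE identifiers, and the smoothed failure-rate estimate are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable


class FailureAwareBroker:
    """Toy broker that prefers Computing Elements with fewer past failures."""

    def __init__(self) -> None:
        self._submitted = defaultdict(int)  # jobs sent to each CE
        self._failed = defaultdict(int)     # jobs that failed on each CE

    def record(self, ce: str, failed: bool) -> None:
        """Store the outcome of one job that was dispatched to `ce`."""
        self._submitted[ce] += 1
        if failed:
            self._failed[ce] += 1

    def failure_rate(self, ce: str) -> float:
        """Smoothed failure estimate; a CE with no history scores 0.5."""
        return (self._failed[ce] + 1) / (self._submitted[ce] + 2)

    def pick(self, candidates: Iterable[str]) -> str:
        """Choose the candidate CE with the lowest estimated failure rate."""
        return min(candidates, key=self.failure_rate)


broker = FailureAwareBroker()
for outcome in (True, True, True, False):   # hypothetical history for "ce-a"
    broker.record("ce-a", failed=outcome)
for outcome in (True, False, False, False):  # hypothetical history for "ce-b"
    broker.record("ce-b", failed=outcome)
```

With this history, `broker.pick(["ce-a", "ce-b", "ce-c"])` favours `ce-b`, while the unseen `ce-c` is treated as neither good nor bad. A production broker would also need to age out old observations, since the abstract reports that failures arrive in bursts.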

    Holistic cloud computing environmental quantification and behavioural analysis

    Cloud computing has been characterized as large-scale multi-tenant systems that are able to dynamically scale up and scale down computational resources for consumers with diverse Quality-of-Service requirements. In recent years, a number of dependability and resource management approaches have been proposed for Cloud computing datacenters. However, there is still a lack of real-world Cloud datasets that analyse and extensively model Cloud computing characteristics and quantify their effect on system dimensions such as resource utilization, user behavioural patterns and failure characteristics. This results in two research problems. First, without holistic analysis of real-world systems, Cloud characteristics and their dimensions cannot be quantified, resulting in inaccurate research assumptions about Cloud system behaviour. Second, simulated parameters used in state-of-the-art Cloud mechanisms currently rely on theoretical values that do not accurately represent real Cloud systems, as important parameters such as failure times and energy waste have not been quantified using empirical data. This presents a large gap in terms of practicality and effectiveness between developing and evaluating mechanisms within simulated and real Cloud systems. This thesis presents a comprehensive method and empirical analysis of large-scale production Cloud computing environments in order to quantify system characteristics in terms of consumer submission and resource request patterns, workload behaviour, server utilization and failures. Furthermore, this work identifies areas of operational inefficiency within the system, and quantifies the amount of energy waste created due to failures. We discover that 4-10% of all server computation is wasted due to Termination Events, and that failures contribute to approximately 11% of the total datacenter energy waste. These analyses of empirical data give researchers and Cloud providers an enhanced understanding of real Cloud behaviour, support system assumptions, and provide parameters that can be used to develop and validate the effectiveness of future energy-efficient and dependability mechanisms.

    A prescriptive analytics approach for energy efficiency in datacentres.

    Given the evolution of Cloud Computing in recent years, users and clients adopting Cloud Computing for both personal and business needs have increased at an unprecedented scale. This has naturally led to the increased deployments and implementations of Cloud datacentres across the globe. As a consequence of this increasing adoption of Cloud Computing, Cloud datacentres are witnessed to be massive energy consumers and environmental polluters. Whilst the energy implications of Cloud datacentres are being addressed from various research perspectives, predicting the future trend and behaviours of workloads at the datacentres thereby reducing the active server resources is one particular dimension of green computing gaining the interests of researchers and Cloud providers. However, this includes various practical and analytical challenges imposed by the increased dynamism of Cloud systems. The behavioural characteristics of Cloud workloads and users are still not perfectly clear which restrains the reliability of the prediction accuracy of existing research works in this context. To this end, this thesis presents a comprehensive descriptive analytics of Cloud workload and user behaviours, uncovering the cause and energy related implications of Cloud Computing. Furthermore, the characteristics of Cloud workloads and users including latency levels, job heterogeneity, user dynamicity, straggling task behaviours, energy implications of stragglers, job execution and termination patterns and the inherent periodicity among Cloud workload and user behaviours have been empirically presented. Driven by descriptive analytics, a novel user behaviour forecasting framework has been developed, aimed at a tri-fold forecast of user behaviours including the session duration of users, anticipated number of submissions and the arrival trend of the incoming workloads. 
Furthermore, a novel resource optimisation framework has been proposed to provide the optimum level of resources for executing jobs with reduced server energy expenditure and fewer job terminations. This optimisation framework encompasses a resource estimation module, which predicts the anticipated resource consumption level of arriving jobs, and a classification module, which classifies tasks based on their resource intensiveness. Both proposed frameworks have been verified theoretically and tested experimentally on Google Cloud trace logs. Experimental analysis demonstrates the effectiveness of the proposed framework in terms of the reliability of the forecast results and the reduction in server energy spent on executing jobs at the datacentres.