11 research outputs found
Discovering Job Preemptions in the Open Science Grid
The Open Science Grid (OSG) is a world-wide computing system that facilitates
distributed computing for scientific research. It can distribute a
computationally intensive job to geo-distributed clusters and process the job's
tasks in parallel. For compute clusters on the OSG, physical resources may be
shared between OSG jobs and a cluster's local user-submitted jobs, with local jobs
preempting OSG-based ones. As a result, job preemptions occur frequently on the
OSG, sometimes significantly delaying job completion.
We have collected job data from the OSG over a period of more than 80 days. We
present an analysis of the data, characterizing the preemption patterns and the
different types of jobs. Based on these observations, we have grouped OSG jobs into 5
categories and analyzed the runtime statistics for each category. We further
chose different statistical distributions to estimate the probability density
function of job runtime for each class.
Effective Scheduling of Grid Resources Using Failure Prediction
In large-scale grid environments, accurate failure prediction is critical for effective resource allocation while assuring specified QoS levels, such as reliability. Traditional methods, such as statistical estimation techniques, can be used to predict the reliability of resources. However, naive statistical methods often ignore critical characteristic behaviors of the resources. In particular, periodic behaviors of grid resources are not captured well by statistical methods. In this paper, we present an alternative mechanism for failure prediction. In our approach, the periodic patterns of resource failures are determined and actively exploited for resource allocation with better QoS guarantees. The proposed scheme is evaluated in a realistic simulation environment of computational grids. The availability of computing resources is simulated according to a real trace collected from our large-scale monitoring experiment on campus computers. Our evaluation results show that the proposed approach achieves significantly higher resource scheduling effectiveness under a variety of workloads compared to baseline approaches.
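One simple way to detect such periodic availability patterns, sketched here as an illustration rather than the paper's actual mechanism, is to look for the strongest peak in the autocorrelation of a binary availability trace:

```python
def autocorrelation(series, lag):
    """Normalized autocorrelation of a 0/1 availability trace at a lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

def dominant_period(series, max_lag):
    """Return the lag (in samples) with the strongest autocorrelation."""
    return max(range(1, max_lag + 1),
               key=lambda k: autocorrelation(series, k))

# Hypothetical trace: a machine available 16 h, then down 8 h, repeating
# over a week of hourly samples (e.g. a lab machine free at night).
trace = ([1] * 16 + [0] * 8) * 7
period = dominant_period(trace, 36)  # expected to recover the 24 h cycle
```

A scheduler could then avoid placing long jobs on a resource whose predicted unavailable window overlaps the job's expected runtime.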
Adaps – A three-phase adaptive prediction system for the run-time of jobs based on user behaviour
In heterogeneous and distributed environments it is necessary to create schedules that utilise resources efficiently. Generating such schedules often poses a problem for a scheduler, since several aspects have to be considered. One way of supporting a scheduler is to provide accurate predictions of the run-times of the submitted jobs. Many current techniques offer statistical models that are deployed on previously filtered data. As users have different jobs, and because the attributes of their jobs differ, both the filtering of data and the choice of an appropriate prediction method have to cover these aspects. This article describes Adaps, a system for run-time prediction that works in three phases, each of which independently adjusts to the jobs of a user based on historical information. This leads to a user-specific clustering of data and to a flexible utilisation of different prediction techniques in order to create a user-centred prediction model.
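The user-centred idea can be sketched minimally as follows. The grouping attribute (requested CPU count) and the fallback rule are illustrative assumptions, not the actual Adaps phases:

```python
from collections import defaultdict

class UserRuntimePredictor:
    """Sketch of a user-centred runtime predictor: each user's job history
    is grouped by a job attribute (here, requested CPU count), and the
    prediction is the mean runtime of the matching group."""

    def __init__(self):
        self.history = defaultdict(list)  # (user, cpus) -> list of runtimes

    def record(self, user, cpus, runtime):
        """Store the observed runtime of a completed job."""
        self.history[(user, cpus)].append(runtime)

    def predict(self, user, cpus):
        """Predict runtime from the matching group, falling back to the
        user's overall mean when the group has no history yet."""
        runtimes = self.history.get((user, cpus))
        if runtimes:
            return sum(runtimes) / len(runtimes)
        all_user = [t for (u, _), ts in self.history.items()
                    if u == user for t in ts]
        return sum(all_user) / len(all_user) if all_user else None
```

A real system would additionally choose among several prediction techniques per group, which is the part Adaps adapts in its three phases.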
Scheduling, Characterization and Prediction of HPC Workloads for Distributed Computing Environments
As High Performance Computing (HPC) has grown considerably and is expected to grow even more, effective resource management for distributed computing systems is motivated more than ever. As computational workloads grow in quantity, it becomes more crucial to apply efficient resource management and workload scheduling so that resources are used efficiently while computational performance remains reasonably good. The problem of efficiently scheduling workloads on resources while meeting performance standards is hard. Additionally, non-clairvoyance of job dimensions makes resource management even harder in real-world scenarios. Our research methodology investigates the scheduling problem for HPC and addresses the challenges of deploying the scheduling in real-world scenarios using state-of-the-art machine learning and data science techniques. To this end, this Ph.D. dissertation makes the following core contributions: a) We perform a theoretical analysis of space-sharing, non-preemptive scheduling: we studied this scheduling problem and proposed scheduling algorithms with polynomial computation time. We also proved constant upper bounds on the performance of these algorithms. b) We studied the sensitivity of scheduling algorithms to the accuracy of runtime predictions and devised a meta-learning approach to estimate prediction accuracy for newly submitted jobs to the HPC system. c) We studied the runtime prediction problem for HPC applications. For this purpose, we studied the distribution of available public workloads and proposed two different solutions that can predict multi-modal distributions: switching state-space models and mixture density networks. d) We studied the effectiveness of recent recurrent neural network models for CPU usage trace prediction, for individual VM traces as well as aggregate CPU usage traces.
In this dissertation, we explore solutions to improve the performance of scheduling workloads on distributed systems. We begin by looking at the problem from a theoretical perspective. Modeling the problem mathematically, we first propose a scheduling algorithm that finds a constant-factor approximation of the optimal solution in polynomial time. We prove that the performance of the algorithm (average completion time) is a constant-factor approximation of the performance of the optimal schedule. We next look at the problem in real-world scenarios. Considering High-Performance Computing (HPC) environments as the closest real-world equivalent of our mathematical model, we explore the problem of predicting application runtime. We propose an algorithm to handle the uncertainties that exist in the real world and demonstrate its effectiveness in terms of response time and resource utilization. After addressing the uncertainty problem, we focus on improving the accuracy of existing prediction approaches for HPC application runtime. We propose two solutions, one based on Kalman filters and one based on mixture density networks. We showcase the effectiveness of our prediction approaches by comparing them with previous approaches in terms of prediction accuracy and impact on scheduling performance. Finally, we focus on predicting resource usage for individual applications during their execution, exploring the application of recurrent neural networks to predicting the resource usage of applications deployed on individual virtual machines. To validate our proposed models and solutions, we performed extensive trace-driven simulation and measured the effectiveness of our approaches.
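The space-sharing, non-preemptive setting can be illustrated with a simple list-scheduling heuristic: shortest (predicted) job first on the earliest-free machine. This is a generic textbook illustration of the objective, not the dissertation's algorithm or its approximation bound:

```python
import heapq

def schedule_sjf(runtimes, machines):
    """Non-preemptive shortest-job-first list scheduling: jobs sorted by
    (predicted) runtime are assigned to the earliest-free machine.
    Returns the average completion time of all jobs."""
    free = [0.0] * machines          # next-free time of each machine
    heapq.heapify(free)
    total = 0.0
    for t in sorted(runtimes):       # shortest predicted runtime first
        start = heapq.heappop(free)  # earliest-free machine
        finish = start + t
        total += finish
        heapq.heappush(free, finish)
    return total / len(runtimes)

# Three jobs on one machine: completions 1, 3, 6 -> average 10/3.
avg = schedule_sjf([3, 1, 2], machines=1)
```

This also shows why runtime prediction accuracy matters: the sort key is the predicted, not the true, runtime, so prediction errors directly reorder the schedule.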
An evolutionary approach to scheduling in HPC environments based on the incorporation of subjective criteria
[Abstract]
In the context of a supercomputing center, no matter what its computational resources
are, demand will always exceed them. Users must therefore submit requests for
the execution of their jobs, which wait in a queue until the system scheduler
decides to run them. However, out of ignorance or fear that their jobs will be
aborted, these requests are usually very imprecise, hindering the scheduler's
work. In addition, schedulers are difficult to configure and assume at all
times that a given schedule will satisfy all users equally.
This thesis proposes a scheduler for high-performance computing systems based
on evolutionary computation techniques, allowing scheduling policies to be
defined more naturally and the real resource needs of jobs to be estimated in
order to produce more accurate schedules. Additionally, the concept of
perceived quality of service is considered, enabling the incorporation of
subjective criteria into the scheduling process so as to maintain a high level
of satisfaction among both the users and the supercomputing center itself.
Finally, various aspects of the computational resources are modeled to further
improve scheduling accuracy, especially in heterogeneous
systems.
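The evolutionary-computation idea can be sketched as a toy genetic algorithm that evolves a job ordering to minimize average completion time. The representation, fitness, and swap mutation here are generic illustrative choices, not the thesis's encoding of scheduling policies or subjective criteria:

```python
import random

def avg_completion(order, runtimes):
    """Average completion time of jobs executed serially in this order."""
    t, total = 0.0, 0.0
    for j in order:
        t += runtimes[j]
        total += t
    return total / len(order)

def evolve_schedule(runtimes, generations=200, pop_size=20, seed=0):
    """Minimal evolutionary sketch: candidates are permutations of job
    indices, fitness is average completion time, offspring are produced
    by a single swap mutation, and the best half survives each round."""
    rng = random.Random(seed)
    n = len(runtimes)
    pop = [rng.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda o: avg_completion(o, runtimes))
        survivors = pop[: pop_size // 2]      # elitist selection
        children = []
        for parent in survivors:
            child = parent[:]
            i, j = rng.randrange(n), rng.randrange(n)
            child[i], child[j] = child[j], child[i]
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda o: avg_completion(o, runtimes))
```

The appeal of the evolutionary formulation is that the fitness function can mix objective terms (completion time) with subjective ones (perceived quality of service) without changing the search machinery.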