14 research outputs found

    A Bag-of-Tasks Scheduler Tolerant to Temporal Failures in Clouds

    Cloud platforms have emerged as a prominent environment to execute high performance computing (HPC) applications, providing on-demand resources as well as scalability. They usually offer different classes of Virtual Machines (VMs) which ensure different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs, while spot VMs are unused instances available for a lower price. Despite the monetary advantages, a spot VM can be terminated, stopped, or hibernated by EC2 at any moment. Using both hibernation-prone spot VMs (for cost's sake) and on-demand VMs, we propose in this paper a static scheduling approach for HPC applications composed of independent tasks (bag-of-tasks) with deadline constraints. However, if a spot VM hibernates and does not resume within a time that still allows the application's deadline to be met, a temporal failure takes place. Our scheduling thus aims at minimizing the monetary cost of bag-of-tasks applications in the EC2 cloud, respecting their deadlines and avoiding temporal failures. To this end, our algorithm statically creates two scheduling maps: (i) the first one contains, for each task, its starting time and the VM (i.e., an available spot or on-demand VM with the current lowest price) on which the task should execute; (ii) the second one contains, for each task allocated to a spot VM in the first map, its starting time and the on-demand VM on which it should be executed to meet the application deadline and avoid temporal failures. The latter is used whenever the hibernation period of a spot VM exceeds a time limit. Performance results from simulations with task execution traces, the configuration of Amazon EC2 VM classes, and VM market price history confirm the effectiveness of our scheduling and show that it tolerates temporal failures.
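    The two-map construction described above can be sketched as follows. This is a minimal illustration in Python under assumed inputs (hypothetical Task/VM records with estimated runtimes and hourly prices); it is not the paper's actual algorithm, which also decides start times and packs tasks onto VMs.

        from dataclasses import dataclass

        @dataclass
        class VM:
            name: str
            kind: str            # "spot" or "on-demand"
            price_per_hour: float

        @dataclass
        class Task:
            name: str
            runtime_h: float     # estimated execution time in hours

        def build_schedules(tasks, spot_vms, ondemand_vms, deadline_h):
            """Primary map: each task on the cheapest available VM (spot or
            on-demand). Backup map: for every task placed on a spot VM, an
            on-demand fallback and the latest start time that still meets
            the deadline, used if the spot VM's hibernation lasts too long."""
            primary, backup = {}, {}
            cheapest_any = sorted(spot_vms + ondemand_vms, key=lambda v: v.price_per_hour)
            cheapest_od = min(ondemand_vms, key=lambda v: v.price_per_hour)
            for task in tasks:
                # cheapest VM on which the task still fits before the deadline
                vm = next(v for v in cheapest_any if task.runtime_h <= deadline_h)
                primary[task.name] = (vm.name, 0.0)                 # (VM, start time)
                if vm.kind == "spot":
                    latest_start = deadline_h - task.runtime_h      # last safe start
                    backup[task.name] = (cheapest_od.name, latest_start)
            return primary, backup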

    Proactive software rejuvenation solution for web environments on virtualized platforms

    The availability of Information Technologies for everything, from everywhere, at all times is a growing requirement. We use Information Technologies for everything from common and social tasks to critical tasks such as managing nuclear power plants or even the International Space Station (ISS). However, the availability of IT infrastructures is still a huge challenge nowadays. A quick look at the news reveals reports of corporate outages affecting millions of users and impacting the revenue and image of the companies involved. It is well known that computer system outages are currently more often due to software faults than hardware faults. Several studies have reported that one of the causes of unplanned software outages is the software aging phenomenon. This term refers to the accumulation of errors, usually causing resource contention, during long-running application executions, such as web applications, which normally causes applications or systems to hang or crash. Gradual performance degradation may also accompany the software aging phenomenon, which is often related to memory bloating/leaks, unterminated threads, data corruption, unreleased file locks or overruns. We can find several examples of software aging in industry. The work presented in this thesis aims to offer a proactive and predictive software rejuvenation solution for Internet services against software aging caused by resource exhaustion. To this end, we first present a threshold-based proactive rejuvenation approach to avoid the consequences of software aging. This first approach has some limitations, the most important being the need to know a priori the resource or resources involved in the crash and their critical values; moreover, some expertise is needed to fix the threshold value that triggers the rejuvenation action. Due to these limitations, we have evaluated the use of Machine Learning to overcome the weaknesses of our first approach and obtain a proactive and predictive solution. Finally, the current and increasing tendency to use virtualization technologies to improve resource utilization has turned traditional data centers into virtualized data centers or platforms. We have used a Mathematical Programming approach to virtual machine allocation and migration to optimize the resources, accepting as many services as possible on the platform while, at the same time, guaranteeing the availability of the deployed services (via our software rejuvenation proposal) against the software aging phenomenon. The thesis is supported by an exhaustive experimental evaluation that proves the effectiveness and feasibility of our proposals for current systems.
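    A minimal sketch of the threshold-based trigger mentioned above, assuming a hypothetical monitor that samples one resource known a priori to be involved in the crash (free memory here) and restarts the aged service when it drops below an operator-chosen threshold; the ML-based variant described in the thesis replaces this fixed, expert-set threshold with learned predictions.

        import time

        def monitor_and_rejuvenate(read_free_memory_mb, rejuvenate,
                                   threshold_mb=256, interval_s=60):
            """Threshold-based proactive rejuvenation loop.
            read_free_memory_mb and rejuvenate are caller-supplied callables
            (hypothetical): the first samples the monitored resource, the
            second restarts the aged process before it crashes."""
            while True:
                if read_free_memory_mb() < threshold_mb:
                    rejuvenate()          # e.g. restart the aged web server process
                time.sleep(interval_s)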

    Programming and parallelising applications for distributed infrastructures

    The last decade has witnessed unprecedented changes in parallel and distributed infrastructures. Due to the diminished gains in processor performance from increasing clock frequency, manufacturers have moved from uniprocessor architectures to multicores; as a result, clusters of computers have incorporated such new CPU designs. Furthermore, the ever-growing need of scientific applications for computing and storage capabilities has motivated the appearance of grids: geographically-distributed, multi-domain infrastructures based on sharing of resources to accomplish large and complex tasks. More recently, clouds have emerged by combining virtualisation technologies, service-orientation and business models to deliver IT resources on demand over the Internet. The size and complexity of these new infrastructures pose a challenge for programmers to exploit them. On the one hand, some of the difficulties are inherent to concurrent and distributed programming themselves, e.g. dealing with thread creation and synchronisation, messaging, data partitioning and transfer, etc. On the other hand, other issues are related to the singularities of each scenario, like the heterogeneity of Grid middleware and resources or the risk of vendor lock-in when writing an application for a particular Cloud provider. In the face of such a challenge, programming productivity - understood as a trade-off between programmability and performance - has become crucial for software developers. There is a strong need for high-productivity programming models and languages, which should provide simple means for writing parallel and distributed applications that can run on current infrastructures without sacrificing performance. In that sense, this thesis contributes Java StarSs, a programming model and runtime system for developing and parallelising Java applications on distributed infrastructures. The model has two key features: first, the user programs in a fully-sequential, standard-Java fashion - no parallel construct, API call or pragma needs to be included in the application code; second, it is completely infrastructure-unaware, i.e. programs do not contain any details about deployment or resource management, so that the same application can run on different infrastructures with no changes. The only requirement for the user is to select the application tasks, which are the model's unit of parallelism. Tasks can be either regular Java methods or web service operations, and they can handle any data type supported by the Java language, namely files, objects, arrays and primitives. For the sake of simplicity of the model, Java StarSs shifts the burden of parallelisation from the programmer to the runtime system. The runtime is responsible for modifying the original application to make it create asynchronous tasks and synchronise data accesses from the main program. Moreover, the implicit inter-task concurrency is automatically found as the application executes, thanks to a data dependency detection mechanism that integrates all the Java data types. This thesis provides a fairly comprehensive evaluation of Java StarSs on three different distributed scenarios: Grid, Cluster and Cloud. For each of them, a runtime system was designed and implemented to exploit their particular characteristics as well as to address their issues, while keeping the infrastructure unawareness of the programming model. The evaluation compares Java StarSs against state-of-the-art solutions, both in terms of programmability and performance, and demonstrates how the model can bring remarkable productivity to programmers of parallel distributed applications.
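    Java StarSs itself is Java-based and keeps the application code purely sequential; the dependency-detection idea its runtime relies on can be illustrated with a small, language-neutral sketch (Python here, names purely illustrative, not the Java StarSs API): every intercepted task call declares which data it reads and writes, and the runtime derives the task graph from those accesses.

        def build_dependency_graph(task_calls):
            """task_calls: ordered (name, reads, writes) tuples, in the order the
            sequential program would issue them. Returns edges (i -> j) meaning
            task j must wait for task i (read-after-write / write-after-write)."""
            edges, last_writer = [], {}
            for j, (name, reads, writes) in enumerate(task_calls):
                for datum in reads | writes:
                    if datum in last_writer:
                        edges.append((last_writer[datum], j))
                for datum in writes:
                    last_writer[datum] = j
            return edges

        # t2 reads the file written by t1, so it depends on it; t3 is independent.
        calls = [("t1", set(), {"a"}), ("t2", {"a"}, set()), ("t3", {"b"}, {"c"})]
        print(build_dependency_graph(calls))   # [(0, 1)]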

    An adaptive trust based service quality monitoring mechanism for cloud computing

    Cloud computing is the newest paradigm in distributed computing that delivers computing resources over the Internet as services. Due to the attractiveness of cloud computing, the market is currently flooded with many service providers. This has made it necessary for customers to identify the provider that meets their requirements in terms of service quality. Existing service quality monitoring in cloud computing has been limited to quantification only. On the other hand, the continuous improvement and distribution of service quality scores have been implemented in other distributed computing paradigms, but not specifically for cloud computing. This research investigates the relevant methods and proposes mechanisms for quantifying and ranking the service quality of service providers. The solution proposed in this thesis consists of three mechanisms, namely a service quality modeling mechanism, an adaptive trust computing mechanism and a trust distribution mechanism for cloud computing. The Design Research Methodology (DRM) has been modified by adding phases, means and methods, and probable outcomes; this modified DRM is used throughout the study. The mechanisms were developed and tested gradually until the expected outcome was achieved. A comprehensive set of experiments was carried out in a simulated environment to validate their effectiveness. The evaluation was carried out by comparing their performance against the combined trust model and the QoS trust model for cloud computing, along with the adapted fuzzy-theory-based trust computing mechanism and the super-agent-based trust distribution mechanism, which were developed for other distributed systems. The results show that the proposed mechanisms are faster and more stable than the existing solutions in terms of reaching the final trust scores on all three parameters tested. The results presented in this thesis are significant in making cloud computing acceptable to users, since they allow the performance of service providers to be verified before making a selection.
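    The abstract does not give the actual formulas, so the following is only a generic illustration of adaptive trust computation and ranking over monitored quality scores; the weighting scheme and numbers are assumptions, not the thesis mechanisms.

        def update_trust(current_trust, measured_quality, weight=0.3):
            """Exponentially weighted update: new observations move the trust
            score gradually, so it adapts without oscillating (weight assumed)."""
            return (1 - weight) * current_trust + weight * measured_quality

        def rank_providers(trust_scores):
            """Rank providers by their current trust score, highest first."""
            return sorted(trust_scores, key=trust_scores.get, reverse=True)

        # Example: three hypothetical providers observed over time.
        trust = {"A": 0.5, "B": 0.5, "C": 0.5}
        observations = {"A": [0.9, 0.8, 0.85], "B": [0.4, 0.5, 0.45], "C": [0.7, 0.2, 0.9]}
        for provider, samples in observations.items():
            for quality in samples:
                trust[provider] = update_trust(trust[provider], quality)
        print(rank_providers(trust))   # ['A', 'C', 'B']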

    A job response time prediction method for production Grid computing environments

    A major obstacle to the widespread adoption of Grid computing, in both the scientific community and the industry sector, is the difficulty of knowing in advance the running cost of a job submission, which is needed to plan a correct allocation of resources. Traditional distributed computing solutions take advantage of homogeneous and open environments to propose prediction methods based on a detailed analysis of the hardware and software components. However, production Grid computing environments, which are large and use a complex and dynamic set of resources, present a different challenge. In Grid computing, the source code of applications, programme libraries, and third-party software is not always available. In addition, Grid security policies may not permit running hardware or software analysis tools to generate models of Grid components. The objective of this research is the prediction of job response time in production Grid computing environments. The solution is inspired by the concept of predicting future Grid behaviours based on previous experiences learned from heterogeneous Grid workload trace data. This research objective was selected with the aim of improving Grid resource usability and the administration of Grid environments. The predicted data can be used to allocate resources in advance and to report the forecast finishing time and running costs before submission. The proposed Grid Computing Response Time Prediction (GRTP) method implements several internal stages in which the workload traces are mined to produce a response time prediction for a given job. In addition, the GRTP method assesses the predicted result against the target job's actual response time to infer information that is used to tune the method's setting parameters. The GRTP method was implemented and tested using a cross-validation technique to assess how the proposed solution generalises to independent data sets. The training set was taken from the Grid environment DAS (Distributed ASCI Supercomputer), and the two testing sets were taken from the AuverGrid and Grid5000 Grid environments. Three consecutive tests were carried out: assuming stable jobs, assuming unstable jobs, and using a job-type method to select the most appropriate prediction function. The tests showed a significant increase in prediction performance for data-mining-based methods applied in Grid computing environments. For instance, in Grid5000 the GRTP method answered 77 percent of job prediction requests with an error of less than 10 percent, while in the same environment the most effective and accurate existing method using workload traces was only able to predict 32 percent of the cases within the same error range. The GRTP method was able to handle unexpected changes in resources and services that affect job response time trends, and it adapted to new scenarios. The tests showed that the proposed GRTP method is capable of answering job response time prediction requests and that it improves prediction quality when compared to other current solutions.
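    A minimal sketch of the trace-driven prediction idea and of the "fraction of predictions within a given error" metric quoted above, assuming scikit-learn and invented toy features; the real GRTP method mines heterogeneous workload traces and tunes its parameters from its own prediction errors.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        def fraction_within(y_true, y_pred, tolerance=0.10):
            """Share of predictions whose relative error is below the tolerance,
            e.g. the '77 percent within 10 percent' style of result."""
            rel_err = np.abs(y_pred - y_true) / np.maximum(y_true, 1e-9)
            return float(np.mean(rel_err < tolerance))

        # Toy workload-trace features (illustrative): requested CPUs, requested
        # runtime, queue length at submission. Targets: observed response times.
        rng = np.random.default_rng(0)
        X_train = rng.uniform(size=(500, 3))
        y_train = 100 * X_train[:, 0] + 50 * X_train[:, 2] + rng.normal(0, 5, 500)
        X_test = rng.uniform(size=(100, 3))
        y_test = 100 * X_test[:, 0] + 50 * X_test[:, 2] + rng.normal(0, 5, 100)

        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
        print(fraction_within(y_test, model.predict(X_test)))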

    A Hibernation Aware Dynamic Scheduler for Cloud Environments

    Nowadays, cloud platforms usually offer several types of Virtual Machines (VMs) which have different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs while spot VMs are unused instances available for a lower price. Despite the monetary advantages, a spot VM can be terminated or hibernated by EC2 at any moment. In this work, we propose the Hibernation-Aware Dynamic Scheduler (HADS) to schedule applications composed of independent tasks (bag-of-tasks) with deadline constraints on both hibernation-prone spot VMs (for cost's sake) and on-demand VMs. We also consider the problem of temporal failures, which occur when a spot VM hibernates and does not resume within a time that allows the application's deadline to be met. Our dynamic scheduling approach aims at minimizing the monetary cost of executing bag-of-tasks applications, respecting their deadlines even in the presence of hibernation. It is also able to avoid temporal failures by using task migration and work-stealing techniques. Experimental results with real executions on Amazon EC2 VMs confirm the effectiveness of our scheduling, in terms of monetary costs and execution times, when compared with approaches based only on on-demand VMs. It is also shown that our strategy can tolerate temporal failures.
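    The migration decision HADS has to take when a spot VM hibernates can be sketched roughly as follows (Python, with hypothetical record fields; the actual HADS policy, its work-stealing strategy and its pricing details are in the paper).

        from dataclasses import dataclass, field

        @dataclass
        class Task:
            name: str
            runtime_h: float          # estimated remaining execution time (hours)

        @dataclass
        class VM:
            name: str
            price_per_hour: float
            pending_tasks: list = field(default_factory=list)

        def handle_hibernation(hibernated_vm, now_h, deadline_h, ondemand_pool):
            """When a spot VM hibernates, keep waiting only while the remaining
            slack still allows a pending task to finish elsewhere; once the slack
            is gone, migrate it to the cheapest on-demand VM so the application
            deadline is still met and no temporal failure occurs."""
            still_waiting, migrated = [], []
            for task in hibernated_vm.pending_tasks:
                latest_start = deadline_h - task.runtime_h     # last safe start time
                if now_h >= latest_start:                      # slack exhausted: migrate
                    target = min(ondemand_pool, key=lambda v: v.price_per_hour)
                    target.pending_tasks.append(task)
                    migrated.append((task.name, target.name))
                else:                                          # still safe to keep waiting
                    still_waiting.append(task)
            hibernated_vm.pending_tasks = still_waiting
            return migrated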

    Improved self-management of datacenter systems applying machine learning

    Autonomic Computing is a Computer Science and Technologies research area that originated in the mid-2000s. It focuses on the optimization and improvement of complex distributed computing systems through self-control and self-management. As distributed computing systems grow in complexity, like multi-datacenter systems in cloud computing, system operators and architects need more help to understand, design and optimize these systems manually, even more so when they are distributed around the world and belong to different entities and authorities. Self-management lets these distributed computing systems improve their resource and energy management, a very important issue when resources have a cost, whether for obtaining, running or maintaining them. Here we propose to improve Autonomic Computing techniques for resource management by applying modeling and prediction methods from Machine Learning and Artificial Intelligence. Machine Learning methods can build accurate models of system behavior, often with intelligible explanations, and can predict and infer system states and values. Models obtained from automatic learning have the advantage of being easily updated to workload or configuration changes by re-taking examples and re-training the predictors. By employing automatic modeling and prediction abilities, we can find new methods for making "intelligent" decisions and discovering new information and knowledge about systems. This thesis moves from the state of the art, where management is based on administrator expertise, well-known data, ad-hoc algorithms and models, and elements studied from the point of view of a single computing machine, to a novel state of the art where management is driven by models learned from the system itself, providing useful feedback and making up for incomplete, missing or uncertain data, from the point of view of a global network of datacenters.
    - First, we cover the scenario where the decision maker works knowing all pieces of information about the system: how much each job will consume, what the desired quality of service is and will be, what the workload deadlines are, etc., focusing on each component and policy of each element involved in executing these jobs.
    - Then we focus on the scenario where, instead of fixed oracles that provide information from an expert formula or set of conditions, machine learning is used to create those oracles. Here we look at components and specific details while part of the information is not known and must be learned and predicted.
    - We reduce the problem of optimizing resource allocation and requirements for virtualized web services to a mathematical problem, indicating each factor, variable and element involved, as well as all the constraints the scheduling process must respect. The scheduling problem can be modeled as a Mixed Integer Linear Program (see the sketch after this list). Here we face the scenario of a full datacenter, and we further introduce some information prediction.
    - We complement the model by expanding the predicted elements, studying the main resources (i.e. CPU, memory and I/O), which can suffer from noise, inaccuracy or unavailability. Once learned predictors for certain components improve decision making, the system can become more "expert-knowledge independent" and research can focus on a scenario where all the elements provide noisy, uncertain or private information. We also introduce new factors into the management optimization, since for each datacenter the context and costs may change, turning the model into a "multi-datacenter" one.
    - Finally, we review the cost of placing datacenters depending on green energy sources, and distribute the load according to green energy availability.
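    The abstract states that the scheduling problem is modeled as a Mixed Integer Linear Program; a tiny sketch of that style of formulation, here using the PuLP library and only CPU/memory capacity constraints (the thesis model also covers quality of service, deadlines, migration and energy terms), could look like this.

        import pulp

        def allocate(services, hosts):
            """services/hosts: dicts mapping names to (cpu, mem) demands/capacities.
            Maximizes the number of accepted services subject to host capacities."""
            prob = pulp.LpProblem("vm_allocation", pulp.LpMaximize)
            pairs = [(s, h) for s in services for h in hosts]
            x = pulp.LpVariable.dicts("place", pairs, cat="Binary")
            prob += pulp.lpSum(x[s, h] for s, h in pairs)           # accepted services
            for s in services:                                      # at most one host each
                prob += pulp.lpSum(x[s, h] for h in hosts) <= 1
            for h in hosts:                                         # capacity constraints
                prob += pulp.lpSum(services[s][0] * x[s, h] for s in services) <= hosts[h][0]
                prob += pulp.lpSum(services[s][1] * x[s, h] for s in services) <= hosts[h][1]
            prob.solve(pulp.PULP_CBC_CMD(msg=False))
            return {s: h for s, h in pairs if x[s, h].value() > 0.5}

        # Only one of the two services fits on the single 4-CPU / 8-GB host.
        print(allocate({"web": (2, 4), "db": (4, 8)}, {"host1": (4, 8)}))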

    Semantic resource management and interoperability between distributed computing platforms

    Distributed Computing is the paradigm where application execution is distributed across different computers connected by a communication network. Distributed Computing platforms have evolved very fast during the last decades: starting from Clusters, where a set of computers work together in a single location; then evolving to Grids, where computing resources are shared by different entities, creating a global computing infrastructure available to different user communities; and finally becoming what is currently known as the Cloud, where computing and data resources are provided on demand, in a very dynamic fashion, following the Utility Computing model where you pay only for what you consume. Different types of companies and institutions are exploring the potential benefits of moving their IT services and applications to Cloud infrastructures in order to decouple the management of computing resources from their core business processes and become more productive. Nevertheless, migrating software to Clouds is not an easy task, since it requires deep knowledge of the technology to decompose the application, as well as of the capabilities offered by providers and how to use them. Besides this complex deployment process, the current cloud marketplace has several providers offering resources with different capabilities, prices and quality, and each provider uses its own properties and APIs for describing and accessing its resources. Therefore, when customers want to execute an application on a provider's resources, they must understand the different providers' descriptions, compare them and select the most suitable resources for their interests. Once the provider and resources have been selected, developers have to interoperate with the different providers' interfaces to perform the application execution steps. To carry out all these steps, application developers have to deal with the design and implementation of complex integration procedures. This thesis presents several contributions to overcome the aforementioned problems by providing a platform that facilitates and automates the integration of applications into different providers' infrastructures, lowering the barrier to adopting new distributed computing infrastructures such as Clouds. The achievement of this objective has been split into several parts. In the first part, we have studied how semantic web technologies can help to describe applications and to automatically infer a model for deploying them on a distributed platform. Once the application deployment model has been inferred, the second step is finding the resources on which to deploy and execute the different application components. Regarding this topic, we have studied how semantic web technologies can be applied to the resource allocation problem. Once the different components have been allocated to the providers' resources, the application components are deployed and executed on those resources by invoking a workflow of provider API calls. However, every provider defines its own management interfaces, so the workflow to perform the same actions differs depending on the selected provider. In this thesis, we propose a framework to automatically infer the workflow of provider interface calls required to perform any resource management task. In the last part of the thesis, we have studied how to introduce the benefits of software agents for coordinating application management in distributed platforms. We propose a multi-agent system which is in charge of coordinating the different steps of the application deployment in a distributed way, as well as monitoring the correct execution of the application on the computing resources. The different contributions have been validated with a prototype implementation and a set of use cases.
    Distributed Computing is a paradigm in which application execution is distributed among different computers connected through a communication network. Distributed computing platforms have evolved rapidly over the last decades: starting with Clusters, where several computers are connected by a local network; moving on to Grids, where computational resources are shared by several institutions, creating a global computing network; and finally arriving at what we currently know as Clouds, where resources can be provisioned dynamically, on demand, paying only for what is consumed. Nowadays, many companies are discovering the benefits of moving their applications to Cloud infrastructures, decoupling the administration of computing resources from their core business in order to be more productive. However, migrating software to the Cloud is not an easy task, because it requires exhaustive knowledge of the technology and of how to use the services offered by the different providers. Moreover, each provider offers resources with different capabilities, prices and quality, each with its own interface to access them. Consequently, when users want to execute an application in the Cloud, they must understand what each provider offers and how to use it, and once they have chosen they must program the different steps of the deployment of their application. If, in addition, they want to use several providers or switch to another one, this process must be repeated several times. This thesis presents several contributions to mitigate these problems by designing a platform that facilitates and automates the integration of applications with the different providers. These contributions are divided into several parts: first, the study of how semantic technologies can help to describe applications and automatically infer how they can be deployed on a distributed platform. Once we obtain this deployment model, the second contribution presents how these same technologies can be used to assign the different parts of the application deployment to the providers' resources. Once the assignment is known, the next contribution addresses how AI planning can be used to find the sequence of services that must be executed to perform the desired deployment. Finally, the last part of the thesis presents how the deployment and execution of applications can be coordinated by a multi-agent system in a scalable and distributed way. The different contributions of the thesis have been validated through the implementation of prototypes and use cases.
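    The abstract mentions automatically inferring the workflow of provider interface calls; one way to illustrate that idea is a very small precondition/effect planner in the spirit of AI planning (the operation names and facts below are purely illustrative, not a real provider API or the thesis framework).

        def plan_workflow(operations, start_state, goal):
            """Tiny forward-chaining planner: each operation has preconditions and
            effects; operations are chained until the goal facts hold."""
            state, plan = set(start_state), []
            while not goal <= state:
                progressed = False
                for name, (preconditions, effects) in operations.items():
                    if name not in plan and preconditions <= state:
                        state |= effects
                        plan.append(name)
                        progressed = True
                        break
                if not progressed:
                    raise ValueError("goal not reachable with the given operations")
            return plan

        # Hypothetical provider operations and the deployment goal.
        ops = {
            "create_vm":   (set(),                            {"vm_created"}),
            "attach_disk": ({"vm_created"},                   {"disk_attached"}),
            "deploy_app":  ({"vm_created", "disk_attached"},  {"app_deployed"}),
        }
        print(plan_workflow(ops, set(), {"app_deployed"}))
        # ['create_vm', 'attach_disk', 'deploy_app']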