    Making distributed computing infrastructures interoperable and accessible for e-scientists at the level of computational workflows

    As distributed computing infrastructures evolve and their take-up by user communities grows, making different types of infrastructures, based on heterogeneous middleware, interoperable is becoming crucial. This PhD submission, based on twenty scientific publications, presents a unique solution to the challenge of the seamless interoperation of distributed computing infrastructures at the level of workflows. The submission investigates workflow-level interoperation inside a particular workflow system (intra-workflow interoperation), and also between different workflow solutions (inter-workflow interoperation). In both cases, the interoperation of workflow component execution and the feeding of data into these workflow components are considered. The invented and developed framework enables the execution of legacy applications, grid jobs, and services on multiple grid systems, the feeding of data from heterogeneous file and data storage solutions to these workflow components, and the embedding of non-native workflows into a hosting meta-workflow. Moreover, the solution provides a high-level user interface that enables e-scientist end-users to conveniently access the interoperable grid solutions without requiring them to study or understand the technical details of the underlying infrastructure. The candidate has also developed an application porting methodology that enables the systematic porting of applications to interoperable and interconnected grid infrastructures, and facilitates the exploitation of the above technical framework.

    Contributions to Desktop Grid Computing : From High Throughput Computing to Data-Intensive Sciences on Hybrid Distributed Computing Infrastructures

    Since the mid-1990s, Desktop Grid Computing, i.e., the idea of using a large number of remote PCs distributed on the Internet to execute large parallel applications, has proved to be an efficient paradigm for providing large computational power at a fraction of the cost of a dedicated computing infrastructure. This document presents my contributions over the last decade to broaden the scope of Desktop Grid Computing. My research has followed three different directions. The first direction has established new methods to observe and characterize Desktop Grid resources and developed experimental platforms to test and validate our approach in conditions close to reality. The second line of research has focused on integrating Desktop Grids in e-science Grid infrastructure (e.g. EGI), which requires addressing many challenges such as security, scheduling, quality of service, and more. The third direction has investigated how to support large-scale data management and data-intensive applications on such infrastructures, including support for the new and emerging data-oriented programming models. This manuscript reports not only on the scientific achievements and the technologies developed to support our objectives, but also on the international collaborations and projects I have been involved in, as well as the scientific mentoring which motivates my candidature for the Habilitation à Diriger les Recherches.

    CATNETS Final Activity Report

    Developing a distributed electronic health-record store for India

    The DIGHT project is addressing the problem of building a scalable and highly available information store for the Electronic Health Records (EHRs) of the over one billion citizens of India.

    Building a Linux-based operating system to support virtual organizations in next-generation Grids

    This document comprises the final report on the IST Integrated Project XtreemOS - "Building and promoting a Linux-based operating system to support virtual organizations for next generation Grids". The project started in June 2006 and ended in September 2010. The XtreemOS operating system provides for Grids what a traditional operating system offers for a single computer: abstraction from the hardware and secure resource sharing between different users. It thus simplifies the work of users belonging to virtual organizations by giving them the illusion of using a traditional computer while removing the burden of complex resource management issues of a typical Grid environment. We have developed a comprehensive set of cooperating system services. XtreemOS software components range from Linux kernel modules to application-support libraries. The XtreemOS operating system provides three major distributed services to users: application execution management (providing scalable resource discovery and job scheduling for distributed interactive applications), data management (accessing and storing data in XtreemFS, a POSIX-like file system spanning the Grid) and virtual organization management (building and operating dynamic virtual organizations). Three flavours of the system have been implemented, for individual PCs, clusters and mobile devices (PDA, smartphone, notebook). The XtreemOS software has been tested and validated with a wide range of applications. Various demonstrators were implemented, shown at different events and published on the web. The project results are available as open source software. The consortium member organizations plan to exploit some of the results in follow-up research projects and in future products.

    Behavioural Models for Distributed Fractal Components

    This paper presents a formal behavioural specification framework, together with its applications in different contexts, for specifying and verifying the correct behaviour of distributed Fractal components. Our framework allows us to build behavioural models for applications ranging from sequential Fractal components, to distributed objects, and finally distributed components. Our models are able to characterise both functional and non-functional behaviours, and the interaction between the two concerns. Finally, this work has resulted in the development of tools that allow non-expert programmers to specify the behaviour of their components and to verify properties of their applications automatically or semi-automatically.

    Towards Highly Available and Self-Healing Grid Services

    The volatility of nodes in large-scale distributed systems endangers the availability of grid services and makes them difficult to use. In such a context, structured peer-to-peer overlays can be used to provide scalable and fault-tolerant communication mechanisms. To ensure the availability of services, active replication can be used on top of the overlays. In this paper, we present Semias, a framework that is based on active replication on top of a structured overlay to provide high availability and self-healing for stateful grid services. The self-healing mechanisms of Semias ensure the high availability of the replicated services while minimizing the number of reconfigurations. We have used Semias to make Vigne grid middleware services highly available. Experiments run on Grid'5000 and PlanetLab show the performance and self-healing properties of the framework.

    Resiliency in Distributed Workflows

    In this report we present a thorough study of the concept of resiliency in distributed workflow systems. We focus particularly on applying this concept in fields like numerical optimization, where any software or logical error could mean restarting the entire experiment. A theoretical study is presented along with a set of software tools for its implementation. Finally, a resilient algorithm schema is proposed for later refinement and implementation.

    Proactive software rejuvenation solution for web environments on virtualized platforms

    The availability of information technologies for everything, from everywhere, at all times is a growing requirement. We use information technologies for everything from everyday and social tasks to critical tasks such as managing nuclear power plants or even the International Space Station (ISS). However, the availability of IT infrastructures is still a huge challenge. A quick look at the news reveals reports of corporate outages affecting millions of users and damaging the revenue and image of the companies involved. It is well known that, currently, computer system outages are more often due to software faults than to hardware faults. Several studies have reported that one of the causes of unplanned software outages is the software aging phenomenon. This term refers to the accumulation of errors, usually causing resource contention, during long-running executions of applications such as web applications, which typically causes applications or systems to hang or crash. Gradual performance degradation can also accompany software aging. Software aging is often related to memory bloating or leaks, unterminated threads, data corruption, unreleased file-locks or overruns. We can find several examples of software aging in the industry. The work presented in this thesis aims to offer a proactive and predictive software rejuvenation solution for Internet services against software aging caused by resource exhaustion. To this end, we first present a threshold-based proactive rejuvenation approach to avoid the consequences of software aging. This first approach has some limitations, the most important of which is the need to know a priori the resource or resources involved in the crash and their critical values. Moreover, some expertise is needed to fix the threshold value that triggers the rejuvenation action. Due to these limitations, we have evaluated the use of Machine Learning to overcome the weaknesses of our first approach and obtain a proactive and predictive solution. Finally, the current and increasing tendency to use virtualization technologies to improve resource utilization has turned traditional data centers into virtualized data centers or platforms. We have used a Mathematical Programming approach to virtual machine allocation and migration to optimize resource usage, accepting as many services as possible on the platform while at the same time guaranteeing the availability (via our software rejuvenation proposal) of the deployed services against software aging. The thesis is supported by an exhaustive experimental evaluation that demonstrates the effectiveness and feasibility of our proposals for current systems.
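
To make the threshold-based idea in the abstract above concrete, here is a minimal sketch. It is not the thesis's implementation: the names `Threshold` and `first_violation`, the metric keys, and the limit values are all hypothetical, chosen only for illustration. It also shows the limitation the abstract mentions: a resource that was not configured in advance is never checked.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold: the resource must be known a priori."""
    resource: str   # metric name, e.g. "memory_mb" (illustrative)
    limit: float    # critical value, assumed to be chosen by an expert


def first_violation(
    samples: Iterable[Dict[str, float]],
    thresholds: List[Threshold],
) -> Optional[Tuple[int, str]]:
    """Return (sample index, resource) of the first threshold crossing.

    A caller would trigger the rejuvenation action (for example, a
    controlled restart) at that point. Returns None if no configured
    threshold is ever reached.
    """
    for index, sample in enumerate(samples):
        for threshold in thresholds:
            # Metrics missing from a sample are treated as zero usage.
            if sample.get(threshold.resource, 0.0) >= threshold.limit:
                return index, threshold.resource
    return None


if __name__ == "__main__":
    monitored = [
        {"memory_mb": 100.0, "threads": 20.0},
        {"memory_mb": 450.0, "threads": 400.0},
        {"memory_mb": 900.0, "threads": 9000.0},
    ]
    # Only memory is configured, so the thread growth goes unnoticed
    # until the memory limit is crossed.
    print(first_violation(monitored, [Threshold("memory_mb", 800.0)]))
```

Because the check is a pure function over recorded samples, the same thresholds can be replayed against past monitoring traces before being used on a live service.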

    A job response time prediction method for production Grid computing environments

    A major obstacle to the widespread adoption of Grid computing in both the scientific community and the industry sector is the difficulty of knowing in advance the running cost of a job submission, which could be used to plan a correct allocation of resources. Traditional distributed computing solutions take advantage of homogeneous and open environments to propose prediction methods that use a detailed analysis of the hardware and software components. However, production Grid computing environments, which are large and use a complex and dynamic set of resources, present a different challenge. In Grid computing, the source code of applications, program libraries, and third-party software is not always available. In addition, Grid security policies may not permit running hardware or software analysis tools to generate models of Grid components. The objective of this research is the prediction of a job's response time in production Grid computing environments. The solution is inspired by the concept of predicting future Grid behaviours based on previous experiences learned from heterogeneous Grid workload trace data. The research objective was selected with the aim of improving Grid resource usability and the administration of Grid environments. The predicted data can be used to allocate resources in advance and to report the forecasted finishing time and running costs before submission. The proposed Grid Computing Response Time Prediction (GRTP) method implements several internal stages in which the workload traces are mined to produce a response time prediction for a given job. In addition, the GRTP method assesses the predicted result against the target job's actual response time to infer information that is used to tune the method's parameter settings. The GRTP method was implemented and tested using a cross-validation technique to assess how the proposed solution generalises to independent data sets. The training set was taken from the Grid environment DAS (Distributed ASCI Supercomputer). The two testing sets were taken from the AuverGrid and Grid5000 Grid environments. Three consecutive tests were carried out: assuming stable jobs, assuming unstable jobs, and using a job type method to select the most appropriate prediction function. The tests showed a significant increase in prediction performance for data-mining-based methods applied in Grid computing environments. For instance, in Grid5000 the GRTP method answered 77 percent of job prediction requests with an error of less than 10 percent, while in the same environment the most effective and accurate method using workload traces was only able to predict 32 percent of the cases within the same range of error. The GRTP method was able to handle unexpected changes in resources and services which affect the job response time trends and was able to adapt to new scenarios. The tests showed that the proposed GRTP method is capable of predicting job response times and that it improves prediction quality when compared to other current solutions.
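
The headline figures above (the share of prediction requests answered with an error of less than 10 percent) can be read as a simple accuracy metric. The sketch below shows one way to compute such a share. The abstract does not define the error measure, so relative error with respect to the actual response time is an assumption here, and the function name `fraction_within` and the sample values are hypothetical.

```python
from typing import Sequence


def fraction_within(
    predicted: Sequence[float],
    actual: Sequence[float],
    tolerance: float = 0.10,
) -> float:
    """Share of predictions whose relative error is below `tolerance`.

    Assumption: relative error is |predicted - actual| / actual, with
    strictly positive actual response times.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one job is required")

    hits = 0
    for p, a in zip(predicted, actual):
        if a <= 0:
            raise ValueError("actual response times must be positive")
        if abs(p - a) / a < tolerance:
            hits += 1
    return hits / len(actual)


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    predicted_s = [100.0, 50.0, 200.0, 10.0]
    actual_s = [105.0, 80.0, 190.0, 10.0]
    print(fraction_within(predicted_s, actual_s))
```

With this definition, a result of 0.77 would correspond to the "77 percent of job prediction requests" phrasing used in the abstract, under the stated assumption about how error is measured.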