93 research outputs found

    Enhancing reliability with Latin Square redundancy on desktop grids.

    Computational grids are some of the largest computer systems in existence today. Unfortunately, they are also, in many cases, the least reliable. This research examines the use of redundancy with permutation as a method of improving reliability in computational grid applications. Three primary avenues are explored: development of a new redundancy model for computational grids, the Replication and Permutation Paradigm (RPP); development of grid simulation software for testing RPP against other redundancy methods; and, finally, running a program on a live grid using RPP. An important part of RPP involves distributing data and tasks across the grid in Latin Square fashion. Two theorems and subsequent proofs regarding Latin Squares are developed. The theorems describe the changing position of symbols between the rows of a standard Latin Square. When a symbol is missing because a column is removed, the theorems provide a basis for determining the next row and column where the missing symbol can be found. Interesting in their own right, the theorems have implications for redundancy. In terms of the redundancy model, the theorems allow one to state the maximum makespan in the face of missing computational hosts when using Latin Square redundancy. The simulator software was developed and used to compare different data and task distribution schemes on a simulated grid. The software clearly showed the advantage of running RPP, which resulted in faster completion times in the face of computational host failures. The Latin Square method also fails gracefully, in that jobs still complete under massive node failure, at the cost of an increased makespan. Finally, an Inductive Logic Program (ILP) for pharmacophore search was executed, using a Latin Square redundancy methodology, on a Condor grid in the Dahlem Lab at the University of Louisville Speed School of Engineering. All jobs completed, even in the face of large numbers of randomly generated computational host failures.
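The Latin Square idea in the abstract above can be illustrated with a small sketch. This is not the thesis's RPP implementation: it assumes a cyclic Latin Square in which columns stand for hosts, rows for successive rounds, and symbols for data blocks, and the function names `cyclic_latin_square` and `rounds_to_cover` are hypothetical. It only shows how removing columns (failed hosts) delays, but does not prevent, every symbol from being processed.

```python
def cyclic_latin_square(n):
    """Build an n x n Latin Square where row r, column c holds (r + c) % n."""
    return [[(r + c) % n for c in range(n)] for r in range(n)]


def rounds_to_cover(n, failed_hosts):
    """Count the rows (rounds) needed until every symbol has appeared
    in at least one surviving column (host).

    Assumption for illustration: one row is processed per round, and a
    failed host is simply a column that never contributes.
    """
    alive = [c for c in range(n) if c not in failed_hosts]
    if not alive:
        raise ValueError("at least one host must survive")
    done = set()
    for round_number, row in enumerate(cyclic_latin_square(n), start=1):
        # Each surviving host processes the symbol in its column this round.
        done.update(row[c] for c in alive)
        if len(done) == n:
            return round_number
    # Unreachable: a surviving column of a Latin Square contains every symbol.
    return n


if __name__ == "__main__":
    print(rounds_to_cover(5, set()))      # no failures
    print(rounds_to_cover(5, {2}))        # one failed host
    print(rounds_to_cover(4, {0, 1, 2}))  # only one host survives
```

Because every column of a Latin Square contains each symbol exactly once, a single surviving host still covers all symbols within n rounds in this toy model, which mirrors the graceful-degradation behaviour the abstract describes.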

    Contributions to Desktop Grid Computing : From High Throughput Computing to Data-Intensive Sciences on Hybrid Distributed Computing Infrastructures

    Since the mid-1990s, Desktop Grid Computing, i.e. the idea of using a large number of remote PCs distributed on the Internet to execute large parallel applications, has proved to be an efficient paradigm for providing large computational power at a fraction of the cost of a dedicated computing infrastructure. This document presents my contributions over the last decade to broaden the scope of Desktop Grid Computing. My research has followed three different directions. The first direction has established new methods to observe and characterize Desktop Grid resources and developed experimental platforms to test and validate our approach in conditions close to reality. The second line of research has focused on integrating Desktop Grids into e-science Grid infrastructures (e.g. EGI), which requires addressing many challenges such as security, scheduling, and quality of service. The third direction has investigated how to support large-scale data management and data-intensive applications on such infrastructures, including support for new and emerging data-oriented programming models. This manuscript reports not only on the scientific achievements and the technologies developed to support our objectives, but also on the international collaborations and projects I have been involved in, as well as the scientific mentoring that motivates my candidature for the Habilitation à Diriger les Recherches.

    Light-Weight Hierarchical Clustering Middleware for Public-Resource Computing

    The goal of this work was to investigate ways to implement and improve a public-resource computing middleware, specifically to make hosting a public-resource computing project logistically simpler and to examine the effect of hierarchical clustering on bandwidth utilization at the central server. To this end, we present the architecture for our cross-platform, multithreaded public-resource computing middleware. Implementing and debugging the middleware proved far more challenging than initially anticipated. As hard as debugging multithreaded programs is, our experience has shown us that it can be leveraged to simplify system components. Our main contribution is the final system architecture. Computer Science Department.

    Economic-based Distributed Resource Management and Scheduling for Grid Computing

    Computational Grids, emerging as an infrastructure for next-generation computing, enable the sharing, selection, and aggregation of geographically distributed resources for solving large-scale problems in science, engineering, and commerce. As the resources in the Grid are heterogeneous and geographically distributed, with varying availability and a variety of usage and cost policies for diverse users at different times, and with priorities and goals that vary with time, the management of resources and application scheduling in such a large and distributed environment is a complex task. This thesis proposes a distributed computational economy as an effective metaphor for the management of resources and application scheduling. It proposes an architectural framework that supports resource trading and quality-of-service-based scheduling. It enables the regulation of supply and demand for resources, provides an incentive for resource owners to participate in the Grid, and motivates users to trade off between deadline, budget, and the required level of quality of service. The thesis demonstrates the capability of economic-based systems for peer-to-peer distributed computing by developing scheduling strategies and algorithms driven by users' quality-of-service requirements. It demonstrates their effectiveness by performing scheduling experiments on the World-Wide Grid for solving parameter sweep applications.

    Ant colony optimization algorithm for load balancing in grid computing

    Managing resources in a grid computing system is complicated due to the distributed and heterogeneous nature of the resources. This research proposes an enhancement of the ant colony optimization algorithm that caters for dynamic scheduling and load balancing in the grid computing system. The proposed algorithm is known as enhanced ant colony optimization (EACO). The algorithm consists of three new mechanisms that organize the work of an ant colony, i.e. an initial pheromone value mechanism, a resource selection mechanism, and a pheromone update mechanism. The resource allocation problem is modelled as a graph that can be used by the ant to deliver its pheromone. This graph consists of four types of vertices, namely job, requirement, resource, and capacity, which are used in constructing the grid resource management element. The proposed EACO algorithm takes into consideration the capacity of resources and the characteristics of jobs in determining the best resource to process a job. EACO selects resources based on the pheromone value of each resource, which is recorded in matrix form. The initial pheromone value of each resource for each job is calculated based on the estimated transmission time and execution time of a given job. Resources with high pheromone values are selected to process the submitted jobs. A global pheromone update is performed after the jobs have been processed in order to reduce the pheromone value of resources. A simulation environment was developed in Java to test the performance of the proposed EACO algorithm against other ant-based algorithms in terms of resource utilization. Experimental results show that EACO produced a better grid resource management solution.
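The three mechanisms named in the abstract above (initial pheromone value, resource selection, pheromone update) can be sketched in a few lines. The abstract does not give EACO's formulas, so everything specific here is an assumption made for illustration: the initial pheromone is taken as the inverse of estimated transmission plus execution time, the update is a simple multiplicative reduction applied right after each assignment, and all function names and the `evaporation` parameter are hypothetical.

```python
def initial_pheromone(transmission_time, execution_time):
    """Assumed formula: faster estimated completion gives more pheromone."""
    return 1.0 / (transmission_time + execution_time)


def select_resource(pheromone_row):
    """Pick the resource index with the highest pheromone for one job."""
    return max(range(len(pheromone_row)), key=pheromone_row.__getitem__)


def global_update(pheromone, resource, evaporation):
    """Reduce the pheromone of the chosen resource for every job
    (assumed multiplicative reduction, applied per assignment here)."""
    for row in pheromone:
        row[resource] *= 1.0 - evaporation


def schedule(estimates, evaporation=0.1):
    """Assign each job to a resource.

    estimates[j][r] is a (transmission_time, execution_time) pair for
    job j on resource r; the pheromone matrix has the same shape.
    """
    pheromone = [[initial_pheromone(t, e) for (t, e) in job] for job in estimates]
    assignment = []
    for row in pheromone:
        resource = select_resource(row)
        assignment.append(resource)
        global_update(pheromone, resource, evaporation)
    return assignment


if __name__ == "__main__":
    estimates = [
        [(1.0, 1.0), (2.0, 2.0)],   # job 0: resource 0 is clearly faster
        [(1.0, 1.0), (1.0, 1.05)],  # job 1: resources are nearly equal
    ]
    print(schedule(estimates))
```

In this toy model, reducing the pheromone of a resource that has just been chosen nudges later, nearly indifferent jobs toward other resources, which is the load-balancing effect the abstract attributes to the update step.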

    NetJobs: A new approach to network monitoring for the Grid using Grid jobs

    With grid computing, far-flung and disparate IT resources act as a single "virtual datacenter". Grid computing interfaces heterogeneous IT resources so that they are available when and where we need them. The Grid allows us to provision applications and allocate capacity among research and business groups that are geographically and organizationally dispersed. Building a high-availability Grid is held to be the next goal to achieve: protecting against computer failures and site failures in order to avoid resource downtime and honor Service Level Agreements. Network monitoring has a key role in this challenge. This work concerns the design and prototype implementation of a new approach to network monitoring for the Grid based on the use of Grid scheduled jobs. This work was carried out within the Network Support task (SA2) of the Enabling Grids for E-sciencE (EGEE) project. This thesis is organized as follows. Chapter 1: Grid Computing. From the origins of Grid Computing to the latest projects; the conceptual framework and main features characterizing many kinds of popular grids are presented. Chapter 2: The EGEE and EGI projects. This chapter describes the Enabling Grids for E-sciencE (EGEE) project and the European Grid Infrastructure (EGI). The EGEE project (2004-2010) was the flagship Grid infrastructure project of the EU. The third and last two-year phase of the project (started on 1 May 2008) was financed with a total budget of around 47 million euro, with a further estimated 50 million euro worth of computing resources contributed by the partners. The project involved a total manpower of 9,000 person-months, of which over 4,500 were contributed by the partners from their own funding sources. At its close, EGEE represented a worldwide infrastructure of approximately 200,000 CPU cores, collaboratively hosted by more than 300 centres around the world. By the end of the project, around 13 million jobs were executed on the EGEE grid each month. The new organization, EGI.eu, was then created to continue the coordination and evolution of the European Grid Infrastructure (EGI) based on the EGEE Grid. Chapter 3: gLite Middleware. Chapter three gives an overview of the gLite Grid middleware. gLite is the middleware stack for grid computing used by the EGEE and EGI projects within a very large variety of scientific domains. Born from the collaborative efforts of more than 80 people in 12 different academic and industrial research centers as part of the EGEE project, gLite provides a complete set of services for building a production grid infrastructure. gLite provides a framework for building grid applications that tap into the power of distributed computing and storage resources across the Internet. The gLite services are currently adopted by more than 250 computing centres and used by more than 15,000 researchers in Europe and around the world. Chapter 4: Network Activity in EGEE/EGI. Grid infrastructures are distributed by nature, involving many sites, normally in different administrative domains. Individual sites are connected together by a network, which is therefore a critical part of the whole Grid infrastructure; without the network there is no Grid. Monitoring is a key component for the successful operation of any infrastructure, helping in the discovery and diagnosis of any problem that may arise. Network monitoring contributes to the day-to-day operations of the Grid by helping to provide answers to specific questions from users and site administrators. This chapter discusses the efforts made by EGEE and EGI in the Grid network domain. Chapter 5: Grid Network Monitoring based on Grid Jobs. NetJobs is a prototype of a lightweight solution for Grid network monitoring. A job-based approach has been used in order to prove the feasibility of this non-intrusive solution. It is currently configured to monitor eight production sites spread from Italy to France, but this method could be applied to the vast majority of Grid sites. The prototype provides coherent RTT, MTU, number-of-hops, and TCP achievable bandwidth tests.

    Scheduling for Large Scale Distributed Computing Systems: Approaches and Performance Evaluation Issues

    Although our everyday life and society now depend heavily on communication infrastructures and computation infrastructures, scientists and engineers have always been among the main consumers of computing power. This document provides a coherent overview of the research I have conducted in the last 15 years, which targets the management and performance evaluation of large scale distributed computing infrastructures such as clusters, grids, desktop grids, volunteer computing platforms, ... when used for scientific computing. In the first part of this document, I present how I have addressed scheduling problems arising on distributed platforms (like computing grids) with a particular emphasis on heterogeneity and multi-user issues, hence in connection with game theory. Most of these problems are relaxed from a classical combinatorial optimization formulation into a continuous form, which makes it easy to account for key platform characteristics such as heterogeneity or complex topology while providing efficient, practical, and distributed solutions. The second part presents my main contributions to the SimGrid project, which is a simulation toolkit for building simulators of distributed applications (originally designed for scheduling algorithm evaluation purposes). It comprises a unified presentation of how the questions of validation and scalability have been addressed in SimGrid, as well as thoughts on specific challenges related to methodological aspects and to the application of SimGrid to the HPC context.

    Scheduling and synchronization for multicore concurrency platforms

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 217-230). Developing correct and efficient parallel programs is difficult since programmers often have to manage low-level details like scheduling and synchronization explicitly. Recently, however, many hardware vendors have been shifting towards building multicore computers. This trend creates enormous pressure to create concurrency platforms: platforms that provide an easier interface for parallel programming and enable ordinary programmers to write scalable, portable, and efficient parallel programs. This thesis provides some provably good practical solutions to problems that arise in the implementation of concurrency platforms, particularly in the domain of scheduling and synchronization. The first part of this thesis describes work on scheduling of parallel programs written in dynamic multithreaded languages (such as Cilk, Hood, etc.). These languages allow programmers to express the parallelism of their code in a natural manner, while an automatic scheduler in the concurrency platform is responsible for scheduling the program on the underlying parallel hardware. This thesis presents designs to increase the functionality of these concurrency platforms. The second part of the thesis presents work on transactional memory semantics and design. Transactional memory (TM) has recently been proposed as an alternative to locks. TM provides a transactional interface to memory. Programmers can specify their critical sections inside a transaction, and the TM concurrency platform guarantees that the region executes atomically. One of the purported advantages of TM over locks is that transactional code is composable. Most of the current TM concurrency platforms do not support full composability, however. This thesis addresses two of the composability problems in existing TM concurrency platforms. By Kunal Agrawal. Ph.D.