    Characterizing Result Errors in Internet Desktop Grids

    Desktop grids use the free resources in Intranet and Internet environments for large-scale computation and storage. While desktop grids offer a high return on investment, one critical issue is the validation of results returned by participating hosts. Several mechanisms for result validation have been previously proposed. However, the characteristics of errors are poorly understood. To study error rates, we implemented and deployed a desktop grid application across several thousand hosts distributed over the Internet. We then analyzed the results to give a quantitative, empirical characterization of error rates. We find that in practice, errors are widespread across hosts but occur relatively infrequently. Moreover, we find that error rates tend to be neither stationary over time nor correlated between hosts. In light of these characterization results, we evaluated state-of-the-art error detection mechanisms and describe the trade-offs of using each mechanism. Finally, based on our empirical results, we conduct a benefit analysis of a recently proposed mechanism for error detection tailored for long-running applications. This mechanism is based on using the digest of intermediate checkpoints, and we show in theory and simulation that the relative benefit of this method compared to the state-of-the-art is as high as 45%.
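The abstract above says only that the mechanism relies on digests of intermediate checkpoints. As a minimal sketch of how that could work, assume two hosts run the same task and report one digest per checkpoint; comparing the digest sequences then exposes an error at the first checkpoint where they differ, without waiting for the final result. The use of SHA-256, the two-replica comparison, and all names and data below are illustrative assumptions, not the authors' implementation.

```python
import hashlib


def checkpoint_digests(checkpoints):
    """Return the SHA-256 hex digest of each intermediate checkpoint (bytes)."""
    return [hashlib.sha256(c).hexdigest() for c in checkpoints]


def first_divergence(digests_a, digests_b):
    """Return the index of the first checkpoint whose digests differ.

    Returns None when every compared pair of digests matches.
    """
    for i, (a, b) in enumerate(zip(digests_a, digests_b)):
        if a != b:
            return i
    return None


# Hypothetical checkpoint contents from two hosts running the same task.
host_a = [b"state-0", b"state-1", b"state-2", b"state-3"]
host_b = [b"state-0", b"state-1", b"state-X", b"state-Y"]  # diverges at index 2

divergence = first_divergence(checkpoint_digests(host_a),
                              checkpoint_digests(host_b))
print(divergence)
```

Because only short digests are exchanged, a disagreement can in principle be flagged as soon as it appears, which is where a benefit for long-running applications would come from under these assumptions.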

    Enhancing reliability with Latin Square redundancy on desktop grids.

    Computational grids are some of the largest computer systems in existence today. Unfortunately, they are also, in many cases, the least reliable. This research examines the use of redundancy with permutation as a method of improving reliability in computational grid applications. Three primary avenues are explored: development of a new redundancy model for computational grids, the Replication and Permutation Paradigm (RPP); development of grid simulation software for testing RPP against other redundancy methods; and, finally, running a program on a live grid using RPP. An important part of RPP involves distributing data and tasks across the grid in Latin Square fashion. Two theorems about Latin Squares are stated and proved. The theorems describe the changing position of symbols between the rows of a standard Latin Square. When a symbol is missing because a column is removed, the theorems provide a basis for determining the next row and column where the missing symbol can be found. Interesting in their own right, the theorems have implications for redundancy. In terms of the redundancy model, the theorems allow one to state the maximum makespan in the face of missing computational hosts when using Latin Square redundancy. The simulator software was developed and used to compare different data and task distribution schemes on a simulated grid. The simulations showed the advantage of running RPP, which resulted in faster completion times in the face of computational host failures. The Latin Square method also fails gracefully, in that jobs still complete under massive node failure, at the cost of increased makespan. Finally, an Inductive Logic Program (ILP) for pharmacophore search was executed, using a Latin Square redundancy methodology, on a Condor grid in the Dahlem Lab at the University of Louisville Speed School of Engineering. All jobs completed, even in the face of large numbers of randomly generated computational host failures.
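The following sketch illustrates why a Latin Square distribution tolerates host failures. It assumes a cyclic Latin Square in which rows are scheduling rounds, columns are hosts, and symbols are tasks; this mapping and the function names are assumptions made for illustration, and the thesis's RPP model may differ. Since every symbol appears exactly once in each column, any single surviving host eventually sees every task, only later.

```python
def cyclic_latin_square(n):
    """Build an n x n cyclic Latin Square: entry (r, c) is (r + c) mod n."""
    return [[(r + c) % n for c in range(n)] for r in range(n)]


def tasks_covered(square, failed_hosts):
    """Return the set of tasks assigned to at least one surviving host.

    Rows are rounds, columns are hosts, entries are task identifiers
    (an assumed mapping, not taken from the thesis).
    """
    covered = set()
    for row in square:
        for host, task in enumerate(row):
            if host not in failed_hosts:
                covered.add(task)
    return covered


square = cyclic_latin_square(4)
# Hosts 1 and 3 fail; hosts 0 and 2 still work through every task.
covered = tasks_covered(square, failed_hosts={1, 3})
print(sorted(covered))
```

In this toy setting all tasks remain covered as long as one host survives, which mirrors the graceful degradation described above: jobs complete, but the makespan grows because the surviving hosts need more rounds.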

    Trace-Driven Simulation for Energy Consumption in High Throughput Computing Systems

    High Throughput Computing (HTC) is a powerful paradigm allowing vast quantities of independent work to be performed simultaneously. However, until recently little evaluation has been performed on the energy impact of HTC. Many organisations now seek to minimise energy consumption across their IT infrastructure, though it is unclear how this will affect the usability of HTC systems. We present here HTC-Sim, a simulation system which allows the evaluation of different energy reduction policies across an HTC system comprising a collection of computational resources dedicated to HTC work and resources provided through cycle scavenging -- a Desktop Grid. We demonstrate that our simulation software scales linearly with increasing HTC workload.

    Energy-efficient checkpointing in high-throughput cycle-stealing distributed systems

    Checkpointing is a fault-tolerance mechanism commonly used in High Throughput Computing (HTC) environments to allow the execution of long-running computational tasks on compute resources subject to hardware or software failures as well as interruptions from resource owners and more important tasks. Until recently, many researchers have focused on the performance gains achieved through checkpointing, but now, with growing scrutiny of the energy consumption of IT infrastructures, it is increasingly important to understand the energy impact of checkpointing within an HTC environment. In this paper we demonstrate through trace-driven simulation of real-world datasets that existing checkpointing strategies are inadequate at maintaining an acceptable level of energy consumption whilst maintaining the performance gains expected with checkpointing. Furthermore, we identify factors important in deciding whether to exploit checkpointing within an HTC environment, and propose novel strategies to curtail the energy consumption of checkpointing approaches whilst maintaining the performance benefits.

    High-fidelity rendering on shared computational resources

    The generation of high-fidelity imagery is a computationally expensive process, and parallel computing has traditionally been employed to alleviate this cost. However, traditional parallel rendering has been restricted to expensive shared memory or dedicated distributed processors. In contrast, parallel computing on shared resources, such as a computational or desktop grid, offers a low-cost alternative. Prevalent rendering systems, however, are currently incapable of seamlessly handling such shared resources, which suffer from high latencies, restricted bandwidth and volatility. The conventional approach of rescheduling failed jobs in a volatile environment inhibits performance by using redundant computations. Instead, clever task subdivision along with image reconstruction techniques provides an unrestrictive fault-tolerance mechanism, which is highly suitable for high-fidelity rendering. This thesis presents novel fault-tolerant parallel rendering algorithms for effectively tapping the enormous inexpensive computational power provided by shared resources. A first-of-its-kind system for fully dynamic high-fidelity interactive rendering on idle resources is presented, which is key to providing immediate feedback on the changes made by a user. The system achieves interactivity by monitoring and adapting computations according to run-time variations in the computational power and employs a spatio-temporal image reconstruction technique for enhancing the visual fidelity. Furthermore, the algorithms described for time-constrained offline rendering of still images and animation sequences make it possible to deliver results within a user-defined time limit. These novel methods enable the employment of variable resources in deadline-driven environments.

    Prediction of available computing capacities for a more efficient use of Grid resources

    Although computer hardware keeps getting faster, the available computing capacity is still insufficient for many types of problems. Especially in research and in corporate development departments, the demand for computing power is nearly unlimited. At the same time, large amounts of installed computing capacity sit idle. Such idle capacity can be found in computer pools, on office workstations, and on home PCs, which are rarely fully utilized even while they are in use. Consequently, one of the goals of grid computing is to make underutilized resources available for compute-intensive tasks. The motivation behind this idea is not primarily higher utilization as such, but a reduction of costs compared with the alternative of purchasing additional hardware. Hence, a first contribution of this thesis is the analysis and quantification of the potential cost advantage. The analysis quantifies the relevant cost factors and compares different usage scenarios. It delivers concrete figures for the costs in each scenario and thus shows the potential cost savings from using idle resources. However, realizing these savings is hampered by the variability of the available computing capacity, particularly when single machines are used to run long-running programs. The progress of a computation can be slowed down or even lost through sudden increases in resource utilization or, worse, through shutdowns or crashes. Accurate predictions of the computing capacity available in the near future could alleviate this problem. Such predictions would be useful for several purposes, for example scheduling or choosing suitable checkpoint times. For these purposes, point predictions (for example of the expected value) are of only limited use, so this thesis deals exclusively with predictions of probability distributions. First, the assessment of prediction methods that forecast probability distributions is discussed. The main problems of existing assessment criteria are identified, and more useful criteria are proposed. Second, the prediction problem itself is analyzed. Conventional methods described in the literature are examined and enhanced. The modified methods are then compared empirically with the conventional methods, using the proposed assessment criteria and data from real-world computers. The results show that one of the newly developed methods achieves clear advantages in prediction accuracy over the methods previously described in the literature.

    Efficient simulation of communication systems on a desktop grid

    Simulation is an important part of the design cycle of modern communication systems. As communication systems grow more sophisticated, the computational burden of these simulations can become excessive. The need to rapidly bring systems to market generally precludes the use of a single computer, and drives a demand for parallel computation. While this demand could be satisfied by the development of dedicated infrastructure, a more efficient option is to harness the unused computational cycles of underutilized desktop computers located throughout the organization. In this thesis, a new paradigm for parallelizing communication simulations is proposed and developed. A desktop grid is created by running a compute engine as a background job on existing computers located throughout the University. The compute engine takes advantage of unused cycles to run simulations, and reports its results back to a server. The simulation itself is developed and launched from a client machine using Matlab, an application that has widespread acceptance within the communications industry. To obviate the need for a Matlab license on every machine running the compute engine, the simulation is first compiled to stand-alone executable code, and the executable and input data files are distributed to the grid machines over the Internet. To illustrate the performance improvement, a campaign of 16 distinct simulations corresponding to the IEEE 802.11a standard is run over the grid. Each compute engine executes a single simulation corresponding to one of eight modulation and coding schemes and one of two channel models. The improvement in execution time is quantified by a tool that was developed to monitor the activity of the grid.
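The campaign of 16 simulations is the cross product of eight modulation and coding schemes with two channel models. A minimal sketch of enumerating those work units follows. The eight data rates are those defined by IEEE 802.11a; the channel model labels and the work-unit layout are hypothetical, since the abstract names neither.

```python
from itertools import product

# The eight data rates (Mbit/s) of IEEE 802.11a, one per modulation
# and coding scheme.
RATES_MBPS = [6, 9, 12, 18, 24, 36, 48, 54]

# Hypothetical labels: the abstract does not name the two channel models.
CHANNEL_MODELS = ["channel_model_1", "channel_model_2"]


def build_campaign():
    """Return one work unit per (rate, channel model) combination."""
    return [
        {"rate_mbps": rate, "channel": channel}
        for rate, channel in product(RATES_MBPS, CHANNEL_MODELS)
    ]


campaign = build_campaign()
print(len(campaign))
```

Each dictionary stands for one simulation that a single compute engine would run, so the units are fully independent and can be handed out in any order.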