Master/worker parallel discrete event simulation
The execution of parallel discrete event simulation across metacomputing infrastructures is examined. A master/worker architecture for parallel discrete event simulation is proposed that provides robust execution over a dynamic set of services, with system-level support for fault tolerance, semi-automated client-directed load balancing, portability across heterogeneous machines, and the ability to run codes on idle or time-sharing clients without significant user interaction. Research questions and challenges associated with the limitations of the work distribution paradigm, the targeted computational domain, performance metrics, and the intended class of applications are analyzed and discussed. A portable web services approach to master/worker parallel discrete event simulation is proposed and evaluated, with subsequent optimizations that increase the efficiency of large-scale simulation execution through distributed master service design and intrinsic overhead reduction. New techniques are proposed and examined for the challenges specific to optimistic parallel discrete event simulation in this setting, such as rollbacks and message unsending, using an inherently different computation paradigm built on master services and time windows. Results indicate that a master/worker approach utilizing loosely coupled resources is a viable means of high-throughput parallel discrete event simulation, either enhancing existing computational capacity or providing an alternate execution capability for less time-critical codes.
Ph.D. Committee Chair: Fujimoto, Richard; Committee Member: Bader, David; Committee Member: Perumalla, Kalyan; Committee Member: Riley, George; Committee Member: Vuduc, Richard
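To make the time-window idea concrete, here is a minimal Python sketch of window-based master/worker event processing. For brevity it shows a conservative variant (events generated inside a window are scheduled at least one full window ahead, so nothing ever arrives in a worker's past), whereas the thesis addresses the optimistic case with rollbacks and message unsending; all names here are illustrative, not the thesis's API.

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    timestamp: float
    payload: str = field(compare=False)   # order events by timestamp only

def process_window(events, window_end):
    # Worker role: consume every pending event before window_end and return
    # the follow-on events it generates. The model's lookahead (a fixed +10.0
    # delay, at least one window) guarantees generated events land in a
    # future window, so no worker ever receives an event in its past.
    done, generated = [], []
    while events and events[0].timestamp < window_end:
        ev = heapq.heappop(events)
        done.append(ev)
        generated.append(Event(ev.timestamp + 10.0, ev.payload))
    return done, generated

def master(pending, window_size=10.0, horizon=30.0):
    # Master role: advance global virtual time (GVT) one window at a time,
    # collecting results and newly generated events from the workers.
    heapq.heapify(pending)
    gvt = 0.0
    while gvt < horizon and pending:
        done, generated = process_window(pending, gvt + window_size)
        for ev in generated:
            heapq.heappush(pending, ev)
        print(f"window [{gvt:.0f}, {gvt + window_size:.0f}): {len(done)} events processed")
        gvt += window_size

master([Event(0.5, "init"), Event(4.0, "arrival"), Event(12.0, "departure")])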
Many-Task Computing and Blue Waters
This report discusses many-task computing (MTC) generically and in the
context of the proposed Blue Waters system, which is planned to be the largest
NSF-funded supercomputer when it begins production use in 2012. The aim of this
report is to inform the BW project about MTC, including understanding aspects
of MTC applications that can be used to characterize the domain and
understanding the implications of these aspects for middleware and policies.
Many MTC applications do not neatly fit the stereotypes of high-performance
computing (HPC) or high-throughput computing (HTC) applications. Like HTC
applications, by definition MTC applications are structured as graphs of
discrete tasks, with explicit input and output dependencies forming the graph
edges. However, MTC applications have significant features that distinguish
them from typical HTC applications. In particular, different engineering
constraints for hardware and software must be met in order to support these
applications. HTC applications have traditionally run on platforms such as
grids and clusters, through either workflow systems or parallel programming
systems. MTC applications, in contrast, will often demand a short time to
solution, may be communication intensive or data intensive, and may comprise
very short tasks. Therefore, hardware and software for MTC must be engineered
to support the additional communication and I/O and must minimize task dispatch
overheads. The hardware of large-scale HPC systems, with its high degree of
parallelism and support for intensive communication, is well suited for MTC
applications. However, HPC systems often lack a dynamic resource-provisioning
feature, are not ideal for task communication via the file system, and have an
I/O system that is not optimized for MTC-style applications. Hence, additional
software support is likely to be required to gain full benefit from the HPC
hardware.
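As a concrete illustration of the task-graph structure described above, the following minimal Python sketch dispatches tasks as their input dependencies complete (Kahn's algorithm). It runs tasks serially in one process and is not Blue Waters middleware; a real MTC dispatcher would run ready tasks in parallel across many nodes while minimizing per-task dispatch overhead.

from collections import defaultdict, deque

def run_task_graph(tasks, deps):
    # tasks: {name: callable}; deps: {name: [names it depends on]}.
    # Executes every task in dependency order.
    indegree = {t: len(deps.get(t, [])) for t in tasks}
    dependents = defaultdict(list)
    for t, inputs in deps.items():
        for d in inputs:
            dependents[d].append(t)
    ready = deque(t for t, n in indegree.items() if n == 0)
    while ready:
        t = ready.popleft()   # an MTC dispatcher would fan these out to nodes
        tasks[t]()
        for nxt in dependents[t]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

run_task_graph(
    tasks={n: (lambda n=n: print("ran", n)) for n in "ABCD"},
    deps={"C": ["A", "B"], "D": ["C"]},   # edges = explicit I/O dependencies
)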
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data, which has called for a paradigm shift in
computing architecture and large-scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling, and fault tolerance. However, the original implementation of the
MapReduce framework had limitations that have been tackled by many research
efforts in follow-up work since its introduction. This article provides a
comprehensive survey of a family of approaches and mechanisms for large-scale
data processing that build on the original idea of the MapReduce framework and
are currently gaining momentum in both the research and industrial communities.
We also cover systems that provide declarative programming interfaces on top
of the MapReduce framework. In addition, we
review several large-scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
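For readers unfamiliar with the model the survey builds on, the following minimal in-process Python sketch shows the user-visible contract of MapReduce: a map function emitting key/value pairs, a framework-provided shuffle that groups values by key, and a reduce function aggregating each group. It is illustrative only; real frameworks additionally partition data, distribute tasks across a cluster, and re-execute failed tasks.

from collections import defaultdict

def map_fn(doc):                       # map: document -> (word, 1) pairs
    for word in doc.split():
        yield word.lower(), 1

def reduce_fn(word, counts):           # reduce: (word, [1, 1, ...]) -> total
    return word, sum(counts)

def mapreduce(docs, map_fn, reduce_fn):
    groups = defaultdict(list)         # the "shuffle": group values by key
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(mapreduce(["the quick fox", "the lazy dog"], map_fn, reduce_fn))
# -> {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}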
LHCb distributed data analysis on the computing grid
LHCb is one of the four Large Hadron Collider (LHC) experiments based at CERN, the European Organisation for Nuclear Research. The LHC experiments will start taking an unprecedented amount of data when they come online in 2007. Since no single institute has the compute resources to handle this data, resources must be pooled to form the Grid. Where the Internet has made it possible to share information stored on computers across the world, Grid computing aims to provide access to computing power and storage capacity on geographically distributed systems. LHCb software applications must work seamlessly on the Grid allowing users to efficiently access distributed compute resources. It is essential to the success of the LHCb experiment that physicists can access data from the detector, stored in many heterogeneous systems, to perform distributed data analysis. This thesis describes the work performed to enable distributed data analysis for the LHCb experiment on the LHC Computing Grid
The Lattice Project: A Multi-model Grid Computing System
This thesis presents The Lattice Project, a system that combines multiple models of Grid computing. Grid computing is a paradigm for leveraging multiple distributed computational resources to solve fundamental scientific problems that require large amounts of computation. The system combines the traditional Service model of Grid computing with the Desktop model of Grid computing, and is thus capable of utilizing diverse resources such as institutional desktop computers, dedicated computing clusters, and machines volunteered by the general public to advance science. The production Grid system includes a fully-featured user interface, support for a large number of popular scientific applications, a robust Grid-level scheduler, and novel enhancements such as a Grid-wide file caching scheme. A substantial amount of scientific research has already been completed using The Lattice Project
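A minimal sketch of the multi-model routing idea follows, assuming a simple deadline-based policy; the policy, field names, and backend labels are hypothetical illustrations, not The Lattice Project's actual Grid-level scheduler.

def schedule(job, cluster_free_slots):
    # Route latency-sensitive work to dedicated (Service-model) clusters and
    # large, loosely coupled batches to volunteered (Desktop-model) machines.
    if job["deadline_hours"] < 24 and cluster_free_slots > 0:
        return "service-grid"    # dedicated cluster: predictable turnaround
    return "desktop-grid"        # volunteer desktops: high aggregate throughput

print(schedule({"name": "phylo-batch-17", "deadline_hours": 72},
               cluster_free_slots=4))   # -> desktop-grid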
Volunteer Computing on Distributed Untrusted Nodes
The growth in size and complexity of new software systems has highlighted the need for more efficient and faster build tools. Current research relies on automating and parallelizing tasks by dividing software systems into groups of interdependent packages. Some modern build systems, such as Open Build Service (OBS), centralize source commits and dependency resolution for Linux distributions. They then distribute the heavy build tasks among several build hosts and finally deliver the results to the community.
The problem with these build services is that, as they are usually supported by non-commercial communities, the resources available to maintain the build hosts are limited. Because of this, the idea of distributing the jobs among additional build hosts owned by volunteers is tempting. However, carrying out this idea brings new challenges and problems to solve concerning the resulting pool of untrusted, unreliable workers.
This thesis studies how the concept of volunteer computing can be applied to software package building, specifically to OBS. In the first part, existing volunteer computing platforms are examined, surveying the current research and the pros and cons of using them for our purposes.
The research in this thesis led to a different solution called the Volunteer Worker System (VWS). The main concept is a centralized system that presents OBS with reliable, trusted workers that assemble the results sent by the volunteers. Each worker acts as a proxy between the untrusted volunteers and the OBS server itself, validating the results obtained through multiple cross-checks. Volunteers from the pool are grouped to serve each surrogate worker depending on OBS needs.
A simple proof of concept of the designed system was set up in a distributed network environment. A host acting as the Volunteer System groups and dispatches jobs coming from a host simulating the OBS server to several volunteer workers on separate hosts. These volunteers send their results back to the Volunteer System, which validates and forwards them to the OBS server.
Ensuring the security of the designed solution is one of the requirements for deploying the system in a real environment. The OBS instance receiving the volunteers' work needs to be sure that the Volunteer System offering it is fully trusted. A complete front-end system to attract and retain volunteers also needs to be implemented.
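A minimal Python sketch of the cross-checking idea described above, assuming replicated dispatch and hash-based majority voting; the names and the quorum policy are hypothetical, not the VWS implementation.

from collections import Counter
import hashlib

def artifact_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def validate(results, quorum):
    # results: build artifacts returned by the volunteers assigned to one job.
    # Accept an artifact only when a quorum of volunteers agree on its hash.
    if not results:
        return None
    votes = Counter(artifact_hash(r) for r in results)
    winner, count = votes.most_common(1)[0]
    if count >= quorum:
        return next(r for r in results if artifact_hash(r) == winner)
    return None   # no quorum: reschedule the job, flag disagreeing volunteers

# Three volunteers, one faulty; a quorum of 2 still accepts the build.
good, bad = b"binary-v1", b"corrupted"
print(validate([good, good, bad], quorum=2))   # -> b'binary-v1'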
Support for flexible and transparent distributed computing
Modern distributed computing developed from the traditional supercomputing community, rooted firmly
in the culture of batch management. Therefore, the field has been dominated by queuing-based resource
managers and workflow-based job submission environments, where static resource demands needed to be
determined and reserved prior to launching executions. This has made it difficult to support resource
environments (e.g. Grid, Cloud) where the available resources as well as the resource requirements
of applications may be both dynamic and unpredictable. This thesis introduces a flexible execution
model where the compute capacity can be adapted to fit the needs of applications as they change during
execution. Resource provision in this model is based on a fine-grained, self-service approach instead
of the traditional one-time, system-level model. The thesis introduces a middleware-based Application
Agent (AA) that provides a platform for the applications to dynamically interact and negotiate resources
with the underlying resource infrastructure.
We also consider the issue of transparency, i.e., hiding the provision and management of the distributed
environment. This is the key to attracting the public to the technology. The AA not only replaces the
user-controlled process of preparing and executing an application with a transparent, software-controlled
process; it also hides the complexity of selecting the right resources to ensure execution QoS. This service
is provided by an On-line Feedback-based Automatic Resource Configuration (OAC) mechanism cooperating
with the flexible execution model. The AA constantly monitors utility-based feedback from the
application during execution and thus is able to learn its behaviour and resource characteristics. This
allows it to automatically compose the most efficient execution environment on the fly and satisfy any
execution requirements defined by users. Two policies are introduced to supervise the information learning
and resource tuning in the OAC. The Utility Classification policy classifies hosts according to their
historical performance contributions to the application. According to this classification, the AA chooses
high utility hosts and withdraws low utility hosts to configure an optimum environment. The Desired
Processing Power Estimation (DPPE) policy dynamically configures the execution environment according
to the estimated desired total processing power needed to satisfy users’ execution requirements.
Through the introduction of flexibility and transparency, a user is able to run a dynamic or normal
distributed application anywhere with optimised execution performance, without managing distributed
resources. Based on the standalone model, the thesis further introduces a federated resource negotiation
framework as a step towards an autonomous, multi-user distributed computing world.
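As a rough illustration of the Utility Classification policy, the following Python sketch ranks hosts by their historical utility feedback and withdraws the low-utility tail; the keep-fraction threshold and data shapes are assumptions for illustration, not the thesis's actual OAC parameters.

def reconfigure(hosts, keep_fraction=0.75):
    # hosts: {name: mean utility feedback observed during execution}.
    # Keep the high-utility hosts, withdraw the rest, as the AA would when
    # composing a more efficient execution environment on the fly.
    ranked = sorted(hosts, key=hosts.get, reverse=True)
    cut = max(1, int(len(ranked) * keep_fraction))
    return set(ranked[:cut]), set(ranked[cut:])

kept, withdrawn = reconfigure({"n1": 0.92, "n2": 0.35, "n3": 0.78, "n4": 0.10})
print("keep:", kept, "withdraw:", withdrawn)   # withdraws the low-utility n4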