Agent-based resource management for grid computing
A computational grid is a hardware and software infrastructure that provides
dependable, consistent, pervasive, and inexpensive access to high-end
computational capability. An ideal grid environment should provide access to the
available resources in a seamless manner. Resource management is an important
infrastructural component of a grid computing environment. The overall aim of
resource management is to efficiently schedule applications that need to utilise the
available resources in the grid environment. Such goals within the high
performance community will rely on accurate performance prediction capabilities.
An existing toolkit, known as PACE (Performance Analysis and Characterisation
Environment), is used to provide quantitative data concerning the performance of
sophisticated applications running on high performance resources. In this thesis an
ASCI (Accelerated Strategic Computing Initiative) kernel application, Sweep3D,
is used to illustrate the PACE performance prediction capabilities. The validation
results show that reasonable accuracy can be obtained, cross-platform
comparisons can be easily undertaken, and the process benefits from a rapid
evaluation time. While extremely well-suited for managing a locally distributed
multi-computer, the PACE functions do not map well onto a wide-area
environment, where heterogeneity, multiple administrative domains, and communication irregularities dramatically complicate the job of resource
management. Scalability and adaptability are two key challenges that must be
addressed.
In this thesis, an A4 (Agile Architecture and Autonomous Agents) methodology is
introduced for the development of large-scale distributed software systems with
highly dynamic behaviours. An agent is considered to be both a service provider
and a service requestor. Agents are organised into a hierarchy with service
advertisement and discovery capabilities. There are four main performance
metrics for an A4 system: service discovery speed, agent system efficiency,
workload balancing, and discovery success rate.
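The hierarchical advertisement-and-discovery scheme can be sketched in miniature as follows. This is an illustrative toy model only: the `Agent` class, its method names, and the push-advertisements-to-ancestors routing policy are assumptions for the sketch, not the thesis's actual A4 implementation.

```python
class Agent:
    """One node in an A4-style agent hierarchy (illustrative sketch)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.table = {}  # service name -> name of the providing agent

    def advertise(self, service, provider=None):
        # Record the service locally, then push the advertisement up the
        # hierarchy so that ancestors can route discovery requests to it.
        provider = provider or self.name
        self.table[service] = provider
        if self.parent is not None:
            self.parent.advertise(service, provider)

    def discover(self, service):
        # Check the local table first; on a miss, delegate upwards.
        if service in self.table:
            return self.table[service]
        if self.parent is not None:
            return self.parent.discover(service)
        return None  # reached the root without finding a provider


root = Agent("root")
a = Agent("a", parent=root)
b = Agent("b", parent=root)
a.advertise("sweep3d-64cpu")
print(b.discover("sweep3d-64cpu"))  # resolved via the common ancestor: 'a'
```

In this toy version, discovery speed depends directly on how far up the hierarchy a request must travel before hitting a table entry, which is one way to read the "service discovery speed" metric above.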
Coupling the A4 methodology with the PACE functions results in an Agent-based
Resource Management System (ARMS), which is implemented for grid
computing. The PACE functions supply accurate performance information (e.g.
execution time) as input to a local resource scheduler on the fly. At a meta-level,
agents advertise their service information and cooperate with each other to
discover available resources for grid-enabled applications. A Performance
Monitor and Advisor (PMA) is also developed in ARMS to optimise the
performance of the agent behaviours.
The PMA is capable of performance modelling and simulation of the agents in
ARMS and can be used to improve overall system performance. The PMA can
monitor agent behaviours in ARMS and reconfigure them with optimised
strategies, which include the use of ACTs (Agent Capability Tables), limited
service lifetime, limited scope for service advertisement and discovery, agent
mobility, and service distribution.
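A minimal sketch of one of these strategies, an ACT with a limited service lifetime, might look like the following. The class and method names are hypothetical and the real ARMS data structures are certainly richer; the sketch only shows the idea that cached capability entries expire, which bounds how stale advertised service information can become.

```python
class ACT:
    """Illustrative Agent Capability Table with limited service lifetime:
    cached entries expire after `lifetime` seconds, bounding staleness."""

    def __init__(self, lifetime=60.0):
        self.lifetime = lifetime
        self.entries = {}  # service name -> (provider, time recorded)

    def record(self, service, provider, now):
        self.entries[service] = (provider, now)

    def lookup(self, service, now):
        hit = self.entries.get(service)
        if hit is not None and now - hit[1] <= self.lifetime:
            return hit[0]
        self.entries.pop(service, None)  # drop the expired entry
        return None


act = ACT(lifetime=60.0)
act.record("origin2000-queue", "agent-7", now=0.0)
print(act.lookup("origin2000-queue", now=30.0))   # agent-7 (still fresh)
print(act.lookup("origin2000-queue", now=120.0))  # None (lifetime exceeded)
```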
The main contribution of this work is that it provides a methodology and
prototype implementation of a grid Resource Management System (RMS). The
system includes a number of original features that cannot be found in existing
research solutions.
Load Balancing Analysis of a Parallel Hierarchical Algorithm on the Origin2000
Conference paper with proceedings, no review committee. The ccNUMA architecture of the SGI Origin2000 has been shown to perform and scale for a wide range of scientific and engineering applications. This paper focuses on a well-known hierarchical computer graphics algorithm, wavelet radiosity, whose parallelization is made challenging by its irregular, dynamic, and unpredictable characteristics. Our previous experiments, based on a naive parallelization, showed that the Origin2000's hierarchical memory structure was well suited to the natural data locality exhibited by this hierarchical algorithm. However, our crude load balancing strategy was clearly insufficient to exploit the full power of the Origin2000. We present here a fine-grained load balancing analysis and then propose several enhancements, namely "lazy copy" and "lure", that greatly reduce the idle time spent in locks and synchronization barriers. The new parallel algorithm is evaluated on a 64-processor Origin2000. Although a communication overhead is introduced in theory, we show that data locality is still preserved. The final performance evaluation shows near-optimal behavior up to the 32-processor scale; a remaining trouble spot must be identified to explain the performance degradation observed at the 64-processor scale.
Overlapping Multi-Processing and Graphics Hardware Acceleration: Performance Evaluation
Peer-reviewed conference paper with proceedings. Recently, multi-processing has been shown to deliver good performance in rendering. However, in some applications, processors spend too much time executing tasks that could be done more efficiently through intensive use of new graphics hardware. We present in this paper a novel solution combining multi-processing and advanced graphics hardware, where graphics pipelines are used both for classical visualization tasks and to advantageously perform geometric calculations, while the remaining computations are handled by the multi-processors. The experiment is based on an implementation of a new parallel wavelet radiosity algorithm. The application is executed on an SGI Origin2000 connected to an SGI InfiniteReality2 rendering pipeline. A performance evaluation is presented. Keeping in mind that the approach can benefit all available workstations and super-computers, from small scale (2 processors and 1 graphics pipeline) to large scale (many processors and graphics pipelines), we highlight some important bottlenecks that impede performance. However, our results show that this approach could be a promising avenue for scientific and engineering simulation and visualization applications that need intensive geometric calculations.
Performance Modeling and Prediction for the Scalable Solution of Partial Differential Equations on Unstructured Grids
This dissertation studies the sources of poor performance in scientific computing codes based on partial differential equations (PDEs), which typically perform at a computational rate well below other scientific simulations (e.g., those with dense linear algebra or N-body kernels) on modern architectures with deep memory hierarchies. We identify that the primary factors responsible for this relatively poor performance are: insufficient available memory bandwidth, low ratio of work to data size (good algorithmic efficiency), and nonscaling cost of synchronization and gather/scatter operations (for a fixed problem size scaling). This dissertation also illustrates how to reuse the legacy scientific and engineering software within a library framework.
Specifically, a three-dimensional unstructured grid incompressible Euler code from NASA has been parallelized with the Portable Extensible Toolkit for Scientific Computing (PETSc) library for distributed memory architectures. Using this newly instrumented code (called PETSc-FUN3D) as an example of a typical PDE solver, we demonstrate some strategies that are effective in tolerating the latencies arising from the hierarchical memory system and the network. Even on a single processor from each of the major contemporary architectural families, the PETSc-FUN3D code runs from 2.5 to 7.5 times faster than the legacy code on a medium-sized data set (with approximately 10^5 degrees of freedom). The major source of performance improvement is the increased locality in data reference patterns achieved through blocking, interlacing, and edge reordering. To explain these performance gains, we provide simple performance models based on memory bandwidth and instruction issue rates.
Experimental evidence, in terms of translation lookaside buffer (TLB) and data cache miss rates, achieved memory bandwidth, and graduated floating point instructions per memory reference, is provided through accurate measurements with hardware counters. The performance models and experimental results motivate algorithmic and software practices that lead to improvements in both parallel scalability and per-node performance. We identify the bottlenecks to scalability (algorithmic as well as implementation) for a fixed-size problem when the number of processors grows to several thousands (the expected level of concurrency on terascale architectures). We also evaluate the hybrid programming model (mixed distributed/shared) from a performance standpoint
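The kind of simple model the abstract refers to can be sketched as a two-term bound: execution time can be no smaller than what either the instruction-issue limit or the memory-bandwidth limit allows, and whichever term dominates identifies the bottleneck. The function name and the sample numbers below are illustrative assumptions, not the dissertation's actual model.

```python
def predicted_time(flops, bytes_moved, peak_flop_rate, peak_bandwidth):
    """Lower-bound execution time (seconds): the code can go no faster
    than the issue limit or the memory-bandwidth limit permits."""
    return max(flops / peak_flop_rate, bytes_moved / peak_bandwidth)


# Hypothetical sparse kernel: 2e9 flops moving 12e9 bytes on a machine
# with 1 Gflop/s issue rate and 1 GB/s sustained bandwidth.  The memory
# term (12 s) dominates the compute term (2 s): the kernel is
# bandwidth-bound, which is the typical situation for PDE codes.
print(predicted_time(2e9, 12e9, 1e9, 1e9))  # 12.0
```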
Scalable implementations of MPI atomicity for concurrent overlapping I/O
For concurrent I/O operations, atomicity defines the results in the overlapping file regions simultaneously read/written by requesting processes. Atomicity has been well studied at the file system level, for example in the POSIX standard. In this paper, we investigate the problems arising from the implementation of MPI atomicity for concurrent overlapping write access and provide a few programming solutions. Since the MPI definition of atomicity differs from the POSIX one, an implementation that simply relies on POSIX file systems does not guarantee correct MPI semantics. To obtain a correct implementation of atomic I/O in MPI, we examine the efficiency of three approaches: 1) file locking, 2) graph-coloring, and 3) process-rank ordering. The performance complexity of these methods is analyzed, and experimental results are presented for file systems including NFS, SGI's XFS, and IBM's GPFS.
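The graph-coloring idea can be sketched as follows: build a conflict graph over processes whose file regions overlap, greedily color it, and serialize writes by color so that only non-conflicting writers proceed concurrently. This toy sketch (the function names and the half-open byte-range convention are assumptions) omits all of the actual MPI machinery.

```python
def overlaps(a, b):
    """Do two half-open byte ranges [start, end) intersect?"""
    return a[0] < b[1] and b[0] < a[1]


def color_writers(regions):
    """Greedily color processes whose file regions overlap.

    regions[i] is the byte range written by rank i.  Ranks that share a
    color have disjoint regions and may write concurrently; distinct
    colors are serialized as successive write rounds.
    """
    colors = [None] * len(regions)
    for i in range(len(regions)):
        taken = {colors[j] for j in range(i) if overlaps(regions[i], regions[j])}
        c = 0
        while c in taken:
            c += 1
        colors[i] = c
    return colors


# Ranks 0 and 1 overlap, so they land in different rounds; rank 2 is
# disjoint from both and can share round 0 with rank 0.
print(color_writers([(0, 100), (50, 150), (200, 300)]))  # [0, 1, 0]
```

Greedy coloring is not optimal in general, but for the interval-overlap graphs arising from file regions it keeps the number of serialized rounds small.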
HPCCP/CAS Workshop Proceedings 1998
This publication is a collection of extended abstracts of presentations given at the HPCCP/CAS (High Performance Computing and Communications Program/Computational Aerosciences Project) Workshop held on August 24-26, 1998, at NASA Ames Research Center, Moffett Field, California. The objective of the Workshop was to bring together the aerospace high performance computing community, consisting of airframe and propulsion companies, independent software vendors, university researchers, and government scientists and engineers. The Workshop was sponsored by the HPCCP Office at NASA Ames Research Center. The Workshop consisted of over 40 presentations, including an overview of NASA's High Performance Computing and Communications Program and the Computational Aerosciences Project; ten sessions of papers representative of the high performance computing research conducted within the Program by the aerospace industry, academia, NASA, and other government laboratories; two panel sessions; and a special presentation by Mr. James Bailey
Development and Validation of a Hierarchical Memory Model Incorporating CPU- and Memory-Operation Overlap
Distributed shared memory architectures (DSMs) such as the Origin 2000 extend the concept of single-processor cache hierarchies across an entire physically distributed multiprocessor machine. The scalability of a DSM machine is inherently tied to memory hierarchy performance, including such issues as latency hiding techniques in the architecture, global cache-coherence protocols, memory consistency models and, of course, the inherent locality of reference in algorithms of interest. In this paper, we characterize application performance with a "memory-centric" view. Using a simple mean value analysis (MVA) strategy and empirical performance data, we infer the contribution of each level in the memory system to the application's overall cycles per instruction (cpi). We account for the overlap of processor execution with memory accesses, a key parameter which is not directly measurable on the Origin systems. We infer the separate contributions of three major architecture features in the memory subsystem of the Origin 2000: cache size, outstanding loads-under-miss, and memory latency.
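The mean value analysis described above amounts, in its simplest additive form, to summing per-level stall contributions and discounting them by the fraction of memory latency hidden behind computation. The following is a minimal sketch under that assumed additive model; the function signature and the sample numbers are illustrative, not the paper's calibrated model.

```python
def cpi(cpi_core, levels, overlap):
    """Memory-centric cpi estimate (assumed additive model).

    cpi_core: cycles per instruction with a perfect memory system.
    levels:   list of (misses_per_instruction, latency_cycles), one
              entry per memory hierarchy level.
    overlap:  fraction of memory latency hidden by concurrent
              processor execution (0 = no hiding, 1 = fully hidden).
    """
    stall = sum(mpi * lat for mpi, lat in levels)
    return cpi_core + (1.0 - overlap) * stall


# Hypothetical numbers: 2% L2 misses at 10 cycles, 0.5% local-memory
# accesses at 60 cycles, half of the latency overlapped with work.
print(cpi(1.0, [(0.02, 10), (0.005, 60)], overlap=0.5))  # 1.25
```

Working backwards from a measured cpi and hardware-counter miss rates to the unknown `overlap` is the inference step the paper describes.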
Rouse Chains with Excluded Volume Interactions: Linear Viscoelasticity
Linear viscoelastic properties for a dilute polymer solution are predicted by
modeling the solution as a suspension of non-interacting bead-spring chains.
The present model, unlike the Rouse model, can describe the solution's
rheological behavior even when the solvent quality is good, since excluded
volume effects are explicitly taken into account through a narrow Gaussian
repulsive potential between pairs of beads in a bead-spring chain. The use of
the narrow Gaussian potential, which tends to the more commonly used
delta-function repulsive potential in the limit of a width parameter "d" going
to zero, enables the performance of Brownian dynamics simulations. The
simulation results, which describe the exact behavior of the model, indicate
that for chains of arbitrary but finite length, a delta-function potential
leads to equilibrium and zero shear rate properties which are identical to the
predictions of the Rouse model. On the other hand, a non-zero value of "d"
gives rise to a prediction of swelling at equilibrium, and an increase in zero
shear rate properties relative to their Rouse model values. The use of a
delta-function potential appears to be justified in the limit of infinite chain
length. The exact simulation results are compared with those obtained with an
approximate solution which is based on the assumption that the non-equilibrium
configurational distribution function is Gaussian. The Gaussian approximation
is shown to be exact to first order in the strength of excluded volume
interaction, and is found to be accurate above a threshold value of "d", for
given values of chain length and strength of excluded volume interaction.
Comment: Revised version. The long chain limit analysis has been deleted; an
improved and corrected examination of the long chain limit will appear as a
separate posting. 32 pages, 9 postscript figures, LaTeX.
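The narrow Gaussian potential described above can be written explicitly. The following is one common way to express such a potential, with v an excluded-volume parameter; the precise prefactor and symbol choices here are an assumption for illustration, not a quotation from the paper:

```latex
E(r) \;=\; v\, k_{B} T \,\frac{1}{(2\pi d^{2})^{3/2}}
        \exp\!\left(-\frac{r^{2}}{2 d^{2}}\right)
```

As the width parameter d tends to zero, the Gaussian factor tends to the Dirac delta δ(r), so the potential reduces to the delta-function repulsive potential v k_B T δ(r) referred to in the abstract.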
Empirical and Statistical Application Modeling Using On-Chip Performance Monitors
To analyze the performance of applications and architectures, both programmers and architects desire formal methods to explain anomalous behavior. To this end, we present various methods that utilize non-intrusive performance-monitoring hardware, only recently available on microprocessors, to provide further explanations of observed behavior. All the methods attempt to characterize and explain the instruction-level parallelism achieved by codes on different architectures. We also present a prototype tool automating the analysis process to exploit the advantages of the empirical and statistical methods proposed. The empirical, statistical, and hybrid methods are discussed and explained, with case study results provided. The given methods add to the wealth of tools available to programmers and architects for understanding the performance of scientific applications. Specifically, the models and tools presented provide new methods for evaluating and categorizing application performance. The empirical memory model serves to quantify the hierarchical memory performance of applications by inferring the incurred latencies of codes after the effects of latency hiding techniques are realized. The instruction-level model and its extensions model on-chip performance analytically, giving insight into inherent performance bottlenecks in superscalar architectures. The statistical model and its hybrid extension provide other methods of categorizing codes via their statistical variations. The PTERA performance tool automates the use of performance counters across platforms, making the modeling process easier still. These unique methods provide alternatives to performance modeling and categorizing not previously available, in an attempt to utilize the inherent modeling capabilities of performance monitors on commodity processors for scientific applications.