Agent-based resource management for grid computing
A computational grid is a hardware and software infrastructure that provides
dependable, consistent, pervasive, and inexpensive access to high-end
computational capability. An ideal grid environment should provide access to the
available resources in a seamless manner. Resource management is an important
infrastructural component of a grid computing environment. The overall aim of
resource management is to efficiently schedule applications that need to utilise the
available resources in the grid environment. Such goals within the high
performance community will rely on accurate performance prediction capabilities.
An existing toolkit, known as PACE (Performance Analysis and Characterisation
Environment), is used to provide quantitative data concerning the performance of
sophisticated applications running on high performance resources. In this thesis an
ASCI (Accelerated Strategic Computing Initiative) kernel application, Sweep3D,
is used to illustrate the PACE performance prediction capabilities. The validation
results show that reasonable accuracy can be obtained, cross-platform
comparisons can be easily undertaken, and the process benefits from a rapid
evaluation time. While extremely well-suited for managing a locally distributed
multi-computer, the PACE functions do not map well onto a wide-area
environment, where heterogeneity, multiple administrative domains, and communication irregularities dramatically complicate the job of resource
management. Scalability and adaptability are two key challenges that must be
addressed.
In this thesis, an A4 (Agile Architecture and Autonomous Agents) methodology is
introduced for the development of large-scale distributed software systems with
highly dynamic behaviours. An agent is considered to be both a service provider
and a service requestor. Agents are organised into a hierarchy with service
advertisement and discovery capabilities. There are four main performance
metrics for an A4 system: service discovery speed, agent system efficiency,
workload balancing, and discovery success rate.
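The hierarchical advertisement-and-discovery scheme can be sketched in miniature as follows. This is an illustrative toy model only: the `Agent` class, its method names, and the push-advertisements-to-ancestors routing policy are assumptions for the sketch, not the thesis's actual A4 implementation.

```python
class Agent:
    """One node in an A4-style agent hierarchy (illustrative sketch)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.table = {}  # service name -> name of the providing agent

    def advertise(self, service, provider=None):
        # Record the service locally, then push the advertisement up the
        # hierarchy so that ancestors can route discovery requests to it.
        provider = provider or self.name
        self.table[service] = provider
        if self.parent is not None:
            self.parent.advertise(service, provider)

    def discover(self, service):
        # Check the local table first; on a miss, delegate upwards.
        if service in self.table:
            return self.table[service]
        if self.parent is not None:
            return self.parent.discover(service)
        return None  # reached the root without finding a provider


root = Agent("root")
a = Agent("a", parent=root)
b = Agent("b", parent=root)
a.advertise("sweep3d-64cpu")
print(b.discover("sweep3d-64cpu"))  # resolved via the common ancestor: 'a'
```

In this toy version, discovery speed depends directly on how far up the hierarchy a request must travel before hitting a table entry, which is one way to read the "service discovery speed" metric above.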
Coupling the A4 methodology with the PACE functions results in an Agent-based
Resource Management System (ARMS), which is implemented for grid
computing. The PACE functions supply accurate performance information (e.g.
execution time) as input to a local resource scheduler on the fly. At a meta-level,
agents advertise their service information and cooperate with each other to
discover available resources for grid-enabled applications. A Performance
Monitor and Advisor (PMA) is also developed in ARMS to optimise the
performance of the agent behaviours.
The PMA is capable of performance modelling and simulation of the agents in
ARMS and can be used to improve overall system performance. The PMA can
monitor agent behaviours in ARMS and reconfigure them with optimised
strategies, which include the use of ACTs (Agent Capability Tables), limited
service lifetime, limited scope for service advertisement and discovery, agent
mobility, and service distribution.
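A minimal sketch of one of these strategies, an ACT with a limited service lifetime, might look like the following. The class and method names are hypothetical and the real ARMS data structures are certainly richer; the sketch only shows the idea that cached capability entries expire, which bounds how stale advertised service information can become.

```python
class ACT:
    """Illustrative Agent Capability Table with limited service lifetime:
    cached entries expire after `lifetime` seconds, bounding staleness."""

    def __init__(self, lifetime=60.0):
        self.lifetime = lifetime
        self.entries = {}  # service name -> (provider, time recorded)

    def record(self, service, provider, now):
        self.entries[service] = (provider, now)

    def lookup(self, service, now):
        hit = self.entries.get(service)
        if hit is not None and now - hit[1] <= self.lifetime:
            return hit[0]
        self.entries.pop(service, None)  # drop the expired entry
        return None


act = ACT(lifetime=60.0)
act.record("origin2000-queue", "agent-7", now=0.0)
print(act.lookup("origin2000-queue", now=30.0))   # agent-7 (still fresh)
print(act.lookup("origin2000-queue", now=120.0))  # None (lifetime exceeded)
```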
The main contribution of this work is that it provides a methodology and
prototype implementation of a grid Resource Management System (RMS). The
system includes a number of original features that cannot be found in existing
research solutions.
Load Balancing Analysis of a Parallel Hierarchical Algorithm on the Origin2000
Conference paper with proceedings, no review committee. The ccNUMA architecture of the SGI Origin2000 has been shown to perform and scale for a wide range of scientific and engineering applications. This paper focuses on a well-known hierarchical computer graphics algorithm, wavelet radiosity, whose parallelization is made challenging by its irregular, dynamic, and unpredictable characteristics. Our previous experiments, based on a naive parallelization, showed that the Origin2000's hierarchical memory structure was well suited to the natural data locality exhibited by this hierarchical algorithm. However, our crude load balancing strategy was clearly insufficient to exploit the full power of the Origin2000. We present here a fine-grained load balancing analysis and then propose several enhancements, namely "lazy copy" and "lure", that greatly reduce the idle time spent in locks and synchronization barriers. The new parallel algorithm is evaluated on a 64-processor Origin2000. Although a communication overhead is introduced in theory, we show that data locality is still preserved. The final performance evaluation shows near-optimal behavior up to the 32-processor scale; a remaining trouble spot must be identified to explain the performance degradation observed at the 64-processor scale.
Overlapping Multi-Processing and Graphics Hardware Acceleration: Performance Evaluation
Peer-reviewed conference paper with proceedings. Recently, multi-processing has been shown to deliver good performance in rendering. However, in some applications, processors spend too much time executing tasks that could be done more efficiently through intensive use of new graphics hardware. We present in this paper a novel solution combining multi-processing and advanced graphics hardware, where graphics pipelines are used both for classical visualization tasks and to advantageously perform geometric calculations, while the remaining computations are handled by the multi-processors. The experiment is based on an implementation of a new parallel wavelet radiosity algorithm. The application is executed on an SGI Origin2000 connected to an SGI InfiniteReality2 rendering pipeline. A performance evaluation is presented. Keeping in mind that the approach can benefit all available workstations and super-computers, from small scale (2 processors and 1 graphics pipeline) to large scale (many processors and graphics pipelines), we highlight some important bottlenecks that impede performance. However, our results show that this approach could be a promising avenue for scientific and engineering simulation and visualization applications that need intensive geometric calculations.
Performance Modeling and Prediction for the Scalable Solution of Partial Differential Equations on Unstructured Grids
This dissertation studies the sources of poor performance in scientific computing codes based on partial differential equations (PDEs), which typically perform at a computational rate well below other scientific simulations (e.g., those with dense linear algebra or N-body kernels) on modern architectures with deep memory hierarchies. We identify that the primary factors responsible for this relatively poor performance are: insufficient available memory bandwidth, low ratio of work to data size (good algorithmic efficiency), and nonscaling cost of synchronization and gather/scatter operations (for a fixed problem size scaling). This dissertation also illustrates how to reuse the legacy scientific and engineering software within a library framework.
Specifically, a three-dimensional unstructured grid incompressible Euler code from NASA has been parallelized with the Portable Extensible Toolkit for Scientific Computing (PETSc) library for distributed memory architectures. Using this newly instrumented code (called PETSc-FUN3D) as an example of a typical PDE solver, we demonstrate some strategies that are effective in tolerating the latencies arising from the hierarchical memory system and the network. Even on a single processor from each of the major contemporary architectural families, the PETSc-FUN3D code runs from 2.5 to 7.5 times faster than the legacy code on a medium-sized data set (with approximately 10^5 degrees of freedom). The major source of performance improvement is the increased locality in data reference patterns achieved through blocking, interlacing, and edge reordering. To explain these performance gains, we provide simple performance models based on memory bandwidth and instruction issue rates.
Experimental evidence, in terms of translation lookaside buffer (TLB) and data cache miss rates, achieved memory bandwidth, and graduated floating point instructions per memory reference, is provided through accurate measurements with hardware counters. The performance models and experimental results motivate algorithmic and software practices that lead to improvements in both parallel scalability and per-node performance. We identify the bottlenecks to scalability (algorithmic as well as implementation) for a fixed-size problem when the number of processors grows to several thousands (the expected level of concurrency on terascale architectures). We also evaluate the hybrid programming model (mixed distributed/shared) from a performance standpoint
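The kind of simple model the abstract refers to can be sketched as a two-term bound: execution time can be no smaller than what either the instruction-issue limit or the memory-bandwidth limit allows, and whichever term dominates identifies the bottleneck. The function name and the sample numbers below are illustrative assumptions, not the dissertation's actual model.

```python
def predicted_time(flops, bytes_moved, peak_flop_rate, peak_bandwidth):
    """Lower-bound execution time (seconds): the code can go no faster
    than the issue limit or the memory-bandwidth limit permits."""
    return max(flops / peak_flop_rate, bytes_moved / peak_bandwidth)


# Hypothetical sparse kernel: 2e9 flops moving 12e9 bytes on a machine
# with 1 Gflop/s issue rate and 1 GB/s sustained bandwidth.  The memory
# term (12 s) dominates the compute term (2 s): the kernel is
# bandwidth-bound, which is the typical situation for PDE codes.
print(predicted_time(2e9, 12e9, 1e9, 1e9))  # 12.0
```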
Scalable implementations of MPI atomicity for concurrent overlapping I/O
For concurrent I/O operations, atomicity defines the results in the overlapping file regions simultaneously read/written by requesting processes. Atomicity has been well studied at the file system level, for example in the POSIX standard. In this paper, we investigate the problems arising from the implementation of MPI atomicity for concurrent overlapping write access and provide a few programming solutions. Since the MPI definition of atomicity differs from the POSIX one, an implementation that simply relies on POSIX file systems does not guarantee correct MPI semantics. To obtain a correct implementation of atomic I/O in MPI, we examine the efficiency of three approaches: 1) file locking, 2) graph-coloring, and 3) process-rank ordering. The performance complexity of these methods is analyzed, and experimental results are presented for file systems including NFS, SGI's XFS, and IBM's GPFS.
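The graph-coloring idea can be sketched as follows: build a conflict graph over processes whose file regions overlap, greedily color it, and serialize writes by color so that only non-conflicting writers proceed concurrently. This toy sketch (the function names and the half-open byte-range convention are assumptions) omits all of the actual MPI machinery.

```python
def overlaps(a, b):
    """Do two half-open byte ranges [start, end) intersect?"""
    return a[0] < b[1] and b[0] < a[1]


def color_writers(regions):
    """Greedily color processes whose file regions overlap.

    regions[i] is the byte range written by rank i.  Ranks that share a
    color have disjoint regions and may write concurrently; distinct
    colors are serialized as successive write rounds.
    """
    colors = [None] * len(regions)
    for i in range(len(regions)):
        taken = {colors[j] for j in range(i) if overlaps(regions[i], regions[j])}
        c = 0
        while c in taken:
            c += 1
        colors[i] = c
    return colors


# Ranks 0 and 1 overlap, so they land in different rounds; rank 2 is
# disjoint from both and can share round 0 with rank 0.
print(color_writers([(0, 100), (50, 150), (200, 300)]))  # [0, 1, 0]
```

Greedy coloring is not optimal in general, but for the interval-overlap graphs arising from file regions it keeps the number of serialized rounds small.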
HPCCP/CAS Workshop Proceedings 1998
This publication is a collection of extended abstracts of presentations given at the HPCCP/CAS (High Performance Computing and Communications Program/Computational Aerosciences Project) Workshop held on August 24-26, 1998, at NASA Ames Research Center, Moffett Field, California. The objective of the Workshop was to bring together the aerospace high performance computing community, consisting of airframe and propulsion companies, independent software vendors, university researchers, and government scientists and engineers. The Workshop was sponsored by the HPCCP Office at NASA Ames Research Center. The Workshop consisted of over 40 presentations, including an overview of NASA's High Performance Computing and Communications Program and the Computational Aerosciences Project; ten sessions of papers representative of the high performance computing research conducted within the Program by the aerospace industry, academia, NASA, and other government laboratories; two panel sessions; and a special presentation by Mr. James Bailey
Development and Validation of a Hierarchical Memory Model Incorporating CPU- and Memory-Operation Overlap
Distributed shared memory architectures (DSMs) such as the Origin 2000 extend the concept of single-processor cache hierarchies across an entire physically distributed multiprocessor machine. The scalability of a DSM machine is inherently tied to memory hierarchy performance, including such issues as latency hiding techniques in the architecture, global cache-coherence protocols, memory consistency models and, of course, the inherent locality of reference in algorithms of interest. In this paper, we characterize application performance with a "memory-centric" view. Using a simple mean value analysis (MVA) strategy and empirical performance data, we infer the contribution of each level in the memory system to the application's overall cycles per instruction (cpi). We account for the overlap of processor execution with memory accesses, a key parameter which is not directly measurable on the Origin systems. We infer the separate contributions of three major architecture features in the memory subsystem of the Origin 2000: cache size, outstanding loads-under-miss, and memory latency.
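The mean value analysis described above amounts, in its simplest additive form, to summing per-level stall contributions and discounting them by the fraction of memory latency hidden behind computation. The following is a minimal sketch under that assumed additive model; the function signature and the sample numbers are illustrative, not the paper's calibrated model.

```python
def cpi(cpi_core, levels, overlap):
    """Memory-centric cpi estimate (assumed additive model).

    cpi_core: cycles per instruction with a perfect memory system.
    levels:   list of (misses_per_instruction, latency_cycles), one
              entry per memory hierarchy level.
    overlap:  fraction of memory latency hidden by concurrent
              processor execution (0 = no hiding, 1 = fully hidden).
    """
    stall = sum(mpi * lat for mpi, lat in levels)
    return cpi_core + (1.0 - overlap) * stall


# Hypothetical numbers: 2% L2 misses at 10 cycles, 0.5% local-memory
# accesses at 60 cycles, half of the latency overlapped with work.
print(cpi(1.0, [(0.02, 10), (0.005, 60)], overlap=0.5))  # 1.25
```

Working backwards from a measured cpi and hardware-counter miss rates to the unknown `overlap` is the inference step the paper describes.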
Rouse Chains with Excluded Volume Interactions: Linear Viscoelasticity
Linear viscoelastic properties for a dilute polymer solution are predicted by
modeling the solution as a suspension of non-interacting bead-spring chains.
The present model, unlike the Rouse model, can describe the solution's
rheological behavior even when the solvent quality is good, since excluded
volume effects are explicitly taken into account through a narrow Gaussian
repulsive potential between pairs of beads in a bead-spring chain. The use of
the narrow Gaussian potential, which tends to the more commonly used
delta-function repulsive potential in the limit of a width parameter "d" going
to zero, enables the performance of Brownian dynamics simulations. The
simulation results, which describe the exact behavior of the model, indicate
that for chains of arbitrary but finite length, a delta-function potential
leads to equilibrium and zero shear rate properties which are identical to the
predictions of the Rouse model. On the other hand, a non-zero value of "d"
gives rise to a prediction of swelling at equilibrium, and an increase in zero
shear rate properties relative to their Rouse model values. The use of a
delta-function potential appears to be justified in the limit of infinite chain
length. The exact simulation results are compared with those obtained with an
approximate solution which is based on the assumption that the non-equilibrium
configurational distribution function is Gaussian. The Gaussian approximation
is shown to be exact to first order in the strength of excluded volume
interaction, and is found to be accurate above a threshold value of "d", for
given values of chain length and strength of excluded volume interaction.
Comment: Revised version. The long chain limit analysis has been deleted; an
improved and corrected examination of the long chain limit will appear as a
separate posting. 32 pages, 9 postscript figures, LaTeX.
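The narrow Gaussian potential described above can be written explicitly. The following is one common way to express such a potential, with v an excluded-volume parameter; the precise prefactor and symbol choices here are an assumption for illustration, not a quotation from the paper:

```latex
E(r) \;=\; v\, k_{B} T \,\frac{1}{(2\pi d^{2})^{3/2}}
        \exp\!\left(-\frac{r^{2}}{2 d^{2}}\right)
```

As the width parameter d tends to zero, the Gaussian factor tends to the Dirac delta δ(r), so the potential reduces to the delta-function repulsive potential v k_B T δ(r) referred to in the abstract.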
Empirical and Statistical Application Modeling Using On-Chip Performance Monitors
To analyze the performance of applications and architectures, both programmers and architects desire formal methods to explain anomalous behavior. To this end, we present various methods that utilize non-intrusive performance-monitoring hardware, only recently available on microprocessors, to provide further explanations of observed behavior. All the methods attempt to characterize and explain the instruction-level parallelism achieved by codes on different architectures. We also present a prototype tool automating the analysis process to exploit the advantages of the empirical and statistical methods proposed. The empirical, statistical, and hybrid methods are discussed and explained, with case study results provided. The given methods add to the wealth of tools available to programmers and architects for understanding the performance of scientific applications. Specifically, the models and tools presented provide new methods for evaluating and categorizing application performance. The empirical memory model serves to quantify the hierarchical memory performance of applications by inferring the incurred latencies of codes after the effects of latency hiding techniques are realized. The instruction-level model and its extensions model on-chip performance analytically, giving insight into inherent performance bottlenecks in superscalar architectures. The statistical model and its hybrid extension provide other methods of categorizing codes via their statistical variations. The PTERA performance tool automates the use of performance counters across platforms, making the modeling process easier still. These unique methods provide alternatives to performance modeling and categorizing not previously available, in an attempt to utilize the inherent modeling capabilities of performance monitors on commodity processors for scientific applications.