Supporting shared data structures on distributed memory architectures
Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece owned by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. A new programming environment is presented for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. The analysis and program transformations required to implement this environment are described, along with the efficiency of the resulting code on the NCUBE/7 and iPSC/2 hypercubes.
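The contrast the abstract draws, between explicit owner-by-owner decomposition and a global name space with direct access to remote data, can be illustrated with a small sketch. This is not the paper's system; the class `GlobalArray` and its `owner`/`get`/`put` methods are hypothetical names for the idea that ownership lookup and communication are handled by the environment rather than by the programmer.

```python
# Illustrative sketch (hypothetical, not the paper's implementation): a global
# index space block-partitioned across processors, addressed by global index.

class GlobalArray:
    """A global name space over block-partitioned data. Callers use global
    indices; the ownership lookup stands in for the message passing that a
    hand-decomposed program would have to write explicitly."""

    def __init__(self, size, nprocs):
        self.size = size
        self.nprocs = nprocs
        self.block = -(-size // nprocs)  # ceiling division: elements per owner
        # Each "processor" owns one contiguous piece of the array.
        self.pieces = [[0] * min(self.block, size - p * self.block)
                       for p in range(nprocs)]

    def owner(self, i):
        """Which processor holds global index i."""
        return i // self.block

    def get(self, i):
        # A real system would send a message to the owning processor;
        # here a local lookup models that communication.
        p = self.owner(i)
        return self.pieces[p][i - p * self.block]

    def put(self, i, value):
        p = self.owner(i)
        self.pieces[p][i - p * self.block] = value

a = GlobalArray(10, nprocs=4)
a.put(7, 42)   # index 7 is owned by processor 2, but the caller need not know
assert a.get(7) == 42
assert a.owner(7) == 2
```

In a message-passing program, the `put` above would require the caller to know the owner's rank and issue an explicit send, and the owner an explicit receive; the global name space hides both.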
Design and Evaluation of a Collective IO Model for Loosely Coupled Petascale Programming
Loosely coupled programming is a powerful paradigm for rapidly creating
higher-level applications from scientific programs on petascale systems,
typically using scripting languages. This paradigm is a form of many-task
computing (MTC) which focuses on the passing of data between programs as
ordinary files rather than messages. While it has the significant benefits of
decoupling producer and consumer and allowing existing application programs to
be executed in parallel with no recoding, its typical implementation using
shared file systems places a high performance burden on the overall system and
on the user who will analyze and consume the downstream data. Previous efforts
have achieved great speedups with loosely coupled programs, but have done so
with careful manual tuning of all shared file system access. In this work, we
evaluate a prototype collective IO model for file-based MTC. The model enables
efficient and easy distribution of input data files to computing nodes and
gathering of output results from them. It eliminates the need for such manual
tuning and makes the programming of large-scale clusters using a loosely
coupled model easier. Our approach, inspired by in-memory approaches to
collective operations for parallel programming, builds on fast local file
systems to provide high-speed local file caches for parallel scripts, uses a
broadcast approach to handle distribution of common input data, and uses
efficient scatter/gather and caching techniques for input and output. We
describe the design of the prototype model, its implementation on the Blue
Gene/P supercomputer, and present preliminary measurements of its performance
on synthetic benchmarks and on a large-scale molecular dynamics application.
Comment: IEEE Many-Task Computing on Grids and Supercomputers (MTAGS08), 2008
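The collective-IO pattern the abstract describes, broadcasting common input into fast node-local caches and scattering per-task inputs, can be sketched in a few lines. This is a minimal illustration under assumed names (`broadcast`, `scatter`, the cache directory layout), not the Blue Gene/P implementation.

```python
# Hypothetical sketch of the collective-IO idea for file-based MTC: common
# input is copied once into every node-local cache, per-task inputs are
# scattered round-robin. Directory names and function names are assumptions.

import os
import shutil
import tempfile

def broadcast(common_file, node_caches):
    """Place one shared input file into every node-local cache."""
    for cache in node_caches:
        shutil.copy(common_file, os.path.join(cache, os.path.basename(common_file)))

def scatter(task_files, node_caches):
    """Distribute per-task input files round-robin across node caches."""
    placement = {}
    for i, f in enumerate(task_files):
        cache = node_caches[i % len(node_caches)]
        shutil.copy(f, os.path.join(cache, os.path.basename(f)))
        placement[f] = cache
    return placement

# Demo: a temporary directory tree stands in for node-local file systems.
root = tempfile.mkdtemp()
caches = [os.path.join(root, f"node{i}") for i in range(3)]
for c in caches:
    os.makedirs(c)

common = os.path.join(root, "params.dat")
with open(common, "w") as fh:
    fh.write("shared parameters")

tasks = []
for t in range(6):
    p = os.path.join(root, f"task{t}.in")
    with open(p, "w") as fh:
        fh.write(str(t))
    tasks.append(p)

broadcast(common, caches)
placement = scatter(tasks, caches)
assert all(os.path.exists(os.path.join(c, "params.dat")) for c in caches)
assert placement[tasks[4]] == caches[1]  # round-robin: task 4 -> node 1
```

The gather direction is symmetric: outputs land in the local cache and are collected back in bulk, so the shared file system sees a few large transfers instead of many small ones.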
Java Grande Forum Report: Making Java Work for High-End Computing
This document describes the Java Grande Forum and includes its initial deliverables. These are reports that convey a succinct set of recommendations from this forum to Sun Microsystems and other purveyors of Java™ technology that will enable Grande Applications to be developed with the Java programming language.
Towards Performance Portable Programming for Distributed Heterogeneous Systems
Hardware heterogeneity is here to stay for high-performance computing.
Large-scale systems are currently equipped with multiple GPU accelerators per
compute node and are expected to incorporate more specialized hardware in the
future. This shift in the computing ecosystem offers many opportunities for
performance improvement; however, it also increases the complexity of
programming for such architectures. This work introduces a runtime framework
that enables effortless programming for heterogeneous systems while efficiently
utilizing hardware resources. The framework is integrated within a distributed
and scalable runtime system to facilitate performance portability across
heterogeneous nodes. Along with the design, this paper describes the
implementation and optimizations performed, achieving up to a 300% improvement in
a shared memory benchmark and up to a 10x improvement in distributed device
communication. Preliminary results indicate that our software incurs low
overhead and achieves a 40% improvement in a distributed Jacobi proxy application
while hiding the idiosyncrasies of the hardware.
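The core idea behind such runtimes, mapping tasks onto whatever compatible device is least busy so the programmer never targets hardware directly, can be sketched abstractly. The names below (`Device`, `schedule`) are hypothetical and do not reflect the paper's API.

```python
# Hypothetical sketch of heterogeneous task placement: each task declares the
# device kinds it can run on, and a greedy scheduler picks the least-loaded
# compatible device, hiding hardware details from the caller.

class Device:
    def __init__(self, name, kind):
        self.name, self.kind, self.load = name, kind, 0

def schedule(task_kinds, devices):
    """Greedy least-loaded placement of tasks onto compatible devices."""
    placement = []
    for kinds in task_kinds:
        candidates = [d for d in devices if d.kind in kinds]
        best = min(candidates, key=lambda d: d.load)  # ties: first in list
        best.load += 1
        placement.append(best.name)
    return placement

devices = [Device("cpu0", "cpu"), Device("gpu0", "gpu"), Device("gpu1", "gpu")]
# Three GPU-capable tasks and one CPU-only task.
tasks = [{"gpu", "cpu"}, {"gpu"}, {"gpu"}, {"cpu"}]
placement = schedule(tasks, devices)
print(placement)  # each task lands on a compatible, least-loaded device
```

A production runtime would add data-locality and transfer-cost terms to the placement decision; the load counter here is the simplest stand-in for that cost model.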