3,250 research outputs found
Analysis, Tracing, Characterization and Performance Modeling of Select ASCI Applications for BlueGene/L Using Parallel Discrete Event Simulation
Caltech's Jet Propulsion Laboratory (JPL) and Center for Advanced Computer Architecture (CACR) are conducting application and simulation analyses of Blue Gene/L[1] in order to establish a range of effectiveness of the architecture in performing important classes of computations and to determine the design sensitivity of the global interconnect network in support of real world ASCI application execution
Solving the Klein-Gordon equation using Fourier spectral methods: A benchmark test for computer performance
The cubic Klein-Gordon equation is a simple but non-trivial partial
differential equation whose numerical solution has the main building blocks
required for the solution of many other partial differential equations. In this
study, the library 2DECOMP&FFT is used in a Fourier spectral scheme to solve
the Klein-Gordon equation and strong scaling of the code is examined on
thirteen different machines for a problem size of 512^3. The results are useful
in assessing likely performance of other parallel fast Fourier transform based
programs for solving partial differential equations. The problem is chosen to
be large enough to solve on a workstation, yet also of interest to solve
quickly on a supercomputer, in particular for parametric studies. Unlike other
high performance computing benchmarks, for this problem size, the time to
solution will not be improved by simply building a bigger supercomputer.Comment: 10 page
Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study
Analytic, first-principles performance modeling of distributed-memory
applications is difficult due to a wide spectrum of random disturbances caused
by the application and the system. These disturbances (commonly called "noise")
destroy the assumptions of regularity that one usually employs when
constructing simple analytic models. Despite numerous efforts to quantify,
categorize, and reduce such effects, a comprehensive quantitative understanding
of their performance impact is not available, especially for long delays that
have global consequences for the parallel application. In this work, we
investigate various traces collected from synthetic benchmarks that mimic real
applications on simulated and real message-passing systems in order to pinpoint
the mechanisms behind delay propagation. We analyze the dependence of the
propagation speed of idle waves emanating from injected delays with respect to
the execution and communication properties of the application, study how such
delays decay under increased noise levels, and how they interact with each
other. We also show how fine-grained noise can make a system immune against the
adverse effects of propagating idle waves. Our results contribute to a better
understanding of the collective phenomena that manifest themselves in
distributed-memory parallel applications.Comment: 10 pages, 9 figures; title change
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Previous studies have reported that common dense linear algebra operations do
not achieve speed up by using multiple geographical sites of a computational
grid. Because such operations are the building blocks of most scientific
applications, conventional supercomputers are still strongly predominant in
high-performance computing and the use of grids for speeding up large-scale
scientific problems is limited to applications exhibiting parallelism at a
higher level. We have identified two performance bottlenecks in the distributed
memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear
algebra library. First, because ScaLAPACK assumes a homogeneous communication
network, the implementations of ScaLAPACK algorithms lack locality in their
communication pattern. Second, the number of messages sent in the ScaLAPACK
algorithms is significantly greater than other algorithms that trade flops for
communication. In this paper, we present a new approach for computing a QR
factorization -- one of the main dense linear algebra kernels -- of tall and
skinny matrices in a grid computing environment that overcomes these two
bottlenecks. Our contribution is to articulate a recently proposed algorithm
(Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in
order to confine intensive communications (ScaLAPACK calls) within the
different geographical sites. An experimental study conducted on the Grid'5000
platform shows that the resulting performance increases linearly with the
number of geographical sites on large-scale problems (and is in particular
consistently higher than ScaLAPACK's).Comment: Accepted at IPDPS10. (IEEE International Parallel & Distributed
Processing Symposium 2010 in Atlanta, GA, USA.
Behavioral types in programming languages
A recent trend in programming language research is to use behav- ioral type theory to ensure various correctness properties of large- scale, communication-intensive systems. Behavioral types encompass concepts such as interfaces, communication protocols, contracts, and choreography. The successful application of behavioral types requires a solid understanding of several practical aspects, from their represen- tation in a concrete programming language, to their integration with other programming constructs such as methods and functions, to de- sign and monitoring methodologies that take behaviors into account. This survey provides an overview of the state of the art of these aspects, which we summarize as the pragmatics of behavioral types
Revisiting Matrix Product on Master-Worker Platforms
This paper is aimed at designing efficient parallel matrix-product algorithms
for heterogeneous master-worker platforms. While matrix-product is
well-understood for homogeneous 2D-arrays of processors (e.g., Cannon algorithm
and ScaLAPACK outer product algorithm), there are three key hypotheses that
render our work original and innovative:
- Centralized data. We assume that all matrix files originate from, and must
be returned to, the master.
- Heterogeneous star-shaped platforms. We target fully heterogeneous
platforms, where computational resources have different computing powers.
- Limited memory. Because we investigate the parallelization of large
problems, we cannot assume that full matrix panels can be stored in the worker
memories and re-used for subsequent updates (as in ScaLAPACK).
We have devised efficient algorithms for resource selection (deciding which
workers to enroll) and communication ordering (both for input and result
messages), and we report a set of numerical experiments on various platforms at
Ecole Normale Superieure de Lyon and the University of Tennessee. However, we
point out that in this first version of the report, experiments are limited to
homogeneous platforms
Overlapping communication and computation by using a hybrid MPI/SMPSs approach
A previous version of this document was submitted for publication by october 2008.Communication overhead is one of the dominant factors that affect performance in high-performance computing systems. To reduce the negative impact of communication, programmers overlap communication and computation by using asynchronous communication primitives. This increases code complexity, requiring more effort to write parallel code and making less readable code. This paper presents the hybrid use of MPI and SMPSs (SMP superscalar), a task-based shared-memory programming model, enhanced with a restart mechanism allowing the programmer to introduce the asynchronism that is necessary to enable the effective communication/computation overlap in a productive way. We demonstrate the hybrid use of MPI/SMPSs with the high-performance LINPACK benchmark, which uses the lookahead technique to overlap communication and computation. MPI/SMPSs improves the performance of a pure MPI with look-ahead by 7,6% on a 1024 processors machine. In addition to better performance, hybrid MPI/SMPSs substantially reduces code complexity, it is less sensitive to network bandwidth and operating system noise, and improves the use of main memory.Postprint (published version
- …