4,093 research outputs found
Automatic Methods for Hiding Latency in Parallel and Distributed Computation
In this paper we describe methods for mitigating the degradation in performance caused by high latencies in parallel and distributed networks. For example, given any dataflow type of algorithm that runs in T steps on an n-node ring with unit link delays, we show how to run the algorithm in O(T) steps on any n-node bounded-degree connected network with average link delay O(1). This is a significant improvement over prior approaches to latency hiding, which require slowdowns proportional to the maximum link delay. In the case when the network has average link delay d_ave, our simulation runs in O(√d_ave · T) steps using n/√d_ave processors, thereby preserving efficiency. We also show how to efficiently simulate an n × n array with unit link delays using slowdown Õ(d_ave^(2/3)) on a two-dimensional array with average link delay d_ave. Last, we present results for the case in which large local databases are involved in the computation.
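To make the efficiency claim concrete, here is a small worked instance of the average-delay bound (the value d_ave = 16 is illustrative, not taken from the paper):

```latex
% Illustrative instance of the average-delay bound; d_ave = 16 is assumed.
\[
  \text{steps} = O\!\left(\sqrt{d_{\mathrm{ave}}}\,T\right) = O(4T),
  \qquad
  \text{processors} = \frac{n}{\sqrt{d_{\mathrm{ave}}}} = \frac{n}{4},
  \qquad
  \text{work} = O(4T)\cdot\frac{n}{4} = O(nT).
\]
```

The total work matches the original n-processor, T-step computation up to constants, which is the sense in which the simulation preserves efficiency.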
Optimization of image processing algorithms via communication hiding in distributed processing systems
Real-time image processing is an important topic studied in the realm of computer systems. The task of real-time image processing is found in a wide range of applications, from multimedia systems to automobiles to military systems. Typically these systems require high throughput and low latency to perform at their required specifications. Therefore, hardware, software, and communications optimizations in these systems are very important factors in meeting these specifications. This thesis analyzes the implementation and optimization of a real-world image processing system destined for an aircraft environment. It discusses the steps of optimizing the software in the system, and then looks at how the system can be distributed over multiple processing nodes via functional pipelining. Finally, it analyzes whether communication hiding is even necessary given today's high-speed networking and communication interfaces.
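Communication hiding of the kind the thesis studies generally means overlapping interprocessor transfers with computation. A minimal sketch in C with nonblocking MPI, assuming a two-stage functional pipeline and double buffering (the stage roles, frame size, and function names are illustrative, not taken from the thesis):

```c
/* Minimal sketch of communication hiding between two pipeline stages:
 * while frame k is being received, frame k-1 is processed.
 * Assumes at least 2 MPI ranks; names and sizes are illustrative. */
#include <mpi.h>
#include <stdlib.h>

#define FRAME_PIXELS (640 * 480)

static void process_frame(unsigned char *frame) {
    /* Placeholder for the per-node image processing work. */
    for (int i = 0; i < FRAME_PIXELS; i++)
        frame[i] = 255 - frame[i];          /* e.g., invert */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    unsigned char *cur  = calloc(FRAME_PIXELS, 1);
    unsigned char *next = calloc(FRAME_PIXELS, 1);
    int nframes = 100;

    if (rank == 1) {                        /* downstream pipeline stage */
        MPI_Request req;
        MPI_Recv(cur, FRAME_PIXELS, MPI_UNSIGNED_CHAR, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int k = 1; k < nframes; k++) {
            /* Post the receive for frame k, then process frame k-1:
             * the transfer proceeds while we compute. */
            MPI_Irecv(next, FRAME_PIXELS, MPI_UNSIGNED_CHAR, 0, k,
                      MPI_COMM_WORLD, &req);
            process_frame(cur);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            unsigned char *tmp = cur; cur = next; next = tmp;  /* swap buffers */
        }
        process_frame(cur);                 /* last frame */
    } else if (rank == 0) {                 /* upstream stage: send frames */
        for (int k = 0; k < nframes; k++)
            MPI_Send(cur, FRAME_PIXELS, MPI_UNSIGNED_CHAR, 1, k,
                     MPI_COMM_WORLD);
    }

    free(cur); free(next);
    MPI_Finalize();
    return 0;
}
```

The double buffer lets the receive for frame k proceed while frame k-1 is processed; when the per-frame compute time exceeds the transfer time, the communication cost is fully hidden, which is exactly the trade-off the thesis's final question weighs against modern interconnect speeds.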
MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface
Application development for distributed computing "Grids" can benefit from
tools that variously hide or enable application-level management of critical
aspects of the heterogeneous environment. As part of an investigation of these
issues, we have developed MPICH-G2, a Grid-enabled implementation of the
Message Passing Interface (MPI) that allows a user to run MPI programs across
multiple computers, at the same or different sites, using the same commands
that would be used on a parallel computer. This library extends the Argonne
MPICH implementation of MPI to use services provided by the Globus Toolkit for
authentication, authorization, resource allocation, executable staging, and
I/O, as well as for process creation, monitoring, and control. Various
performance-critical operations, including startup and collective operations,
are configured to exploit network topology information. The library also
exploits MPI constructs for performance management; for example, the MPI
communicator construct is used for application-level discovery of, and
adaptation to, both network topology and network quality-of-service mechanisms.
We describe the MPICH-G2 design and implementation, present performance
results, and review application experiences, including record-setting
distributed simulations.
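As a rough illustration of the communicator-based adaptation, an application (or library) can split MPI_COMM_WORLD by site and stage collectives hierarchically, fast local links first, slow wide-area links second. The site_id() mapping below is a hypothetical stand-in for the topology information MPICH-G2 obtains via the Globus Toolkit; it is not the library's actual API:

```c
/* Rough sketch of application-level topology adaptation with MPI
 * communicators: ranks are grouped by site so collectives can be
 * staged site-locally first. site_id() is a hypothetical stand-in. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical: map a global rank to the site (cluster) it runs on. */
static int site_id(int world_rank) {
    return world_rank / 4;   /* assumption: 4 ranks per site */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split the world communicator into one communicator per site. */
    MPI_Comm site_comm;
    MPI_Comm_split(MPI_COMM_WORLD, site_id(rank), rank, &site_comm);

    int site_rank;
    MPI_Comm_rank(site_comm, &site_rank);

    /* Two-level reduction: reduce inside each site over fast local
     * links, then combine per-site results over the wide-area links. */
    double local = 1.0, site_sum = 0.0, global_sum = 0.0;
    MPI_Reduce(&local, &site_sum, 1, MPI_DOUBLE, MPI_SUM, 0, site_comm);

    MPI_Comm leaders;
    MPI_Comm_split(MPI_COMM_WORLD, site_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leaders);
    if (leaders != MPI_COMM_NULL) {
        MPI_Reduce(&site_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                   leaders);
        if (rank == 0)
            printf("global sum = %f\n", global_sum);
        MPI_Comm_free(&leaders);
    }

    MPI_Comm_free(&site_comm);
    MPI_Finalize();
    return 0;
}
```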
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighbor searching, and we discuss the present and future challenges we see for
exascale simulation, in particular very fine-grained task parallelism. We
also discuss the software management, code peer review, and continuous
integration testing required for a project of this complexity.
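A minimal sketch of the hybrid pattern the abstract describes, with MPI ranks across nodes and OpenMP threads within each rank (the loop body is a placeholder, not GROMACS code):

```c
/* Minimal sketch of hybrid parallelism: MPI between nodes, OpenMP
 * threads within each rank. The force loop is a placeholder. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N_LOCAL 100000   /* particles owned by this rank (assumed) */

int main(int argc, char **argv) {
    int provided;
    /* Request thread support, since OpenMP threads run inside each rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double f[N_LOCAL];
    double local_energy = 0.0;

    /* Within-node parallelism: OpenMP threads share the force array. */
    #pragma omp parallel for reduction(+:local_energy)
    for (int i = 0; i < N_LOCAL; i++) {
        f[i] = 0.0;                 /* placeholder "force" computation */
        local_energy += 1e-6 * i;   /* placeholder per-particle energy */
    }

    /* Between-node parallelism: combine per-rank energies over MPI. */
    double total_energy = 0.0;
    MPI_Reduce(&local_energy, &total_energy, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("total energy (placeholder) = %f\n", total_energy);

    MPI_Finalize();
    return 0;
}
```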
Improving the scalability of parallel N-body applications with an event driven constraint based execution model
The scalability and efficiency of graph applications are significantly
constrained by conventional systems and their supporting programming models.
Technology trends like multicore, manycore, and heterogeneous system
architectures are introducing further challenges and possibilities for emerging
application domains such as graph applications. This paper explores the space
of effective parallel execution of ephemeral graphs that are dynamically
generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The
workloads are expressed using the semantics of an Exascale computing execution
model called ParalleX. For comparison, results using conventional execution
model semantics are also presented. We find that runtime load balancing and
automatic parallelism discovery, enabled by the advanced semantics for
Exascale computing, improve efficiency.
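For context, the heart of Barnes-Hut is the multipole acceptance test that decides when an entire subtree can be summarized as a single point mass, which is also what makes the generated trees ephemeral and the workload dynamic. A minimal sketch (the node layout and the opening angle θ = 0.5 are assumptions, not taken from the paper):

```c
/* Minimal sketch of the Barnes-Hut multipole acceptance criterion:
 * a tree node far enough away (s/d < theta) is treated as a single
 * point mass at its center of mass. Layout and theta are assumed. */
#include <math.h>

typedef struct Node {
    double mass;           /* total mass of bodies in this subtree */
    double cx, cy;         /* center of mass */
    double size;           /* side length s of the node's region */
    struct Node *child[4]; /* quadtree children, NULL if absent */
} Node;

static const double THETA = 0.5;   /* opening angle (assumed) */
static const double G = 6.674e-11; /* gravitational constant */

/* Accumulate the force on a body at (x, y) from the subtree at n. */
void accumulate_force(const Node *n, double x, double y,
                      double *fx, double *fy) {
    if (!n || n->mass == 0.0)
        return;
    double dx = n->cx - x, dy = n->cy - y;
    double d = sqrt(dx * dx + dy * dy) + 1e-9;  /* softening */

    int is_leaf = !n->child[0] && !n->child[1] &&
                  !n->child[2] && !n->child[3];

    if (is_leaf || n->size / d < THETA) {
        /* Far enough away (or a leaf): use the aggregate point mass. */
        double f = G * n->mass / (d * d);
        *fx += f * dx / d;
        *fy += f * dy / d;
    } else {
        /* Too close: open the node and recurse into its children. */
        for (int i = 0; i < 4; i++)
            accumulate_force(n->child[i], x, y, fx, fy);
    }
}
```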