Conflict-free star-access in parallel memory systems
We study conflict-free data distribution schemes in parallel memories in multiprocessor system architectures. Given a host graph G, the problem is to map the nodes of G into memory modules such that any instance of a template type T in G can be accessed without memory conflicts. A conflict occurs if two or more nodes of T are mapped to the same memory module. The mapping algorithm should: (i) be fast in terms of data access (possibly mapping each node in constant time); (ii) minimize the required number of memory modules for accessing any instance in G of the given template type; and (iii) guarantee load balancing on the modules. In this paper, we consider conflict-free access to star templates, i.e., to any node of G along with all of its neighbors. Such a template type arises in many classical algorithms like breadth-first search in a graph, message broadcasting in networks, and nearest neighbor based approximation in numerical computation. We consider the star-template access problem on two specific host graphs, tori and hypercubes, that are also popular interconnection network topologies. The proposed conflict-free mappings on these graphs are fast, use an optimal or provably good number of memory modules, and guarantee load balancing. (C) 2006 Elsevier Inc. All rights reserved
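The three requirements in the abstract above (constant-time mapping, few modules, balanced load) can be illustrated with a small sketch. It is not taken from the paper: it assumes a 2D torus whose side lengths are multiples of 5 and uses the classical linear assignment `(i + 2j) mod 5`. The function names `module_of`, `star`, `is_conflict_free`, and `module_loads` are hypothetical.

```python
from collections import Counter

def module_of(i, j, num_modules=5):
    # Assumed mapping: a linear function of the coordinates, computed in O(1).
    return (i + 2 * j) % num_modules

def star(i, j, rows, cols):
    # A star template on a 2D torus: the node plus its four wrap-around neighbors.
    return [
        (i, j),
        ((i - 1) % rows, j),
        ((i + 1) % rows, j),
        (i, (j - 1) % cols),
        (i, (j + 1) % cols),
    ]

def is_conflict_free(rows, cols):
    # True if every star touches five distinct memory modules.
    for i in range(rows):
        for j in range(cols):
            modules = {module_of(a, b) for a, b in star(i, j, rows, cols)}
            if len(modules) != 5:
                return False
    return True

def module_loads(rows, cols):
    # Number of torus nodes stored in each module, to check load balance.
    return Counter(module_of(i, j) for i in range(rows) for j in range(cols))
```

Five modules is the minimum here, since a star on such a torus contains five nodes. When a side length is not a multiple of 5, the wrap-around edges can break the pattern, which is one reason a general scheme needs more care than this sketch.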
Performance of a parallel code for the Euler equations on hypercube computers
The performance of hypercubes was evaluated on a computational fluid dynamics problem, and the parallel environment issues that must be addressed were considered, such as algorithm changes, implementation choices, programming effort, and programming environment. The evaluation focuses on a widely used fluid dynamics code, FLO52, which solves the two-dimensional steady Euler equations describing flow around an airfoil. The code development experience is described, including interacting with the operating system, utilizing the message-passing communication system, and code modifications necessary to increase parallel efficiency. Results from two hypercube parallel computers (a 16-node iPSC/2 and a 512-node NCUBE/ten) are discussed and compared. In addition, a mathematical model of the execution time was developed as a function of several machine and algorithm parameters. This model accurately predicts the actual run times obtained and is used to explore the performance of the code in interesting but yet physically realizable regions of the parameter space. Based on this model, predictions about future hypercubes are made
Optimizing Data Intensive Flows for Networks on Chips
Data flow analysis and optimization is considered for homogeneous rectangular
mesh networks. We propose a flow matrix equation which allows a closed-form
characterization of the nature of the minimal time solution, speedup and a
simple method to determine when and how much load to distribute to processors.
We also give a rigorous mathematical proof that the optimal flow matrix
solution exists and is unique. The methodology introduced
here is applicable to many interconnection networks and switching protocols (as
an example we examine toroidal networks and hypercube networks in this paper).
An important application is improving chip area and chip scalability for
networks on chips processing divisible style loads
Mapping unstructured grid problems to the connection machine
We present a highly parallel graph mapping technique that enables one to solve unstructured grid problems on massively parallel computers. Many implicit and explicit methods for solving discretized partial differential equations require each point in the discretization to exchange data with its neighboring points every time step or iteration. The time spent communicating can limit the high performance promised by massively parallel computing. To eliminate this bottleneck, we map the graph of the irregular problem to the graph representing the interconnection topology of the computer such that the sum of the distances that the messages travel is minimized. We show that, in comparison to a naive assignment of processors, our heuristic mapping algorithm significantly reduces the communication time on the Connection Machine, CM-2
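The objective named in the abstract above, the sum of the distances that messages travel, can be sketched as follows. This is an illustration under stated assumptions, not the authors' heuristic: it assumes a hypercube interconnect, where the distance between two processors is the Hamming distance of their addresses, and it uses a plain pairwise-swap hill climb. The names `hamming`, `comm_cost`, and `improve_by_swaps` are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    # Hop distance between two hypercube node addresses.
    return bin(a ^ b).count("1")

def comm_cost(edges, placement):
    # Sum of hop distances over all problem-graph edges;
    # placement[v] is the processor address assigned to grid point v.
    return sum(hamming(placement[u], placement[v]) for u, v in edges)

def improve_by_swaps(edges, placement):
    # Simple hill climb (an assumption, not the paper's method): keep any
    # swap of two assignments that lowers the total cost, until none does.
    placement = list(placement)
    best = comm_cost(edges, placement)
    improved = True
    while improved:
        improved = False
        for u, v in combinations(range(len(placement)), 2):
            placement[u], placement[v] = placement[v], placement[u]
            cost = comm_cost(edges, placement)
            if cost < best:
                best = cost
                improved = True
            else:
                # Undo the swap if it did not help.
                placement[u], placement[v] = placement[v], placement[u]
    return placement, best

# A 4-point ring placed naively on a 2-dimensional hypercube.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
naive = [0, 1, 2, 3]
```

For the ring, the naive placement costs 6 hops per exchange round, while a Gray-code-like placement such as `[0, 1, 3, 2]` costs 4, which is the lower bound for four edges.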
SFC-based Communication Metadata Encoding for Adaptive Mesh
This volume of the series “Advances in Parallel Computing” contains the proceedings of the International Conference on Parallel Programming – ParCo 2013 – held from 10 to 13 September 2013 in Garching, Germany. The conference was hosted by the Technische Universität München (Department of Informatics) and the Leibniz Supercomputing Centre. The present paper studies two adaptive mesh refinement (AMR) codes
whose grids rely on recursive subdivision in combination with space-filling curves
(SFCs). A non-overlapping domain decomposition based upon these SFCs yields
several well-known advantageous properties with respect to communication demands,
balancing, and partition connectivity. However, the administration of the
meta data, i.e. to track which partitions exchange data in which cardinality, is nontrivial
due to the SFC’s fractal meandering and the dynamic adaptivity. We introduce
an analysed tree grammar for the meta data that restricts it without loss of
information hierarchically along the subdivision tree and applies run length encoding.
Hence, its meta data memory footprint is very small, and it can be computed
and maintained on-the-fly even for permanently changing grids. It facilitates a fork-join
pattern for shared data parallelism. And it facilitates replicated data parallelism
tackling latency and bandwidth constraints respectively due to communication in
the background and reduces memory requirements by avoiding adjacency information
stored per element. We demonstrate this by means of shared and distributed
parallelized domain decompositions. This work was supported by the German Research Foundation (DFG) as part of the
Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89). It is
partially based on work supported by Award No. UK-c0020, made by the King Abdullah
University of Science and Technology (KAUST)
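The run-length encoding step mentioned in the abstract above can be sketched in a few lines. This is a generic illustration, not the paper's tree grammar: it assumes the communication meta data for one partition boundary is a sequence of neighbor partition ids, one per shared face, and the names `rle_encode` and `rle_decode` are hypothetical.

```python
from itertools import groupby

def rle_encode(values):
    # Collapse each run of equal neighbor ids into a (neighbor_id, count) pair.
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def rle_decode(pairs):
    # Expand (neighbor_id, count) pairs back into the original sequence.
    return [value for value, count in pairs for _ in range(count)]
```

Because a space-filling curve keeps partition boundaries largely contiguous, long runs of the same neighbor id are plausible, which is what makes such an encoding compact.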