Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph
partitioning, together with applications and future research directions.
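Balanced graph partitioning is usually stated as minimizing the number of cut edges while keeping every block below (1 + eps) times the average block size. The sketch below only evaluates those two quantities for a given assignment; it is not a partitioning algorithm from the survey, and the names `edge_cut`, `imbalance` and `is_balanced` are hypothetical.

```python
from collections import Counter

def edge_cut(edges, part):
    """Number of edges whose endpoints are assigned to different blocks."""
    return sum(1 for u, v in edges if part[u] != part[v])

def imbalance(part, k):
    """Largest block size divided by the ideal size n/k (1.0 means perfect balance)."""
    sizes = Counter(part)
    return max(sizes.values()) / (len(part) / k)

def is_balanced(part, k, eps):
    """Balance constraint: every block holds at most (1 + eps) * n/k vertices."""
    return imbalance(part, k) <= 1.0 + eps

# A 6-cycle split into two contiguous halves: 2 cut edges, perfect balance.
ring = [(i, (i + 1) % 6) for i in range(6)]
halves = [0, 0, 0, 1, 1, 1]
print(edge_cut(ring, halves), imbalance(halves, 2))  # 2 1.0
```

A partitioner searches over `part` to minimize `edge_cut` subject to `is_balanced`; the evaluation functions stay the same regardless of the search strategy.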
Circuit simulation using distributed waveform relaxation techniques
Simulation plays an important role in the design of integrated circuits. Because fabrication involves high costs and long delays, simulation is commonly used to verify functionality and to predict performance before fabrication. This thesis describes the analysis, implementation and performance evaluation of a distributed memory parallel waveform relaxation technique for the electrical simulation of MOS VLSI circuits. The waveform relaxation technique exhibits inherent parallelism because the circuit is partitioned into a number of sub-circuits, which can be simulated concurrently on parallel processors. Different forms of parallelism in the direct method and in the waveform relaxation technique are studied. Single-queue and distributed-queue approaches to implementing parallel waveform relaxation on distributed memory machines are analyzed, together with their performance implications. The distributed-queue approach, selected to exploit the coarse-grain parallelism across sub-circuits, is described. Parallel waveform relaxation programs based on the Gauss-Seidel and Gauss-Jacobi techniques are implemented on a network of eight Transputers. Static and dynamic load balancing strategies are studied, and a dynamic load balancing algorithm is developed and implemented. Results of the parallel implementation are analyzed to identify bottlenecks. This thesis demonstrates the applicability of a low-cost distributed memory multicomputer system to the simulation of MOS VLSI circuits. Speed-up measurements show that a fivefold improvement in computation speed can be achieved using a full-window parallel Gauss-Jacobi waveform relaxation algorithm. Analysis of overheads shows that load imbalance is the major source of overhead and that the fraction of the computation that must be performed sequentially is very low.
Communication overhead depends on the nature of the parallel architecture and on the design of the communication mechanisms. The run-time environment (parallel processing framework) developed in this research exploits features of the Transputer architecture to reduce the effect of communication overhead, by overlapping computation with communication and by running communication processes at a higher priority. This research will contribute to the development of low-cost, high-performance workstations for computer-aided design and analysis of VLSI circuits.
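The difference between the Gauss-Jacobi and Gauss-Seidel variants can be illustrated on a toy problem. The sketch below assumes a linear system x' = A x in place of the MOS circuit equations, treats each unknown as its own sub-circuit, and integrates each sub-circuit over the full time window with backward Euler while the other waveforms are held fixed. The function `waveform_relaxation` and its parameters are hypothetical and run sequentially, unlike the Transputer implementation described above.

```python
def waveform_relaxation(A, x0, h, steps, seidel=False, tol=1e-10, max_iter=200):
    """Full-window waveform relaxation for x' = A x with one unknown per sub-circuit.

    Each sweep integrates every subsystem over the whole window with backward
    Euler, treating the other subsystems' waveforms as known inputs.
    Gauss-Jacobi reads only waveforms from the previous sweep, so all subsystems
    could run in parallel; Gauss-Seidel reads waveforms already updated in the
    current sweep, which introduces a sequential dependency.
    Returns (waveforms, sweeps_used).
    """
    n = len(x0)
    # Initial guess: constant waveforms equal to the initial condition.
    waves = [[x0[i]] * (steps + 1) for i in range(n)]
    for sweep in range(1, max_iter + 1):
        new = [None] * n
        for i in range(n):
            # Pick the coupling waveforms seen by subsystem i in this sweep.
            src = [new[j] if (seidel and new[j] is not None) else waves[j]
                   for j in range(n)]
            xi = [x0[i]]
            for k in range(steps):
                coupling = sum(A[i][j] * src[j][k + 1] for j in range(n) if j != i)
                # Backward Euler, implicit only in the subsystem's own diagonal term.
                xi.append((xi[k] + h * coupling) / (1.0 - h * A[i][i]))
            new[i] = xi
        # Largest change between consecutive sweeps, over all waveforms and times.
        diff = max(abs(a - b) for i in range(n) for a, b in zip(new[i], waves[i]))
        waves = new
        if diff < tol:
            return waves, sweep
    return waves, max_iter

A = [[-2.0, 1.0], [1.0, -2.0]]
jacobi, sweeps_j = waveform_relaxation(A, [1.0, 0.0], 0.01, 100)
seidel, sweeps_s = waveform_relaxation(A, [1.0, 0.0], 0.01, 100, seidel=True)
print(sweeps_j, sweeps_s, jacobi[0][-1], seidel[0][-1])
```

At convergence both variants reproduce the backward Euler solution of the coupled system; Gauss-Seidel typically needs fewer sweeps, while Gauss-Jacobi exposes more parallelism per sweep, which is the trade-off the thesis studies.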
Dynamic Load Balancing Techniques for Particulate Flow Simulations
Parallel multiphysics simulations often suffer from load imbalances
originating from the applied coupling of algorithms with spatially and
temporally varying workloads. It is thus desirable to minimize these imbalances
to reduce the time to solution and to better utilize the available hardware
resources. Taking particulate flows as an illustrating example application, we
present and evaluate load balancing techniques that tackle this challenging
task. This involves a load estimation step in which the currently generated
workload is predicted. We describe in detail how such a workload estimator can
be developed. In a second step, load distribution strategies like space-filling
curves or graph partitioning are applied to dynamically distribute the load
among the available processes. To compare and analyze their performance, we
apply these techniques to a benchmark scenario and observe a reduction of the
load imbalances by almost a factor of four. This results in a decrease of the
overall runtime by 14% for space-filling curves.
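A space-filling-curve distribution can be sketched in a few lines: order the blocks along the curve, then cut the ordered list into contiguous pieces of roughly equal predicted load. The abstract does not say which curve is used; the sketch assumes a Morton (Z-order) curve on 2D block coordinates, and the weights stand in for the output of a workload estimator. `morton_key` and `sfc_partition` are hypothetical names, and all weights are assumed positive.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of x and y to obtain the Z-order (Morton) index."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)
        key |= ((y >> b) & 1) << (2 * b + 1)
    return key

def sfc_partition(blocks, weights, nprocs):
    """Assign blocks to processes by cutting the Morton order into contiguous
    pieces of roughly equal total weight.

    blocks:  list of (x, y) integer block coordinates
    weights: predicted load per block (assumed positive)
    Returns a list with the process rank of each block.
    """
    order = sorted(range(len(blocks)), key=lambda i: morton_key(*blocks[i]))
    total = sum(weights)
    assignment = [0] * len(blocks)
    acc = 0.0
    for i in order:
        # Process p owns the load interval [p, p + 1) * total / nprocs; a block
        # goes to the process whose interval contains its load midpoint.
        mid = acc + weights[i] / 2
        assignment[i] = min(int(mid * nprocs / total), nprocs - 1)
        acc += weights[i]
    return assignment

# 4 x 4 blocks with uniform predicted load, distributed over 4 processes.
blocks = [(x, y) for y in range(4) for x in range(4)]
print(sfc_partition(blocks, [1.0] * 16, 4))
```

Because neighboring positions on the curve tend to be spatially close, each process receives a compact region, and rebalancing after the load estimate changes only shifts the cut points along the curve.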
Domain decomposition and locality optimization for large-scale lattice Boltzmann simulations
We present a simple, parallel and distributed algorithm for setting up and
partitioning a sparse representation of a regular discretized simulation
domain. The method scales to large numbers of processes, even for complex
geometries, and ensures load balance between the domains, reasonable
communication interfaces, and good data locality within each domain. Applied
to a list-based lattice Boltzmann flow solver, this scheme achieves flow solver
performance similar to, or even higher than, that obtained with widely used
graph partitioning tools such as METIS and PT-SCOTCH.
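A sparse, list-based representation stores only fluid cells together with an explicit neighbor table, so solid regions of a complex geometry cost no memory. The sketch below builds such a list from a 2D fluid mask with four neighbor directions and splits it into contiguous ranges of equal fluid-cell count. It is a sequential illustration only: the row-major enumeration, the neighbor set and the names `build_sparse_domain` and `split_equal` are assumptions, not the parallel setup algorithm of the paper.

```python
# Neighbor offsets (east, north, west, south); the rest direction is omitted.
OFFSETS = [(1, 0), (0, 1), (-1, 0), (0, -1)]

def build_sparse_domain(mask):
    """Build a list-based representation from a 2D fluid mask indexed mask[y][x].

    Returns (cells, neighbors):
      cells[i]        = (x, y) of the i-th fluid cell, in row-major order
      neighbors[i][d] = index of the fluid neighbor in direction d, or -1
    """
    index = {}
    cells = []
    for y, row in enumerate(mask):
        for x, is_fluid in enumerate(row):
            if is_fluid:
                index[(x, y)] = len(cells)
                cells.append((x, y))
    neighbors = [[index.get((x + dx, y + dy), -1) for dx, dy in OFFSETS]
                 for x, y in cells]
    return cells, neighbors

def split_equal(n_cells, nprocs):
    """Contiguous index ranges whose fluid-cell counts differ by at most one."""
    base, extra = divmod(n_cells, nprocs)
    ranges, start = [], 0
    for p in range(nprocs):
        end = start + base + (1 if p < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

# 3 x 3 domain with a solid obstacle in the center.
mask = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
cells, neighbors = build_sparse_domain(mask)
print(len(cells), split_equal(len(cells), 3))
```

Balancing on fluid cells rather than on the bounding box is what keeps the load even for irregular geometries; the ordering of the list then determines data locality and the size of the communication interfaces.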
Algorithms for Mapping Parallel Processes onto Grid and Torus Architectures
Static mapping is the assignment of parallel processes to the processing
elements (PEs) of a parallel system, where the assignment does not change
during the application's lifetime. In our scenario we model an application's
computations and their dependencies by an application graph. This graph is
first partitioned into (nearly) equally sized blocks. These blocks need to
communicate at block boundaries. To assign the processes to PEs, our goal is to
compute a communication-efficient bijective mapping between the blocks and the
PEs.
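To make the bijective block-to-PE mapping concrete, the sketch below implements a generic greedy heuristic: repeatedly pick the unmapped block with the most communication to already-mapped blocks and place it on the free PE that minimizes weighted hop distance to its mapped neighbors. This is a textbook-style greedy scheme chosen for illustration, not one of the new algorithms from the paper. It assumes blocks are numbered 0..n-1 with n equal to the number of PEs on a 2D grid or torus, and `hop_distance`, `greedy_map` and `mapping_cost` are hypothetical names.

```python
def hop_distance(a, b, dims, torus=False):
    """Hop count between PEs a = (x, y) and b on a grid or torus of size dims."""
    d = 0
    for ai, bi, n in zip(a, b, dims):
        delta = abs(ai - bi)
        d += min(delta, n - delta) if torus else delta
    return d

def greedy_map(comm, dims, torus=False):
    """Greedily map blocks 0..n-1 to the n PEs, one block per PE.

    comm: dict {(u, v): volume} describing symmetric block communication.
    Returns a dict block -> PE coordinate.
    """
    nblocks = dims[0] * dims[1]
    pes = [(x, y) for y in range(dims[1]) for x in range(dims[0])]
    adj = [dict() for _ in range(nblocks)]
    for (u, v), w in comm.items():
        adj[u][v] = adj[u].get(v, 0) + w
        adj[v][u] = adj[v].get(u, 0) + w
    # Seed: the block with the largest total communication goes to the first PE.
    first = max(range(nblocks), key=lambda b: sum(adj[b].values()))
    mapping = {first: pes[0]}
    free = set(pes[1:])
    while len(mapping) < nblocks:
        # Next block: largest communication volume to blocks already mapped.
        b = max((u for u in range(nblocks) if u not in mapping),
                key=lambda u: sum(w for v, w in adj[u].items() if v in mapping))
        # Place it where its weighted distance to mapped neighbors is smallest.
        pe = min(sorted(free),
                 key=lambda p: sum(w * hop_distance(p, mapping[v], dims, torus)
                                   for v, w in adj[b].items() if v in mapping))
        mapping[b] = pe
        free.remove(pe)
    return mapping

def mapping_cost(comm, mapping, dims, torus=False):
    """Total communication volume weighted by hop distance."""
    return sum(w * hop_distance(mapping[u], mapping[v], dims, torus)
               for (u, v), w in comm.items())

# Four blocks communicating along a path, mapped onto a 2 x 2 grid.
path = {(0, 1): 1, (1, 2): 1, (2, 3): 1}
m = greedy_map(path, (2, 2))
print(m, mapping_cost(path, m, (2, 2)))
```

Swapping in a different seed rule or placement criterion changes only the two `max`/`min` selections, which is one of the degrees of freedom the text refers to.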
This approach of partitioning followed by bijective mapping has many degrees
of freedom. Thus, users and developers of parallel applications need to know
more about which choices work for which application graphs and which parallel
architectures. To this end, we not only develop new mapping algorithms (derived
from known greedy methods) but also perform extensive experiments involving
different classes of application graphs (meshes and complex networks),
architectures of parallel computers (grids and tori), as well as different
partitioners and mapping algorithms. Surprisingly, the quality of the
partitions, unless very poor, has little influence on the quality of the
mapping.
More importantly, one of our new mapping algorithms always yields the best
results in terms of the quality measure maximum congestion when the application
graphs are complex networks. For meshes as application graphs, this
mapping algorithm always leads in terms of both maximum congestion and maximum
dilation, another common quality measure.
Comment: Accepted at PDP-201
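Maximum dilation and maximum congestion depend on how messages are routed. The sketch below assumes dimension-order (XY) routing, with each communicating pair routed once from u to v and the shorter direction taken per dimension on a torus. Dilation is then the longest route in hops, and congestion is the largest total volume crossing any directed link. The routing model and the names `xy_route` and `dilation_and_congestion` are assumptions; the paper may define the measures with a different routing or weighting.

```python
def xy_route(src, dst, dims, torus=False):
    """Directed links used by dimension-order routing: first x, then y.

    On a torus each dimension goes the shorter way around (ties go +).
    """
    links = []
    cur = list(src)
    for d in range(2):
        n = dims[d]
        delta = dst[d] - cur[d]
        if torus:
            delta %= n
            if delta > n - delta:
                delta -= n  # going backwards is shorter
        step = 1 if delta > 0 else -1
        for _ in range(abs(delta)):
            nxt = list(cur)
            nxt[d] = (cur[d] + step) % n if torus else cur[d] + step
            links.append((tuple(cur), tuple(nxt)))
            cur = nxt
    return links

def dilation_and_congestion(comm, mapping, dims, torus=False):
    """Return (maximum dilation, maximum congestion) of a block-to-PE mapping."""
    load = {}
    max_dil = 0
    for (u, v), w in comm.items():
        path = xy_route(mapping[u], mapping[v], dims, torus)
        max_dil = max(max_dil, len(path))
        for link in path:
            load[link] = load.get(link, 0) + w
    return max_dil, max(load.values(), default=0)

# Four blocks on a 4 x 1 line of PEs, compared as a grid and as a torus.
comm = {(0, 3): 1, (1, 2): 1, (0, 1): 1}
mapping = {i: (i, 0) for i in range(4)}
print(dilation_and_congestion(comm, mapping, (4, 1)))
print(dilation_and_congestion(comm, mapping, (4, 1), torus=True))
```

The wrap-around links of a torus shorten long routes, which lowers both measures for the same mapping; this is why results for grids and tori can differ.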
09061 Abstracts Collection -- Combinatorial Scientific Computing
From 01.02.2009 to 06.02.2009, the Dagstuhl Seminar 09061 "Combinatorial Scientific Computing" was held in Schloss Dagstuhl -- Leibniz Center for Informatics.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available.
Combining spectral sequencing and parallel simulated annealing for the MinLA problem
In this paper we present and analyze new sequential and parallel
heuristics to approximate the Minimum Linear Arrangement problem
(MinLA). The heuristics consist of obtaining a first global solution
using Spectral Sequencing and improving it locally through Simulated
Annealing. To accelerate the annealing process, we present a
special neighborhood distribution that tends to favor moves with a high
probability of being accepted. We show how to use this
neighborhood to parallelize the Metropolis stage on distributed memory
machines by mapping partitions of the input graph to processors and
performing moves concurrently. The paper reports the results obtained
with this new heuristic when applied to a set of large graphs,
including graphs arising from finite element methods and graphs
arising from VLSI applications. Compared to other heuristics, the
new heuristic improves solution quality, decreases running time and
offers an excellent speedup when run on a commodity network of nine
personal computers.
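The two-phase structure (a global spectral ordering, then local annealing) can be sketched sequentially. Below, the Fiedler vector of the graph Laplacian is approximated by power iteration on c*I - L with the constant vector projected out, vertices are ordered by it, and the arrangement is then refined with Metropolis acceptance. The uniform random swap neighborhood and the cooling parameters are assumptions standing in for the paper's special neighborhood distribution, and the parallel Metropolis stage is not reproduced. All function names are hypothetical.

```python
import math
import random

def la_cost(edges, pos):
    """Linear arrangement cost: sum of |pos[u] - pos[v]| over all edges."""
    return sum(abs(pos[u] - pos[v]) for u, v in edges)

def adjacency(n, edges):
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def spectral_sequence(n, edges, iters=500):
    """Order vertices by an approximate Fiedler vector of the Laplacian L."""
    adj = adjacency(n, edges)
    c = 2 * max(len(a) for a in adj) + 1  # exceeds the largest eigenvalue of L
    x = [float(i) for i in range(n)]      # deterministic start vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]       # project out the constant eigenvector
        # y = (c*I - L) x, where (L x)_i = deg(i) * x_i - sum of neighbor values.
        y = [c * x[i] - len(adj[i]) * x[i] + sum(x[j] for j in adj[i])
             for i in range(n)]
        norm = math.sqrt(sum(t * t for t in y)) or 1.0
        x = [t / norm for t in y]
    pos = [0] * n
    for p, v in enumerate(sorted(range(n), key=lambda i: x[i])):
        pos[v] = p
    return pos

def anneal(edges, pos, t0=2.0, alpha=0.999, steps=20000, seed=0):
    """Refine an arrangement by simulated annealing with uniform random swaps.

    Returns (best_pos, best_cost).
    """
    rng = random.Random(seed)
    n = len(pos)
    adj = adjacency(n, edges)
    pos = list(pos)
    cost = la_cost(edges, pos)
    best, best_cost = list(pos), cost
    t = t0

    def local(v):
        # Cost contribution of the edges incident to v.
        return sum(abs(pos[v] - pos[w]) for w in adj[v])

    for _ in range(steps):
        u, v = rng.sample(range(n), 2)
        before = local(u) + local(v)
        pos[u], pos[v] = pos[v], pos[u]
        delta = local(u) + local(v) - before  # an edge (u, v) cancels out
        # Metropolis rule: accept improvements, accept worse moves with exp(-delta/t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cost += delta
            if cost < best_cost:
                best, best_cost = list(pos), cost
        else:
            pos[u], pos[v] = pos[v], pos[u]   # reject: undo the swap
        t *= alpha
    return best, best_cost

# 4 x 4 grid graph: spectral start, then annealing.
grid = [(y * 4 + x, y * 4 + x + 1) for y in range(4) for x in range(3)]
grid += [(y * 4 + x, (y + 1) * 4 + x) for y in range(3) for x in range(4)]
start = spectral_sequence(16, grid)
print(la_cost(grid, start), anneal(grid, start, steps=5000)[1])
```

Starting the annealer from the spectral ordering means it only has to make local corrections, which is the motivation for combining the two methods.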