32,554 research outputs found
Software for mesh partitioning
Graph partitioning is a fundamental problem in many scientific contexts. Algorithms that find a good partitioning of highly unstructured graphs are critical for developing efficient solutions for a wide range of problems in many application areas on both serial and parallel computers. For example, large‐scale numerical simulations on parallel computers, such as those based on finite element methods; require the distribution of the finite element mesh to the processors. This distribution must be done so that the number of elements assigned to each processor is the same, and the number of adjacent elements assigned to different processors is minimized. The goal of the first condition is to balance the computations among the processors. The goal of the second condition is to minimize the communication resulting from the placement of adjacent elements to different processors. Graph partitioning can be used to successfully satisfy these conditions by first modeling the finite element mesh by a graph, and then partitioning it into equal parts. Software for graph partitioning is widely available. This document contains a brief summary of the most famous ones and a detailed explanation of MeTis Graph Partitioning Software and a Matlab toolbox, justified in the following section the choice of this software
Software for mesh partitioning
Graph partitioning is a fundamental problem in many scientific contexts. Algorithms that find a good partitioning of highly unstructured graphs are critical for developing efficient solutions for a wide range of problems in many application areas on both serial and parallel computers. For example, large‐scale numerical simulations on parallel computers, such as those based on finite element methods; require the distribution of the finite element mesh to the processors. This distribution must be done so that the number of elements assigned to each processor is the same, and the number of adjacent elements assigned to different processors is minimized. The goal of the first condition is to balance the computations among the processors. The goal of the second condition is to minimize the communication resulting from the placement of adjacent elements to different processors. Graph partitioning can be used to successfully satisfy these conditions by first modeling the finite element mesh by a graph, and then partitioning it into equal parts. Software for graph partitioning is widely available. This document contains a brief summary of the most famous ones and a detailed explanation of MeTis Graph Partitioning Software and a Matlab toolbox, justified in the following section the choice of this software
Achieving High Speed CFD simulations: Optimization, Parallelization, and FPGA Acceleration for the unstructured DLR TAU Code
Today, large scale parallel simulations are fundamental tools to handle complex problems. The number of processors in current computation platforms has been recently increased and therefore it is necessary to optimize the application performance and to enhance the scalability of massively-parallel systems. In addition, new heterogeneous architectures, combining conventional processors with specific hardware, like FPGAs, to accelerate the most time consuming functions are considered as a strong alternative to boost the performance.
In this paper, the performance of the DLR TAU code is analyzed and optimized. The improvement of the code efficiency is addressed through three key activities: Optimization, parallelization and hardware acceleration. At first, a profiling analysis of the most time-consuming processes of the Reynolds Averaged Navier Stokes flow solver on a three-dimensional unstructured mesh is performed. Then, a study of the code scalability with new partitioning algorithms are tested to show the most suitable partitioning algorithms for the selected applications. Finally, a feasibility study on the application of FPGAs and GPUs for the hardware acceleration of CFD simulations is presented
Scalable Graph Convolutional Network Training on Distributed-Memory Systems
Graph Convolutional Networks (GCNs) are extensively utilized for deep
learning on graphs. The large data sizes of graphs and their vertex features
make scalable training algorithms and distributed memory systems necessary.
Since the convolution operation on graphs induces irregular memory access
patterns, designing a memory- and communication-efficient parallel algorithm
for GCN training poses unique challenges. We propose a highly parallel training
algorithm that scales to large processor counts. In our solution, the large
adjacency and vertex-feature matrices are partitioned among processors. We
exploit the vertex-partitioning of the graph to use non-blocking point-to-point
communication operations between processors for better scalability. To further
minimize the parallelization overheads, we introduce a sparse matrix
partitioning scheme based on a hypergraph partitioning model for full-batch
training. We also propose a novel stochastic hypergraph model to encode the
expected communication volume in mini-batch training. We show the merits of the
hypergraph model, previously unexplored for GCN training, over the standard
graph partitioning model which does not accurately encode the communication
costs. Experiments performed on real-world graph datasets demonstrate that the
proposed algorithms achieve considerable speedups over alternative solutions.
The optimizations achieved on communication costs become even more pronounced
at high scalability with many processors. The performance benefits are
preserved in deeper GCNs having more layers as well as on billion-scale graphs.Comment: To appear in PVLDB'2
Large constraint length high speed viterbi decoder based on a modular hierarchial decomposition of the deBruijn graph
A method of formulating and packaging decision-making elements into a long constraint length Viterbi decoder which involves formulating the decision-making processors as individual Viterbi butterfly processors that are interconnected in a deBruijn graph configuration. A fully distributed architecture, which achieves high decoding speeds, is made feasible by novel wiring and partitioning of the state diagram. This partitioning defines universal modules, which can be used to build any size decoder, such that a large number of wires is contained inside each module, and a small number of wires is needed to connect modules. The total system is modular and hierarchical, and it implements a large proportion of the required wiring internally within modules and may include some external wiring to fully complete the deBruijn graph. pg,14
Recommended from our members
Automatic data/program partitioning using the single assignment principle
Loosely-coupled MIMD architectures do not suffer from memory contention; hence large numbers of processors may be utilized. The main problem, however, is how to partition data and programs in order to exploit the available parallelism. In this paper we show that efficient schemes for automatic data/program partitioning and synchronization may be employed if single assignment is used. Using simulations of program loops common to scientific computations (the Livermore Loops), we demonstrate that only a small fraction of data accesses are remote and thus the degradation in network performance due to multiprocessing is minimal
A Hierarchical Partitioning Strategy for an Efficient Parallelization of the Multilevel Fast Multipole Algorithm
Cataloged from PDF version of article.We present a novel hierarchical partitioning strategy
for the efficient parallelization of the multilevel fast multipole algorithm
(MLFMA) on distributed-memory architectures to solve
large-scale problems in electromagnetics. Unlike previous parallelization
techniques, the tree structure of MLFMA is distributed
among processors by partitioning both clusters and samples
of fields at each level. Due to the improved load-balancing, the
hierarchical strategy offers a higher parallelization efficiency than
previous approaches, especially when the number of processors
is large. We demonstrate the improved efficiency on scattering
problems discretized with millions of unknowns. In addition, we
present the effectiveness of our algorithm by solving very large
scattering problems involving a conducting sphere of radius 210
wavelengths and a complicated real-life target with a maximum
dimension of 880 wavelengths. Both of the objects are discretized
with more than 200 million unknowns
- …