358 research outputs found
Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph
partitioning together with applications and future research directions
Distributed Memory, GPU Accelerated Fock Construction for Hybrid, Gaussian Basis Density Functional Theory
With the growing reliance of modern supercomputers on accelerator-based
architectures such a GPUs, the development and optimization of electronic
structure methods to exploit these massively parallel resources has become a
recent priority. While significant strides have been made in the development of
GPU accelerated, distributed memory algorithms for many-body (e.g.
coupled-cluster) and spectral single-body (e.g. planewave, real-space and
finite-element density functional theory [DFT]), the vast majority of
GPU-accelerated Gaussian atomic orbital methods have focused on shared memory
systems with only a handful of examples pursuing massive parallelism on
distributed memory GPU architectures. In the present work, we present a set of
distributed memory algorithms for the evaluation of the Coulomb and
exact-exchange matrices for hybrid Kohn-Sham DFT with Gaussian basis sets via
direct density-fitted (DF-J-Engine) and seminumerical (sn-K) methods,
respectively. The absolute performance and strong scalability of the developed
methods are demonstrated on systems ranging from a few hundred to over one
thousand atoms using up to 128 NVIDIA A100 GPUs on the Perlmutter
supercomputer.Comment: 45 pages, 9 figure
Optimization of Computationally and I/O Intense Patterns in Electronic Structure and Machine Learning Algorithms
Development of scalable High-Performance Computing (HPC) applications is already a challenging task even in the
pre-Exascale era. Utilization of the full potential of (near-)future supercomputers will most likely require the mastery
of massively parallel heterogeneous architectures with multi-tier persistence systems, ideally in fault tolerant mode.
With the change in hardware architectures HPC applications are also widening their scope to `Big data' processing and
analytics using machine learning algorithms and neural networks. In this work, in cooperation with the INTERTWinE
FET-HPC project, we demonstrate how the GASPI (Global Address Space Programming Interface) programming model
helps to address these Exascale challenges on examples of tensor contraction, K-means and Terasort algorithms
Scalable High-Quality Graph and Hypergraph Partitioning
The balanced hypergraph partitioning problem (HGP) asks for a partition of the node set
of a hypergraph into blocks of roughly equal size, such that an objective function defined
on the hyperedges is minimized. In this work, we optimize the connectivity metric,
which is the most prominent objective function for HGP.
The hypergraph partitioning problem is NP-hard and there exists no constant factor approximation.
Thus, heuristic algorithms are used in practice with the multilevel scheme as
the most successful approach to solve the problem: First, the input hypergraph is coarsened to
obtain a hierarchy of successively smaller and structurally similar approximations.
The smallest hypergraph is then initially partitioned into blocks, and subsequently,
the contractions are reverted level-by-level, and, on each level, local search algorithms are used
to improve the partition (refinement phase).
In recent years, several new techniques were developed for sequential multilevel partitioning
that substantially improved solution quality at the cost of an increased running time.
These developments divide the landscape of existing partitioning algorithms into systems that either aim for
speed or high solution quality with the former often being more than an order of magnitude faster
than the latter. Due to the high running times of the best sequential algorithms, it is currently not
feasible to partition the largest real-world hypergraphs with the highest possible quality.
Thus, it becomes increasingly important to parallelize the techniques used in these algorithms.
However, existing state-of-the-art parallel partitioners currently do not achieve the same solution
quality as their sequential counterparts because they use comparatively weak components that are easier to parallelize.
Moreover, there has been a recent trend toward simpler methods for partitioning large hypergraphs
that even omit the multilevel scheme.
In contrast to this development, we present two shared-memory multilevel hypergraph partitioners
with parallel implementations of techniques used by the highest-quality sequential systems.
Our first multilevel algorithm uses a parallel clustering-based coarsening scheme,
which uses substantially fewer locking mechanisms than previous approaches.
The contraction decisions are guided by the community structure of the input hypergraph
obtained via a parallel community detection algorithm.
For initial partitioning, we implement parallel multilevel recursive bipartitioning with a
novel work-stealing approach and a portfolio of initial bipartitioning techniques to
compute an initial solution. In the refinement phase, we use three different parallel improvement
algorithms: label propagation refinement, a highly-localized direct -way
FM algorithm, and a novel parallelization of flow-based refinement.
These algorithms build on our highly-engineered partition data structure, for which we propose
several novel techniques to compute accurate gain values of node moves in the parallel setting.
Our second multilevel algorithm parallelizes the -level partitioning scheme used in
the highest-quality sequential partitioner KaHyPar. Here, only a single node
is contracted on each level, leading to a hierarchy with approximately levels where
is the number of nodes. Correspondingly, in each refinement step, only a single node is uncontracted, allowing
a highly-localized search for improvements.
We show that this approach, which seems inherently sequential, can be parallelized efficiently without compromises in solution quality.
To this end, we design a forest-based representation of contractions from which we derive a feasible parallel
schedule of the contraction operations that we apply on a novel dynamic hypergraph data structure on-the-fly.
In the uncoarsening phase, we decompose the contraction forest into batches, each containing
a fixed number of nodes. We then uncontract each batch in parallel and use highly-localized
versions of our refinement algorithms to improve the partition around the uncontracted nodes.
We further show that existing sequential partitioning algorithms considerably struggle to find balanced partitions
for weighted real-world hypergraphs. To this end, we present a technique that enables partitioners based on recursive
bipartitioning to reliably compute balanced solutions. The idea is to preassign a small portion of the
heaviest nodes to one of the two blocks of each bipartition and optimize the objective function on the
remaining nodes. We integrated the approach into the sequential hypergraph partitioner KaHyPar
and show that our new approach can compute balanced solutions for all tested instances without negatively affecting the solution
quality and running time of KaHyPar.
In our experimental evaluation, we compare our new shared-memory (hyper)graph partitioner Mt-KaHyPar
to different graph and hypergraph partitioners on over (hyper)graphs with up to two billion edges/pins.
The results indicate that already our fastest configuration outperforms almost all existing
hypergraph partitioners with regards to both solution quality and running time. Our highest-quality configuration
(-level with flow-based refinement) achieves the same solution quality as the currently best
sequential partitioner KaHyPar, while being almost an order of magnitude faster with ten threads.
In addition, we optimize our data structures for graph partitioning, which improves the running times of both multilevel partitioners by
almost a factor of two for graphs. As a result, Mt-KaHyPar also outperforms most of the existing
graph partitioning algorithms. While the shared-memory graph partitioner KaMinPar is still faster than
Mt-KaHyPar, its produced solutions are worse by in the median. The best sequential graph
partitioner KaFFPa-StrongS computes slightly better partitions than Mt-KaHyPar
(median improvement is ), but is more than an order of magnitude slower on average
Research summary, January 1989 - June 1990
The Research Institute for Advanced Computer Science (RIACS) was established at NASA ARC in June of 1983. RIACS is privately operated by the Universities Space Research Association (USRA), a consortium of 62 universities with graduate programs in the aerospace sciences, under a Cooperative Agreement with NASA. RIACS serves as the representative of the USRA universities at ARC. This document reports our activities and accomplishments for the period 1 Jan. 1989 - 30 Jun. 1990. The following topics are covered: learning systems, networked systems, and parallel systems
- …