Parallel and Flow-Based High Quality Hypergraph Partitioning
Balanced hypergraph partitioning is a classic NP-hard optimization problem that is a fundamental tool in such diverse disciplines as VLSI circuit design, route planning, sharding distributed databases, optimizing communication volume in parallel computing, and accelerating the simulation of quantum circuits.
Given a hypergraph and an integer k, the task is to divide the vertices into k disjoint blocks with bounded size, while minimizing an objective function on the hyperedges that span multiple blocks.
In this dissertation we consider the most commonly used objective, the connectivity metric, where we aim to minimize the number of different blocks connected by each hyperedge.
The most successful heuristic for balanced partitioning is the multilevel approach, which consists of three phases.
In the coarsening phase, vertex clusters are contracted to obtain a sequence of structurally similar but successively smaller hypergraphs.
Once sufficiently small, an initial partition is computed.
Lastly, the contractions are successively undone in reverse order, and an iterative improvement algorithm is employed to refine the projected partition on each level.
An important aspect in designing practical heuristics for optimization problems is the trade-off between solution quality and running time.
The appropriate trade-off depends on the specific application, the size of the data sets, and the computational resources available to solve the problem.
Existing algorithms are either slow, sequential and offer high solution quality, or are simple, fast, easy to parallelize, and offer low quality.
While this trade-off cannot be avoided entirely, our goal is to close the gaps as much as possible.
We achieve this by improving the state of the art in all non-trivial areas of the trade-off landscape with only a few techniques, but employed in two different ways.
Furthermore, most research on parallelization has focused on distributed memory, which neglects the greater flexibility of shared-memory algorithms and the wide availability of commodity multi-core machines.
In this thesis, we therefore design and revisit fundamental techniques for each phase of the multilevel approach, and develop highly efficient shared-memory parallel implementations thereof.
We consider two iterative improvement algorithms, one based on the Fiduccia-Mattheyses (FM) heuristic, and one based on label propagation.
For these, we propose a variety of techniques to improve the accuracy of gains when moving vertices in parallel, as well as low-level algorithmic improvements.
For coarsening, we present a parallel variant of greedy agglomerative clustering with a novel method to resolve cluster join conflicts on-the-fly.
Combined with a preprocessing phase for coarsening based on community detection, a portfolio of from-scratch partitioning algorithms, as well as recursive partitioning with work-stealing, we obtain our first parallel multilevel framework.
It is the fastest partitioner known, and achieves medium-high quality, beating all parallel partitioners, and is close to the highest quality sequential partitioner.
Our second contribution is a parallelization of an n-level approach, where only one vertex is contracted and uncontracted on each level.
This extreme approach aims at high solution quality via very fine-grained, localized refinement, but seems inherently sequential.
We devise an asynchronous n-level coarsening scheme based on a hierarchical decomposition of the contractions, as well as a batch-synchronous uncoarsening, and later fully asynchronous uncoarsening.
In addition, we adapt our refinement algorithms, and also use the preprocessing and portfolio.
This scheme is highly scalable, and achieves the same quality as the highest quality sequential partitioner (which is based on the same components), but is of course slower than our first framework due to fine-grained uncoarsening.
The last ingredient for high quality is an iterative improvement algorithm based on maximum flows.
In the sequential setting, we first improve an existing idea by solving incremental maximum flow problems, which leads to smaller cuts and is faster due to engineering efforts.
Subsequently, we parallelize the maximum flow algorithm and schedule refinements in parallel.
Beyond striving for the highest quality, we present a deterministically parallel partitioning framework.
We develop deterministic versions of the preprocessing, coarsening, and label propagation refinement.
Experimentally, we demonstrate that the penalties for determinism in terms of partition quality and running time are very small.
All of our claims are validated through extensive experiments, comparing our algorithms with state-of-the-art solvers on large and diverse benchmark sets.
To foster further research, we make our contributions available in our open-source framework Mt-KaHyPar.
While it seems inevitable that, with ever-increasing problem sizes, we must transition to distributed-memory algorithms, the study of shared-memory techniques is not in vain.
With the multilevel approach, even the inherently slow techniques have a role to play in fast systems, as they can be employed to boost quality on coarse levels at little expense.
Similarly, techniques for shared-memory parallelism are important, both as soon as a coarse graph fits into memory, and as local building blocks in the distributed algorithm.
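To make the objective in the abstract above concrete, the following is a minimal sketch of how the connectivity metric and the balance constraint could be evaluated for a given partition. It is not taken from Mt-KaHyPar. The function names, the list-of-pins hypergraph representation, the common "number of blocks minus one" convention, and the `(1 + epsilon) * ceil(n / k)` block-size limit are assumptions made for illustration.

```python
from math import ceil


def connectivity_objective(hyperedges, block_of, weights=None):
    """Sum over hyperedges of (lambda(e) - 1) * w(e).

    lambda(e) is the number of distinct blocks that hyperedge e touches.
    `hyperedges` is a list of pin lists; `block_of[v]` is the block of vertex v.
    The "-1" offset is an assumed (commonly used) convention, so that
    hyperedges lying entirely inside one block contribute nothing.
    """
    total = 0
    for i, pins in enumerate(hyperedges):
        blocks_touched = {block_of[v] for v in pins}
        weight = 1 if weights is None else weights[i]
        total += (len(blocks_touched) - 1) * weight
    return total


def is_balanced(block_of, k, epsilon):
    """Check that every block holds at most (1 + epsilon) * ceil(n / k) vertices.

    This size limit is an assumed form of the "bounded size" constraint.
    """
    n = len(block_of)
    limit = (1 + epsilon) * ceil(n / k)
    sizes = [0] * k
    for b in block_of:
        sizes[b] += 1
    return all(size <= limit for size in sizes)


# Hypothetical toy instance: 6 vertices, 4 hyperedges, 2 blocks.
example_hyperedges = [[0, 1, 2], [2, 3], [3, 4, 5], [0, 5]]
example_partition = [0, 0, 0, 1, 1, 1]
example_cost = connectivity_objective(example_hyperedges, example_partition)
example_ok = is_balanced(example_partition, k=2, epsilon=0.03)
```

In the toy instance, only the hyperedges `[2, 3]` and `[0, 5]` span both blocks, so each contributes one unit to the objective.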
Open Problems in (Hyper)Graph Decomposition
Large networks are useful in a wide range of applications. Sometimes problem
instances are composed of billions of entities. Decomposing and analyzing these
structures helps us gain new insights about our surroundings. Even if the final
application concerns a different problem (such as traversal, finding paths,
trees, and flows), decomposing large graphs is often an important subproblem
for complexity reduction or parallelization. This report is a summary of
discussions that happened at Dagstuhl seminar 23331 on "Recent Trends in Graph
Decomposition" and presents currently open problems and future directions in
the area of (hyper)graph decomposition.
Data Tiling for Sparse Computation
Many real-world data sets contain internal relationships. Efficient analysis of this relationship data is crucial for important problems including genome alignment, network vulnerability analysis, and ranking web pages, among others. Such relationship data is frequently sparse, and analysis on it is called sparse computation. We demonstrate that the important technique of data tiling is more powerful than previously known by broadening its application space. We focus on three important sparse computation areas: graph analysis, linear algebra, and bioinformatics. We demonstrate data tiling's power by addressing key issues and providing significant improvements, to both runtime and solution quality, in each area. For graph analysis, we focus on fast data tiling techniques that can produce well-structured tiles and demonstrate theoretical hardness results. These tiles are suitable for graph problems as they reduce data movement and ultimately improve end-to-end runtime performance. For linear algebra, we introduce a new cache-aware tiling technique and apply it to the key kernel of sparse matrix by sparse matrix multiplication. This technique tiles the second input matrix and then uses a small summary matrix to guide access to the tiles during computation. Our approach results in the fastest known implementation across three distinct CPU architectures. In bioinformatics, we develop a tiling-based de novo genome assembly pipeline. We start with reads and develop either a graph or hypergraph that captures internal relationships between reads. This is then tiled to minimize connections while maintaining balance. We then treat each resulting tile independently as the input to an existing, shared-memory assembler. Our pipeline improves existing state-of-the-art de novo genome assemblers and brings both runtime and quality improvements to them on both real-world and simulated datasets. Ph.D.
Jet: Multilevel Graph Partitioning on GPUs
The multilevel heuristic is the dominant strategy for high-quality sequential
and parallel graph partitioning. Partition refinement is a key step of
multilevel graph partitioning. In this work, we present Jet, a new parallel
algorithm for partition refinement specifically designed for Graphics
Processing Units (GPUs). We combine Jet with GPU-aware coarsening to develop a
k-way graph partitioner. The new partitioner achieves superior quality when
compared to state-of-the-art shared memory graph partitioners on a large
collection of test graphs. Comment: Submitted as a non-archival track paper for SIAM ACDA 202
Partitioning Hypergraphs is Hard: Models, Inapproximability, and Applications
We study the balanced k-way hypergraph partitioning problem, with a special
focus on its practical applications to manycore scheduling. Given a hypergraph
on n nodes, our goal is to partition the node set into k parts of size at
most each, while minimizing the cost of the
partitioning, defined as the number of cut hyperedges, possibly also weighted
by the number of partitions they intersect. We show that this problem cannot be
approximated to within a factor of the optimal
solution in polynomial time if the Exponential Time Hypothesis holds, even for
hypergraphs of maximal degree 2. We also study the hardness of the partitioning
problem from a parameterized complexity perspective, and in the more general
case when we have multiple balance constraints.
Furthermore, we consider two extensions of the partitioning problem that are
motivated from practical considerations. Firstly, we introduce the concept of
hyperDAGs to model precedence-constrained computations as hypergraphs, and we
analyze the adaptation of the balanced partitioning problem to this case.
Secondly, we study the hierarchical partitioning problem to model hierarchical
NUMA (non-uniform memory access) effects in modern computer architectures, and
we show that ignoring this hierarchical aspect of the communication cost can
yield significantly weaker solutions. Comment: Published in the 35th ACM Symposium on Parallelism in Algorithms and
Architectures (SPAA 2023)
FREIGHT: Fast Streaming Hypergraph Partitioning
Partitioning the vertices of a (hyper)graph into k roughly balanced blocks such that few (hyper)edges run between blocks is a key problem for large-scale distributed processing. A current trend for partitioning huge (hyper)graphs using low computational resources is the use of streaming algorithms. In this work, we propose FREIGHT: a Fast stREamInG Hypergraph parTitioning algorithm which is an adaptation of the widely-known graph-based algorithm Fennel. By using an efficient data structure, we make the overall running time of FREIGHT linearly dependent on the pin-count of the hypergraph and the memory consumption linearly dependent on the numbers of nets and blocks. The results of our extensive experimentation showcase the promising performance of FREIGHT as a highly efficient and effective solution for streaming hypergraph partitioning. Our algorithm demonstrates competitive running time with the Hashing algorithm, with a difference of a maximum factor of four observed on three fourths of the instances. Significantly, our findings highlight the superiority of FREIGHT over all existing (buffered) streaming algorithms and even the in-memory algorithm HYPE, with respect to both cut-net and connectivity measures. This indicates that our proposed algorithm is a promising hypergraph partitioning tool to tackle the challenge posed by large-scale and dynamic data processing.
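The one-pass idea behind streaming hypergraph partitioning can be illustrated with a short sketch. The code below is not the FREIGHT implementation and does not reproduce its data structure or its memory bound. It is a simplified greedy scheme in the spirit of Fennel-style scoring, where each arriving vertex goes to the block that already contains the most of its nets, minus a size penalty. The function name, the `alpha` and `gamma` parameters, the hard `capacity` limit, and the per-net block sets are all assumptions for illustration.

```python
def stream_partition(vertex_stream, k, capacity, alpha=0.1, gamma=1.5):
    """Assign each streamed vertex to one of k blocks in a single pass.

    `vertex_stream` yields (vertex, nets) pairs, where `nets` lists the ids of
    the hyperedges (nets) containing the vertex. The score of block b is the
    number of the vertex's nets that already have a pin in b, minus an assumed
    Fennel-style penalty alpha * gamma * size(b) ** (gamma - 1).
    """
    block_sizes = [0] * k
    net_blocks = {}   # net id -> set of blocks that already contain a pin
    assignment = {}   # vertex -> chosen block

    for vertex, nets in vertex_stream:
        # Count, per block, how many incident nets already touch it.
        gain = [0] * k
        for net in nets:
            for b in net_blocks.get(net, ()):
                gain[b] += 1

        best_block, best_score = None, None
        for b in range(k):
            if block_sizes[b] >= capacity:
                continue  # block is full, enforce the balance constraint
            score = gain[b] - alpha * gamma * block_sizes[b] ** (gamma - 1)
            if best_block is None or score > best_score:
                best_block, best_score = b, score

        if best_block is None:
            raise ValueError("all blocks are full; increase capacity")

        assignment[vertex] = best_block
        block_sizes[best_block] += 1
        for net in nets:
            net_blocks.setdefault(net, set()).add(best_block)

    return assignment


# Hypothetical toy stream: vertices 0 and 1 share net 0, vertices 2 and 3 share net 1.
toy_stream = [(0, [0]), (1, [0]), (2, [1]), (3, [1])]
toy_assignment = stream_partition(toy_stream, k=2, capacity=2)
```

Each vertex is seen exactly once and never moved afterwards, which is what keeps the resource requirements low compared with in-memory multilevel methods.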
A Comprehensive Review of Community Detection in Graphs
The study of complex networks has significantly advanced our understanding of
community structures, which serve as a crucial feature of real-world graphs.
Detecting communities in graphs is a challenging problem with applications in
sociology, biology, and computer science. Despite the efforts of an
interdisciplinary community of scientists, a satisfactory solution to this
problem has not yet been achieved. This review article delves into the topic of
community detection in graphs, which plays a crucial role in understanding
the organization and functioning of complex systems. We begin by introducing
the concept of community structure, which refers to the arrangement of vertices
into clusters, with strong internal connections and weaker connections between
clusters. Then, we provide a thorough exposition of various community detection
methods, including a new method designed by us. Additionally, we explore
real-world applications of community detection in diverse networks. In
conclusion, this comprehensive review provides a deep understanding of
community detection in graphs. It serves as a valuable resource for researchers
and practitioners in multiple disciplines, offering insights into the
challenges, methodologies, and applications of community detection in complex
networks.
Streaming, Local, and MultiLevel (Hyper)Graph Decomposition
(Hyper)Graph decomposition is a family of problems that aim to break down large (hyper)graphs into smaller sub(hyper)graphs for easier analysis. The importance of this lies in its ability to enable efficient computation on large and complex (hyper)graphs, such as social networks, chemical compounds, and computer networks. This dissertation explores several types of (hyper)graph decomposition problems, including graph partitioning, hypergraph partitioning, local graph clustering, process mapping, and signed graph clustering. Our main focus is on streaming algorithms, local algorithms and multilevel algorithms. In terms of streaming algorithms, we make contributions with highly efficient and effective algorithms for (hyper)graph partitioning and process mapping. In terms of local algorithms, we propose sub-linear algorithms which are effective in detecting high-quality local communities around a given seed node in a graph based on the distribution of a given motif. In terms of multilevel algorithms, we engineer high-quality multilevel algorithms for process mapping and signed graph clustering. We provide a thorough discussion of each algorithm along with experimental results demonstrating their superiority over existing state-of-the-art techniques.
The results show that the proposed algorithms achieve improved performance and better solutions in various metrics, making them highly promising for practical applications. Overall, this dissertation showcases the effectiveness of advanced combinatorial algorithmic techniques in solving challenging (hyper)graph decomposition problems.
Exploiting data locality in cache-coherent NUMA systems
The end of Dennard scaling has caused a stagnation of the clock frequency in computers. To overcome this issue, in the last two decades vendors have been integrating larger numbers of processing elements in the systems, interconnecting many nodes, including multiple chips in the nodes and increasing the number of cores in each chip. The speed of main memory has not evolved at the same rate as processors; it is much slower, and there is a need to provide more total bandwidth to the processors, especially with the increase in the number of cores and chips.
Still keeping a shared address space, where all processors can access the whole memory, solutions have come by integrating more memories: by using newer technologies like high-bandwidth memories (HBM) and non-volatile memories (NVM), by giving groups of cores (like sockets, for example) faster access to some subset of the DRAM, or by combining many of these solutions. This has caused some heterogeneity in the access speed to main memory, depending on the CPU requesting access to a memory address and the actual physical location of that address, causing non-uniform memory access (NUMA) behaviours.
Moreover, many of these systems are cache-coherent (ccNUMA), meaning that changes in the memory done from one CPU must be visible to the other CPUs and transparent for the programmer.
These NUMA behaviours reduce the performance of applications and can pose a challenge to the programmers. To tackle this issue, this thesis proposes solutions, at the software and hardware levels, to improve the data locality in NUMA systems and, therefore, the performance of applications in these computer systems.
The first contribution shows how considering hardware prefetching simultaneously with thread and data placement in NUMA systems can find configurations with better performance than considering these aspects separately. The performance results combined with performance counters are then used to build a performance model to predict, both offline and online, the best configuration for new applications not in the model. The evaluation is done using two different high performance NUMA systems, and the performance counters collected in one machine are used to predict the best configurations in the other machine.
The second contribution builds on the idea that prefetching can have a strong effect in NUMA systems and proposes a NUMA-aware hardware prefetching scheme. This scheme is generic and can be applied to multiple hardware prefetchers with a low hardware cost while giving very good results. The evaluation is done using a cycle-accurate architectural simulator and provides detailed results of the performance, the data transfer reduction and the energy costs.
Finally, the third and last contribution consists in scheduling algorithms for task-based programming models. These programming models help improve the programmability of applications in parallel systems and also provide useful information to the underlying runtime system. This information is used to build a task dependency graph (TDG), a directed acyclic graph that models the application where the nodes are sequential pieces of code known as tasks and the edges are the data dependencies between the different tasks. The proposed scheduling algorithms use graph partitioning techniques and provide a scheduling for the tasks in the TDG that minimises the data transfers between the different NUMA regions of the system. The results have been evaluated in real ccNUMA systems with multiple NUMA regions. Postprint (published version)