    Parallel Graph Partitioning for Complex Networks

    Processing large complex networks such as social networks or web graphs has recently attracted considerable interest. To do this in parallel, we need to partition them into pieces of about equal size. Unfortunately, previous parallel graph partitioners, originally developed for more regular mesh-like networks, do not work well for these networks. This paper addresses this problem by parallelizing and adapting the label propagation technique originally developed for graph clustering. By introducing size constraints, label propagation becomes applicable to both the coarsening and the refinement phase of multilevel graph partitioning. We obtain very high quality by applying a highly parallel evolutionary algorithm to the coarsened graph. The resulting system is both more scalable and achieves higher quality than state-of-the-art systems like ParMetis or PT-Scotch. For large complex networks the performance differences are very large. For example, our algorithm can partition a web graph with 3.3 billion edges in less than sixteen seconds using 512 cores of a high performance cluster while producing a high quality partition. None of the competing systems can handle this graph on our system. Comment: Review article. Parallelization of our previous approach arXiv:1402.328
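
    The size-constrained label propagation idea described above can be sketched as follows. This is a minimal sequential illustration, not the parallel implementation from the paper: the function name, the unit edge weights, the fixed vertex order, and the `max_size` parameter are all assumptions made for the example.

```python
from collections import defaultdict

def size_constrained_label_propagation(adj, max_size, rounds=5):
    """adj: dict vertex -> list of neighbours (undirected, unit edge weights)."""
    label = {v: v for v in adj}            # every vertex starts in its own cluster
    size = defaultdict(int)
    for v in adj:
        size[v] = 1
    for _ in range(rounds):
        moved = False
        for v in sorted(adj):              # fixed order keeps the sketch deterministic
            # Count how strongly v is connected to each neighbouring cluster.
            strength = defaultdict(int)
            for u in adj[v]:
                strength[label[u]] += 1
            current = label[v]
            best, best_strength = current, strength.get(current, 0)
            for cluster, s in sorted(strength.items()):
                # A move is eligible only if the target cluster stays within the size bound.
                fits = size[cluster] + 1 <= max_size
                if cluster != current and s > best_strength and fits:
                    best, best_strength = cluster, s
            if best != current:
                size[current] -= 1
                size[best] += 1
                label[v] = best
                moved = True
        if not moved:                      # stop early once no vertex changes its label
            break
    return label
```

    On two triangles joined by a single edge, a bound of `max_size=3` stops either triangle from absorbing a vertex of the other, so each triangle ends up as one cluster.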

    Upper bounds on the bisection width of 3- and 4-regular graphs

    We derive new upper bounds on the bisection width of graphs which have a regular vertex degree. We show that the bisection width of sufficiently large 3-regular graphs with |V| vertices is at most (1/6+ε)|V|, ε>0. For the bisection width of sufficiently large 4-regular graphs we show an upper bound of (2/5+ε)|V|, ε>0.
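
    For readers unfamiliar with the quantity being bounded: the bisection width is the minimum number of edges crossing between two equal-size halves of the vertex set. The sketch below computes it by exhaustive search on a tiny 3-regular graph. It is only an illustration of the definition; the function name and the cube-graph example are assumptions of this sketch, and graphs this small say nothing about the asymptotic bounds, which hold for sufficiently large graphs.

```python
from itertools import combinations

def bisection_width(n, edges):
    """Minimum number of cut edges over all splits of vertices 0..n-1 into two halves."""
    assert n % 2 == 0, "this sketch only handles an even number of vertices"
    best = len(edges)
    # Fixing vertex 0 on one side avoids enumerating each bisection twice.
    for rest in combinations(range(1, n), n // 2 - 1):
        side = {0, *rest}
        cut = sum(1 for u, v in edges if (u in side) != (v in side))
        best = min(best, cut)
    return best

# 3-regular cube graph: vertices are 3-bit numbers, edges connect numbers differing in one bit.
cube_edges = [(u, u ^ (1 << b)) for u in range(8) for b in range(3) if u < (u ^ (1 << b))]
```

    The search is exponential in the number of vertices, so it is usable only for very small graphs.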

    Accelerating shape optimizing load balancing for parallel FEM simulations by algebraic multigrid

    We propose a load balancing heuristic for parallel adaptive finite element method (FEM) simulations. In contrast to most existing approaches, the heuristic focuses on good partition shapes rather than on minimizing the classical edge-cut metric. By applying Algebraic Multigrid (AMG), we are able to speed up the two most time consuming calculations of the approach while maintaining its large amount of natural parallelism.

    Domain decomposition strategies for Grid environments

    In this work we are interested in running finite element numerical simulations with explicit time integration using Grid technology. Currently, explicit finite element simulations use domain decomposition with balanced partitions to distribute the data. However, this data distribution causes a significant performance degradation when explicit simulations are executed in Grid environments. This is mainly because a Grid environment has heterogeneous communications: very fast within a machine and very slow outside it. As a result, a balanced data distribution runs at the speed of the slowest communications. To overcome this problem we propose overlapping remote communication time with computation time. To do so, we dedicate some processors to managing the slowest communications and the rest to intensive computation. This data distribution scheme requires an unbalanced domain decomposition, so that the processors dedicated to managing the slow communications have hardly any computational load. In this work, different strategies for distributing the data and improving application performance in Grid environments have been proposed and analyzed. The static distribution strategies analyzed are:
    1. U-1domains: Initially, the data domain is divided proportionally among the machines according to their relative speed. Then, on each machine, the data are divided into nprocs-1 parts, where nprocs is the total number of processors of the machine. Each subdomain is assigned to one processor, and each machine has a single processor for managing remote communications with other machines.
    2. U-Bdomains: The data partitioning is done in two phases. The first phase is equivalent to the one performed for the U-1domains distribution. The second phase proportionally divides each data subdomain into nprocs-B parts, where B is the number of remote communications with other machines (special domains). Each machine has more than one processor for managing remote communications.
    3. U-CBdomains: In this distribution, as many special domains are created as there are remote communications. However, the special domains are now assigned to a single processor within the machine. Thus, each data subdomain is divided into nprocs-1 parts. Remote communications are managed concurrently using threads.
    To evaluate application performance on Grid environments we use Dimemas. For each case, we evaluate application performance on different environments and mesh types. The results show that:
    · The U-1domains distribution reduces execution times by up to 45% compared with the balanced distribution. However, this distribution is not effective for Grid environments composed of a large number of remote machines.
    · The U-Bdomains distribution proves to be more efficient, reducing execution time by up to 53%. However, its scalability is moderate, because it can end up with a large number of processors that perform no intensive computation and only manage remote communications. As a limit, this distribution can only be applied if more than 50% of the processors in a machine perform computation.
    · The U-CBdomains distribution reduces execution times by up to 30%, but is not as effective as the U-Bdomains distribution. However, this distribution increases processor utilization by 50%, that is, it reduces the number of idle processors.
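
    The two-step U-1domains split (data shared among machines by relative speed, then divided over nprocs-1 compute processors per machine) can be sketched as below. This is an illustrative sketch only: the function name `u1_domains`, the integer relative speeds, the rounding policy, and the choice to give the communication processor no data at all are assumptions, not details taken from the work.

```python
def u1_domains(num_elements, machines):
    """machines: list of (relative_speed, nprocs) pairs with integer speeds and nprocs >= 2.

    Returns, per machine, the list of subdomain sizes for its compute processors.
    """
    total_speed = sum(speed for speed, _ in machines)
    plan = []
    assigned = 0
    for i, (speed, nprocs) in enumerate(machines):
        # Phase 1: each machine's share is proportional to its relative speed.
        if i == len(machines) - 1:
            share = num_elements - assigned      # last machine absorbs rounding leftovers
        else:
            share = num_elements * speed // total_speed
        assigned += share
        # Phase 2: split the share over nprocs - 1 compute processors;
        # the remaining processor is reserved for remote communication.
        workers = nprocs - 1
        base, extra = divmod(share, workers)
        plan.append([base + 1 if w < extra else base for w in range(workers)])
    return plan
```

    For 1000 elements on two machines with relative speeds 1 and 3 and with 4 and 5 processors, the first machine receives 250 elements over 3 compute processors and the second 750 elements over 4.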

    Quality Matching and Local Improvement for Multilevel Graph-Partitioning

    Multilevel strategies have proven to be very powerful approaches for partitioning graphs efficiently. Their efficiency is dominated by two parts: the coarsening and the local improvement strategies. Several methods have been developed to solve these problems, but their efficiency has only been proven on an experimental basis. In this paper we present new and efficient methods for both problems, while satisfying certain quality measurements. For the coarsening part we develop a new approximation algorithm for maximum weighted matching in general edge-weighted graphs. It calculates a matching with an edge weight of at least 1/2 of the edge weight of a maximum weighted matching. Its time complexity is O(|E|), with |E| being the number of edges in the graph. Furthermore, we use the Helpful-Set strategy for the local improvement of partitions. For partitioning graphs with a regular degree of 2k into 2 parts, it guarantees an upper bound of ((k-1)/2)|V| + 1 on the cut size of th..
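
    To make the 1/2 guarantee concrete, the sketch below uses the classic sort-based greedy matching, which is a different and simpler technique than the linear-time algorithm of the paper: it also returns a matching of at least half the maximum weight, but sorting makes it O(|E| log |E|) rather than O(|E|). The function name and the edge-tuple format are assumptions of this example.

```python
def greedy_matching(edges):
    """edges: iterable of (u, v, weight) with u != v. Returns the list of matched edges."""
    matched = set()
    matching = []
    # Heaviest edges first; ties are broken by the endpoints for determinism.
    for u, v, w in sorted(edges, key=lambda e: (-e[2], e[0], e[1])):
        if u not in matched and v not in matched:
            matching.append((u, v, w))
            matched.update((u, v))
    return matching
```

    On the path 0-1-2-3 with edge weights 2, 3, 2, the greedy picks only the middle edge (weight 3), while the optimum takes the two outer edges (weight 4). The result is within the factor of 1/2 but not optimal.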

    Factorization of sparse, positive definite matrices

    by Jürgen Schulze. Paderborn, Univ., Diss., 200

    Parallel and External High Quality Graph Partitioning

    Partitioning graphs into k blocks of roughly equal size such that few edges run between the blocks is a key tool for processing and analyzing large complex real-world networks. The graph partitioning problem has many practical applications in parallel and distributed computation, data storage, image processing, VLSI physical design, and more. Furthermore, the size, variety, and structural complexity of real-world networks have recently grown dramatically. Therefore, there is a demand for efficient graph partitioning algorithms that fully utilize the computational power and memory capacity of modern machines. A popular and successful heuristic for computing high-quality partitions of large networks in reasonable time is the multi-level graph partitioning approach, which contracts the graph while preserving its structure and then partitions it using a complex graph partitioning algorithm. Specifically, the multi-level graph partitioning approach consists of three main phases: coarsening, initial partitioning, and uncoarsening. During the coarsening phase, the graph is recursively contracted, preserving its structure and properties, until it is small enough for an initial partition to be computed during the initial partitioning phase. Afterwards, during the uncoarsening phase, the partition of the contracted graph is projected onto the original graph and refined using, for example, local search. Most of the research on heuristic graph partitioning focuses on sequential algorithms or parallel algorithms in the distributed memory model. Unfortunately, previous approaches to graph partitioning are not able to process large networks and rarely take into account several aspects of modern computational machines. Specifically, the number of cores per chip grows each year, and the price of RAM falls more slowly than real-world graphs grow. 
Since HDDs and SSDs are 50 to 400 times cheaper than RAM, external memory makes it possible to process large real-world graphs for a reasonable price. Therefore, in order to better utilize contemporary computational machines, we develop efficient multi-level graph partitioning algorithms for the shared-memory and the external memory models. First, we present an approach to shared-memory parallel multi-level graph partitioning that guarantees balanced solutions, shows high speed-ups for a variety of large graphs, and yields very good quality independently of the number of cores used. Important ingredients include parallel label propagation for both coarsening and uncoarsening, parallel initial partitioning, a simple yet effective approach to parallel localized local search, and fast locality-preserving hash tables that utilize caches effectively. The main idea of the parallel localized local search is that each processor refines only a small area around a random vertex, reducing interactions between processors. For example, on 79 cores, our algorithm partitions a graph with more than 3 billion edges into 16 blocks, cutting 4.5% fewer edges than the closest competitor while being more than two times faster. Furthermore, another competitor is not able to partition this graph. We then present an approach to external memory graph partitioning that is able to partition large graphs that do not fit into RAM. Specifically, we consider the semi-external and the external memory model. In both models a data structure of size proportional to the number of edges does not fit into RAM. The difference is that the former model assumes that a data structure of size proportional to the number of vertices fits into RAM, whereas the latter assumes the opposite. 
We address the graph partitioning problem in both models by adapting the size-constrained label propagation technique for the semi-external model and by developing a size-constrained clustering algorithm based on graph coloring for the external memory model. Our semi-external size-constrained label propagation algorithm (or external memory clustering algorithm) can be used to compute graph clusterings and is a prerequisite for the (semi-)external graph partitioning algorithm. The algorithms are then used for both the coarsening and the uncoarsening phase of a multi-level algorithm to compute graph partitions. Our (semi-)external algorithm is able to partition and cluster huge complex networks with billions of edges on cheap commodity machines. Experiments demonstrate that the semi-external graph partitioning algorithm is scalable and can compute high quality partitions in time that is comparable to the running time of an efficient internal memory implementation. A parallelization of the algorithm in the semi-external model further reduces running times. Additionally, we develop a speed-up technique for hypergraph partitioning algorithms. Hypergraphs are an extension of graphs that allow a single edge to connect more than two vertices. Therefore, they describe models and processes more accurately and additionally allow more possibilities for improvement. Most multi-level hypergraph partitioning algorithms perform some computations on vertices and their sets of neighbors. Since these computations can be super-linear, they have a significant impact on the overall running time on large hypergraphs. Therefore, to further reduce the size of hyperedges, we develop a pin-sparsifier based on the min-hash technique that clusters vertices with similar neighborhoods. Vertices that belong to the same cluster are then substituted by one vertex, which is connected to their neighbors, thereby reducing the size of the hypergraph. 
Our algorithm sparsifies a hypergraph such that the resulting hypergraph can be partitioned significantly faster without loss in quality (or with insignificant loss). On average, KaHyPar with the sparsifier performs partitioning about 1.5 times faster while preserving solution quality if hyperedges are large. All aforementioned frameworks are publicly available.
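
    The min-hash clustering step mentioned above can be sketched as follows. This is a simplified illustration, not the sparsifier shipped with KaHyPar: the function name, the linear hash family modulo a prime, the number of hash functions, and the rule of grouping only vertices with identical signatures are assumptions of this sketch. The later step of replacing each cluster by a single vertex is not shown.

```python
import random
from collections import defaultdict

def minhash_clusters(incident_nets, num_hashes=4, seed=0):
    """incident_nets: dict vertex -> set of non-negative integer hyperedge ids.

    Groups vertices whose min-hash signatures over their incident hyperedges coincide.
    """
    prime = 2_147_483_647                  # 2^31 - 1; hyperedge ids are assumed smaller
    rng = random.Random(seed)              # seeded so the sketch is reproducible
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    buckets = defaultdict(list)
    for v, nets in incident_nets.items():
        # Signature entry i is the minimum of hash function i over the vertex's hyperedges.
        signature = tuple(
            min(((a * e + b) % prime for e in nets), default=-1) for a, b in coeffs
        )
        buckets[signature].append(v)
    return sorted(sorted(group) for group in buckets.values())
```

    Vertices with identical hyperedge sets always share a signature. For two different sets, a single hash function agrees with probability roughly equal to their Jaccard similarity, so using more hash functions makes accidental merges of dissimilar vertices less likely.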