16 research outputs found

    HARP: A Dynamic Inertial Spectral Partitioner

    Get PDF
    Partitioning unstructured graphs is central to the parallel solution of computational science and engineering problems. Spectral partitioners, such recursive spectral bisection (RSB), have proven effecfive in generating high-quality partitions of realistically-sized meshes. The major problem which hindered their wide-spread use was their long execution times. This paper presents a new inertial spectral partitioner, called HARP. The main objective of the proposed approach is to quickly partition the meshes at runtime in a manner that works efficiently for real applications in the context of distributed-memory machines. The underlying principle of HARP is to find the eigenvectors of the unpartitioned vertices and then project them onto the eigerivectors of the original mesh. Results for various meshes ranging in size from 1000 to 100,000 vertices indicate that HARP can indeed partition meshes rapidly at runtime. Experimental results show that our largest mesh can be partitioned sequentially in only a few seconds on an SP2 which is several times faster than other spectral partitioners while maintaining the solution quality of the proven RSB method. A parallel WI version of HARP has also been implemented on IBM SP2 and Cray T3E. Parallel HARP, running on 64 processors SP2 and T3E, can partition a mesh containing more than 100,000 vertices into 64 subgrids in about half a second. These results indicate that graph partitioning can now be truly embedded in dynamically-changing real-world applications

    RIACS

    Get PDF
    Topics considered include: high-performance computing; cognitive and perceptual prostheses (computational aids designed to leverage human abilities); autonomous systems. Also included: development of a 3D unstructured grid code based on a finite volume formulation and applied to the Navier-stokes equations; Cartesian grid methods for complex geometry; multigrid methods for solving elliptic problems on unstructured grids; algebraic non-overlapping domain decomposition methods for compressible fluid flow problems on unstructured meshes; numerical methods for the compressible navier-stokes equations with application to aerodynamic flows; research in aerodynamic shape optimization; S-HARP: a parallel dynamic spectral partitioner; numerical schemes for the Hamilton-Jacobi and level set equations on triangulated domains; application of high-order shock capturing schemes to direct simulation of turbulence; multicast technology; network testbeds; supercomputer consolidation project

    Transparent load balancing of MPI programs using OmpSs-2@Cluster and DLB

    Get PDF
    Load imbalance is a long-standing source of inefficiency in high performance computing. The situation has only got worse as applications and systems increase in complexity, e.g., adaptive mesh refinement, DVFS, memory hierarchies, power and thermal management, and manufacturing processes. Load balancing is often implemented in the application, but it obscures application logic and may need extensive code refactoring. This paper presents an automated and transparent dynamic load balancing approach for MPI applications with OmpSs-2 tasks, which relieves applications from this burden. Only local and trivial changes are required to the application. Our approach exploits the ability of OmpSs-2@Cluster to offload tasks for execution on other nodes, and it reallocates compute resources among ranks using the Dynamic Load Balancing~(DLB) library. It employs LeWI to react to fine-grained load imbalances and DROM to address coarse-grained load imbalances by reserving cores on other nodes that can be reclaimed on demand. We use an expander graph to limit the amount of point-to-point communication and state. The results show 46% reduction in time-to-solution for micro-scale solid mechanics on 32 nodes and a 20% reduction beyond DLB for nn-body on 16 nodes, when one node is running slow. A synthetic benchmark shows that performance is within 10% of optimal for an imbalance of up to 2.0 on 8 nodes. All software is released open source.This research has received funding from the European Union’s Horizon 2020/EuroHPC research and innovation programme under grant agreement No 955606 (DEEP-SEA) and 754337 (EuroEXA). It is supported by the Spanish State Research Agency - Ministry of Science and Innovation (contract PID2019-107255GB and Ramon y Cajal fellowship RYC2018-025628-I) and by the Generalitat de Catalunya (2017-SGR-1414).Peer ReviewedPostprint (author's final draft

    High-Performance Graph Algorithms

    Get PDF
    Ever since Euler took a stroll across the bridges of a city then called Königsberg, relationships of entities have been modeled as graphs. Being a useful abstraction when the structure of relationships is the significant aspect of a problem, popular uses of graphs include the modeling of social networks, supply chain dependencies, the internet, customer preferences, street networks or the who-eats-whom (aka food network) in an ecosystem. In parallel computing, the assignment of sub-tasks to processes can massively influence the performance, since data dependencies between processes are significantly more expensive than within them. This scenario has been profitably modeled as a graph problem, with sub-tasks as vertices and their communication dependencies as edges. Many graphs are governed by an underlying geometry. Some are derived directly from a geometric object, such as street networks or meshes from spatial simulations. Others have a hidden geometry, in which no explicit geometry information is known, but the graph structure follows an underlying geometry. A suitable embedding into this geometry would then show a close relationship between graph distances and geometric distances. A subclass of graphs enjoying significant attention are complex networks. Though hard to define exactly, they are commonly characterized by a low diameter, heterogeneous structure, and a skewed degree distribution, often following a power law. The most famous examples include social networks, the hyperlink network and communication networks. Development of new analysis algorithms for complex networks is ongoing. Especially since the instances of interest are often in the size of billions of vertices, fast analysis algorithms and approximations are a natural focus of development. To accurately test and benchmark new developments, as well as to gain theoretical insight about network formation, generative graph models are required: A mathematical model describing a family of random graphs, from which instances can be sampled efficiently. Even if real test data is available, interesting instances are often in the size of terabytes, making storage and transmission inconvenient. While the underlying geometry of street networks is the spherical geometry they were built in, there is some evidence that the geometry best fitting to complex networks is not Euclidean or spherical, but hyperbolic. Based on this notion, Krioukov et al. proposed a generative graph model for complex networks, called Random Hyperbolic Graphs. They are created by setting vertices randomly within a disk in the hyperbolic plane and connecting pairs of vertices with a probability depending on their distance. An important subclass of this model, called Threshold Random Hyperbolic Graphs connects vertices exactly if the distances between vertices is below a threshold. This model has pleasant properties and has received considerable attention from theoreticians. Unfortunately, a straightforward generation algorithm has a complexity quadratic in the number of nodes, which renders it infeasible for instances of more than a few million vertices. We developed four faster generation algorithms for random hyperbolic graphs: By projecting hyperbolic geometry in the Euclidean geometry of the Poincaré disk model, we are able to use adapted versions of existing geometric data structures. Our first algorithm uses a polar quadtree to avoid distance calculations and achieves a time complexity of O((n3/2+m)logn)\mathcal{O}((n^{3/2} + m)\log n) whp. -- the first subquadratic generation algorithm for threshold random hyperbolic graphs. Empirically, our implementation achieves an improvement of three orders of magnitude over a reference implementation of the straightforward algorithm. We extend this quadtree data structure further for the generation of general random hyperbolic graphs, in which all edges are probabilistic. Since each edge has a non-zero probability of existing, sampling them by throwing a biased coin for each would again cost quadratic time complexity. We address this issue by sampling jumping widths within leaf cells and aggregating subtrees to virtual leaf cells when the expected number of neighbors in them is less than a threshold. With this tradeoff, we bound both the number of distance calculations and the number of examined quadtree cells per edge, resulting in the same time complexity of O((n3/2+m)logn)\mathcal{O}((n^{3/2} + m)\log n) also for general random hyperbolic graphs. We generalize this sampling scenario and define Probabilistic Neighborhood Queries, in which a random sample of a geometric point set is desired, with the probability of inclusion depending on the distance to a query point. Usable to simulate probabilistic spatial spreading, we show a significant speedup on a proof of concept disease simulation. Our second algorithm for threshold random hyperbolic graphs uses a data structure of concentric annuli in the hyperbolic plane. For each given vertex, the positions of possible neighbors in each band can be restricted with the hyperbolic law of cosines, leading to a much reduced number of candidates that need to be checked. This yields a reduced time complexity of O(nlog2n+m)\mathcal{O}(n \log^2 n + m) and a further order of magnitude in practice for graphs of a few million vertices. Finally, we extend also this data structure to general random hyperbolic graphs, with the same time complexity for constant parameters. The second part of my thesis is in many aspects the opposite of the first. Instead of a hidden geometry, I consider graphs whose geometric information is explicit. Instead of using it to generate graphs, I use their geometric information to decide how to cut them into pieces. Given a graph, the Graph Partitioning Problem asks for a disjoint partition of the vertex set so that each subset has a similar number of vertices and some objective function is optimized. Its many applications include parallelizing a computing task while balancing load and minimizing communication, dividing a graph into blocks as preparation for graph analysis tasks or finding natural cuts in street networks for efficient route planning. Since the graph partitioning problem is NP-complete and hard to approximate, heuristics are used in practice. Our first graph partitioner is designed for an application in quantum chemistry. Computing electron density fields is necessary to accurately predict protein interactions, but the required time scales quadratically with the protein\u27s size, with punitive constant factors. This effectively restricts these density-based methods to proteins of at most a few hundred amino acids. It is possible to circumvent this limitation by computing only parts of the target protein at a time, an approach known as subsystem quantum chemistry. However, the interactions between amino acids in different parts are then neglected; this neglect causes errors in the solution. We model this problem as partitioning a protein graph: Each vertex represents one amino acid, each edge an interaction between them and each edge weight the expected error caused by neglecting this interaction. Finding a set of subsets with minimum error is then equivalent to finding a partition of the protein graph with minimum edge cut. The requirements of the chemical simulations cause additional constraints on this partition, as well as an implied geometry by the protein structure. We provide an implementation of the well-known multilevel heuristic together with the local search algorithm by Fiduccia and Mattheyses, both adapted to respect these new constraints. We also provide an optimal dynamic programming algorithm for a restricted scenario with a smaller solution space. In terms of edge cut, we achieve an average improvement of 13.5% against the naive solution, which was previously used by domain scientists. Our second graph partitioner targets geometric meshes from numerical simulations in the scale of billions of grid points, parallelized to tens of thousands of processes. In general purpose graph partitioning, the multilevel heuristic computes high-quality partitions, but its scalability is limited. Due to the large halos in our targeted application, the shape of the partitioned blocks is also important, with convex shapes leading to less communication. We adapt the well-known kk-means algorithm, which yields convex shapes, to the partitioning of geometric meshes by including a balancing scheme. We further extend several existing geometric optimizations to the balanced version to achieve fast running times and parallel scalability. The classic kk-means algorithm is highly dependent on the choice of initial centers. We select initial centers among the input points along a space-filling curve, thus guaranteeing a similar distribution as the input points. The resulting implementation scales to tens of thousands of processes and billions of vertices, partitioning them in seconds. Compared to previous fast geometric partitioners, our method provides partitions with a 10-15% lower communication volume and also a corresponding smaller communication time in a distributed SpMV benchmark

    Hypergraph Partitioning in the Cloud

    Get PDF
    The thesis investigates the partitioning and load balancing problem which has many applications in High Performance Computing (HPC). The application to be partitioned is described with a graph or hypergraph. The latter is of greater interest as hypergraphs, compared to graphs, have a more general structure and can be used to model more complex relationships between groups of objects such as non-symmetric dependencies. Optimal graph and hypergraph partitioning is known to be NP-Hard but good polynomial time heuristic algorithms have been proposed. In this thesis, we propose two multi-level hypergraph partitioning algorithms. The algorithms are based on rough set clustering techniques. The first algorithm, which is a serial algorithm, obtains high quality partitionings and improves the partitioning cut by up to 71\% compared to the state-of-the-art serial hypergraph partitioning algorithms. Furthermore, the capacity of serial algorithms is limited due to the rapid growth of problem sizes of distributed applications. Consequently, we also propose a parallel hypergraph partitioning algorithm. Considering the generality of the hypergraph model, designing a parallel algorithm is difficult and the available parallel hypergraph algorithms offer less scalability compared to their graph counterparts. The issue is twofold: the parallel algorithm and the complexity of the hypergraph structure. Our parallel algorithm provides a trade-off between global and local vertex clustering decisions. By employing novel techniques and approaches, our algorithm achieves better scalability than the state-of-the-art parallel hypergraph partitioner in the Zoltan tool on a set of benchmarks, especially ones with irregular structure. Furthermore, recent advances in cloud computing and the services they provide have led to a trend in moving HPC and large scale distributed applications into the cloud. Despite its advantages, some aspects of the cloud, such as limited network resources, present a challenge to running communication-intensive applications and make them non-scalable in the cloud. While hypergraph partitioning is proposed as a solution for decreasing the communication overhead within parallel distributed applications, it can also offer advantages for running these applications in the cloud. The partitioning is usually done as a pre-processing step before running the parallel application. As parallel hypergraph partitioning itself is a communication-intensive operation, running it in the cloud is hard and suffers from poor scalability. The thesis also investigates the scalability of parallel hypergraph partitioning algorithms in the cloud, the challenges they present, and proposes solutions to improve the cost/performance ratio for running the partitioning problem in the cloud. Our algorithms are implemented as a new hypergraph partitioning package within Zoltan. It is an open source Linux-based toolkit for parallel partitioning, load balancing and data-management designed at Sandia National Labs. The algorithms are known as FEHG and PFEHG algorithms

    HPCCP/CAS Workshop Proceedings 1998

    Get PDF
    This publication is a collection of extended abstracts of presentations given at the HPCCP/CAS (High Performance Computing and Communications Program/Computational Aerosciences Project) Workshop held on August 24-26, 1998, at NASA Ames Research Center, Moffett Field, California. The objective of the Workshop was to bring together the aerospace high performance computing community, consisting of airframe and propulsion companies, independent software vendors, university researchers, and government scientists and engineers. The Workshop was sponsored by the HPCCP Office at NASA Ames Research Center. The Workshop consisted of over 40 presentations, including an overview of NASA's High Performance Computing and Communications Program and the Computational Aerosciences Project; ten sessions of papers representative of the high performance computing research conducted within the Program by the aerospace industry, academia, NASA, and other government laboratories; two panel sessions; and a special presentation by Mr. James Bailey

    Runtime support for load balancing of parallel adaptive and irregular applications

    Get PDF
    Applications critical to today\u27s engineering research often must make use of the increased memory and processing power of a parallel machine. While advances in architecture design are leading to more and more powerful parallel systems, the software tools needed to realize their full potential are in a much less advanced state. In particular, efficient, robust, and high-performance runtime support software is critical in the area of dynamic load balancing. While the load balancing of loosely synchronous codes, such as field solvers, has been studied extensively for the past 15 years, there exists a class of problems, known as asynchronous and highly adaptive , for which the dynamic load balancing problem remains open. as we discuss, characteristics of this class of problems render compile-time or static analysis of little benefit, and complicate the dynamic load balancing task immensely.;We make two contributions to this area of research. The first is the design and development of a runtime software toolkit, known as the Parallel Runtime Environment for Multi-computer Applications, or PREMA, which provides interprocessor communication, a global namespace, a framework for the implementation of customized scheduling policies, and several such policies which are prevalent in the load balancing literature. The PREMA system is designed to support coarse-grained domain decompositions with the goals of portability, flexibility, and maintainability in mind, so that developers will quickly feel comfortable incorporating it into existing codes and developing new codes which make use of its functionality. We demonstrate that the programming model and implementation are efficient and lead to the development of robust and high-performance applications.;Our second contribution is in the area of performance modeling. In order to make the most effective use of the PREMA runtime software, certain parameters governing its execution must be set off-line. Optimal values for these parameters may be determined through repeated executions of the target application; however, this is not always possible, particularly in large-scale environments and long-running applications. We present an analytic model that allows the user to quickly and inexpensively predict application performance and fine-tune applications built on the PREMA platform

    ISCR annual report FY 1998

    Full text link
    corecore