3 research outputs found

    Effective Large Scale Computing Software for Parallel Mesh Generation

    Scientists commonly turn to supercomputers or Clusters of Workstations with hundreds (even thousands) of nodes to generate meshes for large-scale simulations. Parallel mesh generation software is then used to decompose the original mesh generation problem into smaller sub-problems that can be solved (meshed) in parallel. The size of the final mesh is limited by the aggregate memory of the parallel machine. Also, requesting many compute nodes on a shared computing resource may result in a long wait, far surpassing the time it takes to solve the problem.

    These two problems (i.e., insufficient memory when computing on a small number of nodes, and long waiting times when using many nodes of a shared computing resource) can be addressed with out-of-core algorithms: algorithms that keep most of the dataset out-of-core (i.e., outside of memory, on disk) and load only a portion in-core (i.e., into memory) at a time.

    We explored two approaches to out-of-core computing. First, we present the traditional approach, which is to modify existing in-core algorithms to enable out-of-core computing. While we achieved good performance with this approach, the task is complex and labor intensive. As an alternative, we present a runtime system designed to support out-of-core applications. It requires little modification of existing in-core application code and still produces acceptable results. Evaluation of the runtime system showed little performance degradation while simplifying and shortening the development cycle of out-of-core applications. The overhead of using the runtime system for small problem sizes is between 12% and 41%, while the overlap of computation, communication, and disk I/O is above 50%, and as high as 61% for large problems.

    The main contribution of our work is the ability to utilize computing resources more effectively. The user has a choice of either solving larger problems, which otherwise would not be possible, or solving problems of the same size using fewer compute nodes, thus reducing the waiting time on shared clusters and supercomputers. We demonstrated that the latter can lead to substantially shorter wall-clock times.
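    The out-of-core pattern described above can be illustrated with a small sketch. This is not the dissertation's runtime system, only the general idea: the dataset stays on disk, one fixed-size block is resident in memory at a time, and a background thread prefetches the next block so disk I/O overlaps with computation. The file name, block size, and per-block "work" are illustrative assumptions.

# Minimal sketch of out-of-core processing with compute/I/O overlap.
# DATA_FILE and BLOCK_ELEMS are hypothetical; the real workload would be
# the in-core mesh generation step, not a simple sum.
import numpy as np
import threading
import queue

BLOCK_ELEMS = 1_000_000          # elements kept in-core at once (assumption)
DATA_FILE = "mesh_vertices.f64"  # hypothetical flat array of float64 on disk

def prefetch(path, block_elems, out_q):
    """Read the file block by block and hand blocks to the compute thread."""
    data = np.memmap(path, dtype=np.float64, mode="r")
    for start in range(0, data.shape[0], block_elems):
        # Copying forces the actual disk read to happen in this thread.
        out_q.put(np.array(data[start:start + block_elems]))
    out_q.put(None)  # sentinel: no more blocks

def process_out_of_core(path):
    q = queue.Queue(maxsize=2)   # at most 2 blocks in flight -> bounded memory
    io_thread = threading.Thread(target=prefetch, args=(path, BLOCK_ELEMS, q))
    io_thread.start()
    total = 0.0
    while (block := q.get()) is not None:
        total += block.sum()     # stand-in for the real in-core computation
    io_thread.join()
    return total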

    Load Balancing Scientific Applications

    The largest supercomputers have millions of independent processors, and concurrency levels are rapidly increasing. For ideal efficiency, developers of the simulations that run on these machines must ensure that computational work is evenly balanced among processors. Assigning work evenly is challenging because many large modern parallel codes simulate physical systems that evolve over time, so their workloads change as the simulation progresses. Furthermore, the cost of imbalanced load increases with scale: most large-scale scientific simulations today use a Single Program Multiple Data (SPMD) parallel programming model, and an increasing number of processors will wait for the slowest one at the synchronization points.

    To address load imbalance, many large-scale parallel applications use dynamic load balance algorithms to redistribute work evenly. The research objective of this dissertation is to develop methods to decide when and how to load balance the application, and to balance it effectively and affordably. We measure and evaluate the computational load of the application and develop strategies to decide when and how to correct the imbalance. Depending on the simulation, a fast, local load balance algorithm may be suitable, or a more sophisticated and expensive algorithm may be required. We developed a model for comparing load balance algorithms for a specific state of the simulation, which enables the selection of the balancing algorithm that will minimize overall runtime.

    Dynamic load balancing of parallel applications becomes more critical at scale, while also becoming more expensive. To make the load balance correction affordable at scale, we propose a lazy load balancing strategy that evaluates the imbalance and computes the new assignment of work to processes asynchronously with the main application computation. We decouple the load balance algorithm from the application and run it on potentially fewer, separate processors. In this Multiple Program Multiple Data (MPMD) configuration, the load balance algorithm can execute concurrently with the application and with higher parallel efficiency than if it were run on the same processors as the simulation. Work is reassigned lazily as the new assignments become available, and the application need not wait for the load balance algorithm to complete. We show that we can save resources by running a load balance algorithm at higher parallel efficiency on a smaller number of processors. Using our framework, we explore the trade-offs of load balancing configurations and demonstrate a performance improvement of up to 46%.
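    A minimal sketch of the kind of cost/benefit reasoning described above (not the dissertation's model): given the measured per-process loads, project the runtime of the remaining iterations under each candidate load balance algorithm, including the cost of running the balancer itself, and pick the cheapest option. The candidate names, costs, and residual-imbalance figures are illustrative assumptions.

# Toy model for choosing among load balance algorithms for the current
# simulation state; all balancer parameters below are made-up examples.
from dataclasses import dataclass

@dataclass
class Balancer:
    name: str
    cost: float       # seconds spent running the balancing algorithm
    residual: float   # imbalance ratio (max/mean load) it typically achieves

def projected_runtime(loads, balancer, remaining_iters):
    """Projected time for the remaining iterations if we use this balancer."""
    mean = sum(loads) / len(loads)
    # SPMD: every iteration costs as much as the most loaded process.
    per_iter_after = mean * balancer.residual
    return balancer.cost + per_iter_after * remaining_iters

def choose_balancer(loads, candidates, remaining_iters):
    do_nothing = max(loads) * remaining_iters          # keep current imbalance
    best_name, best_time = "none", do_nothing
    for b in candidates:
        t = projected_runtime(loads, b, remaining_iters)
        if t < best_time:
            best_name, best_time = b.name, t
    return best_name, best_time

# Example: a cheap local diffusion scheme vs. an expensive global repartitioner.
candidates = [Balancer("local-diffusion", cost=0.5, residual=1.15),
              Balancer("global-graph-partition", cost=10.0, residual=1.02)]
print(choose_balancer(loads=[1.0, 1.1, 1.9, 1.0], candidates=candidates,
                      remaining_iters=200))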

    Parallel generalized Delaunay mesh refinement

    The modeling of physical phenomena in computational fracture mechanics, computational fluid dynamics, and other fields is based on solving systems of partial differential equations (PDEs). When PDEs are defined over geometrically complex domains, they often do not admit closed-form solutions. In such cases, they are solved approximately using discretizations of the domains into simple elements such as triangles and quadrilaterals in two dimensions (2D), and tetrahedra and hexahedra in three dimensions (3D). These discretizations are called finite element meshes. Many applications, for example real-time computer-assisted surgery or crack propagation in fracture mechanics, impose time and/or mesh size constraints that cannot be met on a single sequential machine. As a result, the development of parallel mesh generation algorithms is required.

    In this dissertation, we describe a complete solution for both sequential and parallel construction of guaranteed-quality Delaunay meshes for 2D and 3D geometries. First, we generalize the existing 2D and 3D Delaunay refinement algorithms, along with theoretical proofs of mesh quality in terms of element shape and mesh gradation. Existing algorithms are constrained to just one or two specific positions for the insertion of a Steiner point inside the circumscribed disk of a poorly shaped element. We derive an entire 2D or 3D region for the selection of a Steiner point (i.e., infinitely many choices) inside the circumscribed disk.

    Second, we develop a novel theory which extends both the 2D and the 3D Generalized Delaunay Refinement methods for the concurrent and mathematically guaranteed independent insertion of Steiner points. Previous parallel algorithms are either reactive, relying on implementation heuristics to resolve dependencies in parallel mesh generation computations, or require the solution of a very difficult geometric optimization problem (the domain decomposition problem), which is still open for general 3D geometries. Our theory addresses both of these drawbacks.

    Third, using our generalization of both the sequential and the parallel algorithms, we implemented prototypes of practical and efficient parallel generalized guaranteed-quality Delaunay refinement codes for both 2D and 3D geometries, building on existing state-of-the-art sequential codes for traditional Delaunay refinement methods. On a heterogeneous cluster of more than 100 processors, our implementation can generate a uniform mesh with about a billion elements in less than 5 minutes. Even on a workstation with a few cores, we achieve a significant performance improvement over the corresponding state-of-the-art sequential 3D code for graded meshes.
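    A toy sketch of the Steiner point idea described above, assuming the standard circumradius-to-shortest-edge quality measure: a poor-quality triangle admits many admissible insertion points inside its circumscribed disk, not just the circumcenter. The selection region used here (a small disk around the circumcenter, controlled by a delta parameter) is only a placeholder, not the region derived in the dissertation.

# 2D illustration: quality test plus a (placeholder) region of candidate
# Steiner points inside the circumscribed disk of a bad triangle.
import math
import random

def circumcircle(a, b, c):
    """Circumcenter and circumradius of triangle (a, b, c)."""
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    r = math.dist((ux, uy), a)
    return (ux, uy), r

def is_poor_quality(a, b, c, rho_bar=math.sqrt(2)):
    """Circumradius-to-shortest-edge ratio test used in Delaunay refinement."""
    _, r = circumcircle(a, b, c)
    shortest = min(math.dist(a, b), math.dist(b, c), math.dist(c, a))
    return r / shortest > rho_bar

def pick_steiner_point(a, b, c, delta=0.2, rng=random.random):
    """Pick any point from a small disk around the circumcenter (illustrative)."""
    (ux, uy), r = circumcircle(a, b, c)
    theta = 2.0 * math.pi * rng()
    rad = delta * r * math.sqrt(rng())   # uniform sample over the placeholder disk
    return (ux + rad * math.cos(theta), uy + rad * math.sin(theta))

# Example: a thin "sliver" triangle fails the quality test and gets a Steiner point.
tri = ((0.0, 0.0), (1.0, 0.0), (0.5, 0.05))
if is_poor_quality(*tri):
    print("insert Steiner point at", pick_steiner_point(*tri))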