Construction and Application of an AMR Algorithm for Distributed Memory Computers
While the parallelization of block-structured adaptive mesh refinement (AMR)
techniques is relatively straightforward on shared-memory architectures,
appropriate distribution strategies for the emerging generation of
distributed-memory machines are a topic of ongoing research. In this paper, a
locality-preserving domain decomposition is proposed that partitions the entire
AMR hierarchy from the base level on. It is shown that the approach reduces the
communication costs and simplifies the implementation. Emphasis is put on the
effective parallelization of the flux correction procedure at coarse-fine
boundaries, which is indispensable for conservative finite volume schemes. An
easily reproducible standard benchmark and a highly resolved parallel AMR
simulation of a diffracting hydrogen-oxygen detonation demonstrate the proposed
strategy in practice.
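The role of flux correction at a coarse-fine boundary can be illustrated with a minimal 1D sketch. It is not the paper's implementation: the function name, the argument layout, and the assumption that the fine patch lies to the right of the coarse cell are all hypothetical. The idea shown is only that the coarse flux used at the shared face is replaced, after the fact, by the time average of the fine-level fluxes, so that both levels agree on what crossed the face.

```python
def flux_correction(u_coarse, f_coarse, fine_fluxes, dt, dx):
    """Correct a coarse cell whose right face borders a refined patch.

    u_coarse    : coarse cell value after the ordinary coarse update
    f_coarse    : flux the coarse update used at the shared face
    fine_fluxes : fluxes the fine level used at that face, one per fine substep
                  (each substep is assumed to last dt / len(fine_fluxes))
    dt, dx      : coarse time step and coarse cell width
    """
    # Time-averaged fine flux through the shared face over one coarse step.
    f_fine_avg = sum(fine_fluxes) / len(fine_fluxes)
    # The coarse update subtracted dt/dx * f_coarse at the right face;
    # swap that contribution for the fine-level average.
    return u_coarse + dt / dx * (f_coarse - f_fine_avg)


# Hypothetical numbers: two fine substeps per coarse step.
corrected = flux_correction(u_coarse=1.0, f_coarse=0.5,
                            fine_fluxes=[0.3, 0.3], dt=0.1, dx=1.0)
print(corrected)
```

In a distributed setting the coarse cell and the fine face fluxes may live on different ranks, which is why the abstract singles this step out as a communication concern.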
A scalable parallel finite element framework for growing geometries. Application to metal additive manufacturing
This work introduces an innovative parallel, fully-distributed finite element
framework for growing geometries and its application to metal additive
manufacturing. It is well known that virtual part design and qualification in
additive manufacturing require highly accurate multiscale and multiphysics
analyses. Only high performance computing tools are able to handle such
complexity in time frames compatible with time-to-market. However, efficiency,
without loss of accuracy, has rarely held the centre stage in the numerical
community. Here, in contrast, the framework is designed to adequately exploit
the resources of high-end distributed-memory machines. It is grounded on three
building blocks: (1) Hierarchical adaptive mesh refinement with octree-based
meshes; (2) a parallel strategy to model the growth of the geometry; (3)
state-of-the-art parallel iterative linear solvers. Computational experiments
consider the heat transfer analysis at the part scale of the printing process
by powder-bed technologies. After verification against a 3D benchmark, a
strong-scaling analysis assesses performance and identifies major sources of
parallel overhead. A third numerical example examines the efficiency and
robustness of (2) in a curved 3D shape. Unprecedented parallelism and
scalability were achieved in this work. Hence, this framework helps to tackle
higher complexity and/or accuracy, not only in part-scale simulations
of metal or polymer additive manufacturing, but also in welding, sedimentation,
atherosclerosis, or any other physical problem where the physical domain of
interest grows in time.
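For readers unfamiliar with the term, a strong-scaling analysis fixes the problem size and measures how wall-clock time drops as cores are added. The following is a minimal sketch of the usual bookkeeping; the function name and the timings are hypothetical and are not results from the paper.

```python
def strong_scaling(timings):
    """Map core count -> (speedup, parallel efficiency).

    timings: dict {cores: wall-clock seconds} for one fixed problem size.
    Both quantities are measured relative to the run with the fewest cores.
    """
    p0 = min(timings)
    t0 = timings[p0]
    table = {}
    for p in sorted(timings):
        speedup = t0 / timings[p]
        # Ideal speedup is p / p0, so efficiency is speedup divided by that.
        efficiency = speedup * p0 / p
        table[p] = (speedup, efficiency)
    return table


# Made-up timings, purely for illustration.
example = {48: 100.0, 96: 52.0, 192: 28.0}
for cores, (s, e) in strong_scaling(example).items():
    print(f"{cores:5d} cores: speedup {s:.2f}, efficiency {e:.2f}")
```

Efficiency falling below 1 as cores increase is the signature of parallel overhead, such as communication or load imbalance.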
One machine, one minute, three billion tetrahedra
This paper presents a new scalable parallelization scheme to generate the 3D
Delaunay triangulation of a given set of points. Our first contribution is an
efficient serial implementation of the incremental Delaunay insertion
algorithm. A simple dedicated data structure, an efficient sorting of the
points and the optimization of the insertion algorithm have made it possible to
outperform reference implementations by a factor of three. Our second contribution
is a multi-threaded version of the Delaunay kernel that is able to concurrently
insert vertices. Moore curve coordinates are used to partition the point set,
avoiding heavy synchronization overheads. Conflicts are managed by modifying
the partitions with a simple rescaling of the space-filling curve. The
performance of our implementation has been measured on three different
processors, an Intel Core i7, an Intel Xeon Phi and an AMD EPYC, on which we
have been able to compute 3 billion tetrahedra in 53 seconds. This corresponds
to a generation rate of over 55 million tetrahedra per second. We finally show
how this very efficient parallel Delaunay triangulation can be integrated in a
Delaunay refinement mesh generator which takes as input the triangulated
surface boundary of the volume to mesh.
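Partitioning a point set along a space-filling curve can be sketched as follows. Note the substitution: the paper uses Moore curve coordinates, whereas this sketch uses the simpler Z-order (Morton) curve, and it does not reproduce the paper's conflict handling by rescaling. The function names and the assumption that points lie in the unit cube are hypothetical.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the bits of three integer grid coordinates (Z-order key)."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key


def partition_points(points, n_parts, bits=10):
    """Sort points in [0, 1)^3 along a Z-order curve, then cut the sorted
    sequence into at most n_parts contiguous chunks of near-equal size."""
    scale = 1 << bits

    def key(p):
        # Quantize each coordinate onto a 2^bits grid, clamping the upper edge.
        ix, iy, iz = (min(int(c * scale), scale - 1) for c in p)
        return morton3d(ix, iy, iz, bits)

    ordered = sorted(points, key=key)
    size = max(1, -(-len(ordered) // n_parts))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


pts = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.1, 0.1, 0.2), (0.9, 0.9, 0.8)]
parts = partition_points(pts, n_parts=2)
print(parts)
```

Because nearby keys tend to correspond to nearby points, each chunk covers a spatially compact region, so threads inserting vertices from different chunks rarely touch the same tetrahedra.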
Dataflow methods in HPC, visualisation and analysis
The processing power available to scientists and engineers using supercomputers has grown exponentially over the last few decades, permitting significantly more sophisticated simulations and, as a consequence, generating proportionally larger output datasets. This change has taken place in tandem with a gradual shift in the design and implementation of simulation and post-processing software: from simulation as a first step and visualisation/analysis as a second, towards in-situ, on-the-fly methods that provide immediate visual feedback, place less strain on file systems, and reduce overall data movement and copying. Concurrently, processor speed increases have slowed dramatically, and multi- and many-core architectures have instead become the norm for virtually all high-performance computing (HPC) machines. This in turn has led to a shift away from the traditional distributed one-rank-per-node model to one rank per process, using multiple processes per multicore node, and then back towards one rank per node again, using distributed and multi-threaded frameworks combined.
This thesis consists of a series of publications that demonstrate how software design for analysis and visualisation has tracked these architectural changes and pushed the boundaries of HPC visualisation using dataflow techniques in distributed environments. The first publication shows how support for the time dimension in parallel pipelines can be implemented, demonstrating how information flow within an application can be leveraged to optimise performance and add features such as analysis of time-dependent flows and comparison of datasets at different timesteps. A method of integrating dataflow pipelines with in-situ visualisation is subsequently presented, using asynchronous coupling of user-driven GUI controls and a live simulation running on a supercomputer. The loose coupling of analysis and simulation allows for reduced I/O, immediate feedback and the ability to change simulation parameters on the fly.
A significant drawback of parallel pipelines is the inefficiency caused by improper load-balancing, particularly during interactive analysis where the user may select between different features of interest. This problem is addressed in the fourth publication by integrating a high-performance partitioning library into the visualization pipeline and extending the information flow up and down the pipeline to support it. This extension is demonstrated in the third publication (published earlier) on massive meshes with extremely high complexity, and shows that general-purpose visualization tools such as ParaView can be made to compete with bespoke software written for a dedicated task.
The future of software running on many-core architectures will involve task-based runtimes, with dynamic load-balancing, asynchronous execution based on dataflow graphs, work stealing and concurrent data sharing between simulation and analysis. The final paper of this thesis presents an optimisation for one such runtime, in support of these future HPC applications.