Binary Mesh Partitioning for Cache-Efficient Processing
One important bottleneck when visualizing large data sets is the data transfer between processor and memory. Cache-aware (CA) and cache-oblivious (CO) algorithms take the memory hierarchy into consideration to achieve cache efficiency. CO approaches have the advantage of adapting to unknown and varying memory hierarchies. Recent CA and CO algorithms developed for 3D mesh layouts significantly improve on the performance of previous approaches, but lack theoretical performance guarantees. We present in this report an O(N log N) algorithm to compute a CO layout for unstructured meshes. We prove that a coherent traversal of a mesh of size N in dimension d induces fewer than N/B+O(N/M^{1/d}) cache misses, where B and M are the block size and the cache size. Experiments show that our layout computation is faster and significantly less memory consuming than the best known CO algorithm. Performance is comparable to this algorithm for classical visualization access patterns, or better if the access pattern is adapted to the binary mesh partitioning produced by the algorithm. We also show that cache-oblivious approaches lead to significant performance increases on recent GPU architectures.
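To make the idea of a layout derived from a binary partitioning concrete, here is a minimal sketch in Python. It is not the algorithm of the report: it assumes the mesh is reduced to a list of vertex coordinates, splits at the median along the axis of largest extent, and re-sorts at every level, so it runs in O(N log^2 N) rather than O(N log N). The function name `bisection_layout` and the `leaf_size` parameter are hypothetical.

```python
def bisection_layout(points, indices=None, leaf_size=1):
    """Order vertex indices by recursive median bisection (illustrative only).

    points  : list of coordinate tuples, one per vertex
    indices : subset of vertex indices to order (defaults to all vertices)
    Vertices that end up in the same small cell of the partition are stored
    next to each other in the returned ordering.
    """
    if indices is None:
        indices = list(range(len(points)))
    if len(indices) <= leaf_size:
        return list(indices)

    dim = len(points[0])

    def extent(axis):
        vals = [points[i][axis] for i in indices]
        return max(vals) - min(vals)

    # Split along the axis with the largest spatial extent.
    axis = max(range(dim), key=extent)
    ordered = sorted(indices, key=lambda i: points[i][axis])
    mid = len(ordered) // 2

    # Lay out the left half completely before the right half.
    return (bisection_layout(points, ordered[:mid], leaf_size)
            + bisection_layout(points, ordered[mid:], leaf_size))


# Usage: vertices of a 4 x 4 grid.
grid = [(x, y) for x in range(4) for y in range(4)]
layout = bisection_layout(grid)
```

Because each half is laid out contiguously at every level of the recursion, a traversal that visits spatially nearby vertices tends to touch nearby memory, whatever the block and cache sizes are.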
High-Performance Computing: Dos and Don'ts
Computational fluid dynamics (CFD) is the main field of computational mechanics that has historically benefited from advances in high-performance computing. High-performance computing involves several techniques to make a simulation efficient and fast, such as distributed memory parallelism, shared memory parallelism, vectorization, and memory access optimizations. As an introduction, we present the anatomy of supercomputers, with special emphasis on HPC aspects relevant to CFD. Then, we develop some of the HPC concepts and numerical techniques applied to the complete CFD simulation framework: from preprocessing (meshing) to postprocessing (visualization) through the simulation itself (assembly and iterative solvers).
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
Starting from a high-level problem description in terms of partial
differential equations using abstract tensor notation, the Chemora framework
discretizes, optimizes, and generates complete high performance codes for a
wide range of compute architectures. Chemora extends the capabilities of
Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient
manner for complex applications, without low-level code tuning. Chemora
achieves parallelism through MPI and multi-threading, combining OpenMP and
CUDA. Optimizations include high-level code transformations, efficient loop
traversal strategies, dynamically selected data and instruction cache usage
strategies, and JIT compilation of GPU code tailored to the problem
characteristics. The discretization is based on higher-order finite differences
on multi-block domains. Chemora's capabilities are demonstrated by simulations
of black hole collisions. This problem provides an acid test of the framework,
as the Einstein equations contain hundreds of variables and thousands of terms. Comment: 18 pages, 4 figures, accepted for publication in Scientific
Programming
Efficient Generation and Processing of Large-Scale Unstructured Meshes
Unstructured meshes are used in a variety of disciplines to represent simulations and experimental data. Scientists who want to increase the accuracy of simulations by increasing resolution must also increase the size of the resulting dataset. However, generating and processing extremely large unstructured meshes remains a barrier. Researchers have published many parallel Delaunay triangulation (DT) algorithms, often focusing on partitioning the initial mesh domain so that each rectangular partition can be triangulated in parallel. However, the common problems with this method are how to merge all triangulated partitions into a single domain-wide mesh and the significant cost of communication across the sub-region borders. We devised a novel algorithm, Triangulation of Independent Partitions in Parallel (TIPP), to deal with very large DT problems without requiring inter-processor communication while still guaranteeing the Delaunay criterion. The core of the algorithm is to find a set of independent partitions such that the circumcircles of triangles in one partition do not enclose any vertex in other partitions. For this reason, this set of independent partitions can be triangulated in parallel without the partitions affecting each other. Mesh generation produces large unstructured meshes, stored as vertex index and vertex coordinate files, which introduce a new challenge: locality. Partitioning unstructured meshes to improve locality is a key part of our approach. Elements that were widely scattered in the original dataset are grouped together, speeding data access. To further improve unstructured mesh partitioning, we also describe our new approach, Direct Load, which mitigates the challenges of unstructured meshes by maximizing the proportion of useful data retrieved during each read from disk, which in turn reduces the total number of read operations, boosting performance.
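The independence criterion stated above (no circumcircle of a triangle in one partition encloses a vertex of another partition) can be checked directly. The following Python sketch illustrates only that geometric test in 2D. It is not the TIPP algorithm, which must also find such partitions, and the names `circumcircle` and `is_independent` are hypothetical. It uses plain floating-point arithmetic with no tolerance, so points lying almost exactly on a circle may be classified either way.

```python
def circumcircle(a, b, c):
    """Return (center, squared radius) of the circle through 2D points a, b, c."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0.0:
        raise ValueError("degenerate (collinear) triangle")
    a2 = ax * ax + ay * ay
    b2 = bx * bx + by * by
    c2 = cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    r2 = (ax - ux) ** 2 + (ay - uy) ** 2
    return (ux, uy), r2


def is_independent(triangles, other_vertices):
    """True if no circumcircle of `triangles` strictly encloses a vertex
    belonging to another partition (`other_vertices`)."""
    for a, b, c in triangles:
        (ux, uy), r2 = circumcircle(a, b, c)
        for px, py in other_vertices:
            if (px - ux) ** 2 + (py - uy) ** 2 < r2:
                return False
    return True


# Usage: one triangle, tested against a far vertex and a nearby one.
tri = [((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))]
far_ok = is_independent(tri, [(5.0, 5.0)])
near_ok = is_independent(tri, [(0.9, 0.9)])
```

When the test holds for every pair of partitions, each partition's triangulation already satisfies the Delaunay criterion with respect to all vertices, which is why no merging or border communication is needed.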
ColDICE: a parallel Vlasov-Poisson solver using moving adaptive simplicial tessellation
Numerically solving the Vlasov-Poisson equations for initially cold systems can
be reduced to following the evolution of a three-dimensional sheet evolving in
six-dimensional phase-space. We describe a public parallel numerical algorithm
that represents the phase-space sheet with a conforming,
self-adaptive simplicial tessellation of which the vertices follow the
Lagrangian equations of motion. The algorithm is implemented both in six- and
four-dimensional phase-space. Refinement of the tessellation mesh is performed
using the bisection method and a local representation of the phase-space sheet
at second order relying on additional tracers created when needed at runtime.
In order to best preserve the Hamiltonian nature of the system,
refinement is anisotropic and constrained by measurements of local Poincaré
invariants. The Poisson equation is solved using the fast Fourier
method on a regular rectangular grid, similarly to particle-in-cell codes. To
compute the density projected onto this grid, the intersection of the
tessellation and the grid is calculated using the method of Franklin and
Kankanhalli (1993) generalised to linear order. As preliminary tests of the
code, we study in four-dimensional phase-space the evolution of an initially
small patch in a chaotic potential and the cosmological collapse of a
fluctuation composed of two sinusoidal waves. We also perform a "warm" dark
matter simulation in six-dimensional phase-space that we use to check the
parallel scaling of the code. Comment: Code and illustration movies available at:
http://www.vlasix.org/index.php?n=Main.ColDICE - Article submitted to Journal
of Computational Physics
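As an illustration of solving the Poisson equation with the fast Fourier method on a regular grid, here is a minimal one-dimensional periodic sketch in pure Python. It is not ColDICE code. It assumes the equation is written as d²φ/dx² = ρ with physical constants absorbed into ρ, a grid size that is a power of two, and a zero-mean solution. The names `fft` and `solve_poisson_periodic` are hypothetical.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def solve_poisson_periodic(rho, length):
    """Solve d^2(phi)/dx^2 = rho on a periodic grid of physical size `length`.

    In Fourier space each mode satisfies -k^2 * phi_k = rho_k. The k = 0 mode
    is set to zero, which fixes the mean of phi to zero.
    """
    n = len(rho)
    rho_hat = fft(rho)
    phi_hat = [0j] * n
    for m in range(1, n):
        freq = m if m <= n // 2 else m - n   # signed mode number
        k = 2.0 * math.pi * freq / length    # wavenumber
        phi_hat[m] = -rho_hat[m] / (k * k)
    phi = fft(phi_hat, inverse=True)
    return [v.real / n for v in phi]         # normalize the inverse transform


# Usage: rho = -sin(x) on [0, 2*pi) has the exact solution phi = sin(x).
n = 16
xs = [2.0 * math.pi * j / n for j in range(n)]
phi = solve_poisson_periodic([-math.sin(x) for x in xs], 2.0 * math.pi)
```

The same mode-by-mode division extends to two and three dimensions by replacing k² with the squared norm of the wavevector.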