16 research outputs found

    Optimization of a Parallel CFD Code and Its Performance Evaluation on Tianhe-1A

    This paper describes performance tuning experiences with a parallel CFD code, aimed at enhancing its performance and flexibility on large-scale parallel computers. The code solves the incompressible Navier-Stokes equations based on the novel Slightly Compressible Model on three-dimensional structured grids. High-level loop transformations and argument-based code specialization are used to optimize its uniprocessor performance. Static arrays are converted into dynamically allocated arrays to improve flexibility, and the grid generator is coupled with the flow solver so that they can exchange grid data in memory. A detailed performance evaluation shows that the uniprocessor optimizations improve the performance of the flow solver by 1.38 to 3.93 times on the Tianhe-1A supercomputer, and the in-memory grid data exchange speeds up application startup by nearly two orders of magnitude. The optimized code exhibits excellent parallel scalability on realistic test cases: on 4,096 CPU cores, it achieves a strong-scaling parallel efficiency of 77.39% and a maximum performance of 4.01 Tflops.
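
    Two of the uniprocessor changes mentioned above, replacing static arrays with dynamically allocated ones and applying high-level loop transformations such as tiling, can be illustrated with a generic sketch. The data layout, names, and tile size below are hypothetical and are not taken from the paper's code; its argument-based specialization is not shown.

```c
#include <stdlib.h>

/* Hypothetical 3-D field stored as a dynamically allocated, contiguous
 * array instead of a fixed-size static array, so grid dimensions can be
 * chosen at run time (one of the flexibility changes described above). */
typedef struct {
    int ni, nj, nk;
    double *v;               /* ni*nj*nk values, k fastest */
} field3d;

field3d field_alloc(int ni, int nj, int nk) {
    field3d f = { ni, nj, nk,
                  malloc((size_t)ni * nj * nk * sizeof(double)) };
    return f;
}

#define IDX(f, i, j, k) ((size_t)(((i) * (f)->nj + (j)) * (f)->nk + (k)))

/* Tiled (blocked) stencil-style sweep: a typical high-level loop
 * transformation used to improve cache reuse.  B is an illustrative
 * tile size, not a value from the paper. */
enum { B = 16 };

void smooth(field3d *out, const field3d *in) {
    for (int jj = 1; jj < in->nj - 1; jj += B)
        for (int kk = 1; kk < in->nk - 1; kk += B)
            for (int i = 1; i < in->ni - 1; ++i)
                for (int j = jj; j < jj + B && j < in->nj - 1; ++j)
                    for (int k = kk; k < kk + B && k < in->nk - 1; ++k)
                        out->v[IDX(out, i, j, k)] =
                            (in->v[IDX(in, i - 1, j, k)] +
                             in->v[IDX(in, i + 1, j, k)] +
                             in->v[IDX(in, i, j - 1, k)] +
                             in->v[IDX(in, i, j + 1, k)] +
                             in->v[IDX(in, i, j, k - 1)] +
                             in->v[IDX(in, i, j, k + 1)]) / 6.0;
}
```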

    Conjugate gradient sparse solvers: performance-power characteristics

    We characterize the performance and power attributes of the conjugate gradient (CG) sparse solver, which is widely used in scientific applications. We use cycle-accurate simulations with SimpleScalar and Wattch, on a processor and memory architecture similar to the configuration of a node of the BlueGene/L. We first demonstrate that substantial power savings can be obtained without performance degradation if low-power modes of caches can be utilized. We next show that if Dynamic Voltage Scaling (DVS) can be used, power and energy savings are possible, but these are realized only at the expense of performance penalties. We then consider two simple memory subsystem optimizations, namely memory and level-2 cache prefetching. We demonstrate that when DVS and low-power modes of caches are used with these optimizations, performance can be improved significantly with reductions in power and energy. For example, execution time is reduced by 23%, power by 55%, and energy by 65% in the final configuration at 500 MHz relative to the original at 1 GHz. We also use our codes and the CG NAS benchmark code to demonstrate that performance and power profiles can vary significantly depending on matrix properties and the level of code tuning. These results indicate that architectural evaluations can benefit if traditional benchmarks are augmented with codes more representative of tuned scientific applications.
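
    As a point of reference for readers less familiar with the kernel being characterized, a minimal unpreconditioned CG iteration over a CSR matrix looks roughly like the sketch below. It is a generic textbook formulation, not the instrumented solver or the NAS CG benchmark used in the study, and it makes plain why the SpMV and vector operations dominate memory traffic.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Compressed sparse row matrix. */
typedef struct { int n; const int *rowptr, *col; const double *val; } csr;

/* y = A*x : the SpMV kernel whose memory traffic dominates CG. */
static void spmv(const csr *A, const double *x, double *y) {
    for (int i = 0; i < A->n; ++i) {
        double s = 0.0;
        for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p)
            s += A->val[p] * x[A->col[p]];
        y[i] = s;
    }
}

static double dot(int n, const double *a, const double *b) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

/* Unpreconditioned conjugate gradient for s.p.d. A; x starts at zero. */
static int cg(const csr *A, const double *b, double *x, int maxit, double tol) {
    int n = A->n;
    double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p),
           *q = malloc(n * sizeof *q);
    for (int i = 0; i < n; ++i) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
    double rho = dot(n, r, r);
    int it = 0;
    while (it < maxit && sqrt(rho) > tol) {
        spmv(A, p, q);
        double alpha = rho / dot(n, p, q);
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        double rho_new = dot(n, r, r);
        double beta = rho_new / rho;
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rho = rho_new;
        ++it;
    }
    free(r); free(p); free(q);
    return it;
}

int main(void) {
    /* 3x3 s.p.d. tridiagonal test matrix [2 -1 0; -1 2 -1; 0 -1 2]. */
    int rowptr[] = {0, 2, 5, 7}, col[] = {0, 1, 0, 1, 2, 1, 2};
    double val[] = {2, -1, -1, 2, -1, -1, 2};
    csr A = {3, rowptr, col, val};
    double b[] = {1, 0, 1}, x[3];
    int it = cg(&A, b, x, 100, 1e-12);
    printf("converged in %d iterations: x = %g %g %g\n", it, x[0], x[1], x[2]);
    return 0;
}
```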

    Parallelization of Reordering Algorithms for Bandwidth and Wavefront Reduction

    Many sparse matrix computations can be sped up if the matrix is first reordered. Reordering was originally developed for direct methods, but it has recently become popular for improving the cache locality of parallel iterative solvers, since reordering the matrix to reduce bandwidth and wavefront can improve the locality of reference of sparse matrix-vector multiplication (SpMV), the key kernel in iterative solvers. In this paper, we present the first parallel implementations of two widely used reordering algorithms: Reverse Cuthill-McKee (RCM) and Sloan. On 16 cores of the Stampede supercomputer, our parallel RCM is 5.56 times faster on average than a state-of-the-art sequential implementation of RCM in the HSL library. Sloan is significantly more constrained than RCM, but our parallel implementation achieves a speedup of 2.88x on average over sequential HSL-Sloan. Reordering the matrix using our parallel RCM and then performing 100 SpMV iterations is twice as fast as using HSL-RCM and then performing the SpMV iterations; it is also 1.5 times faster than performing the SpMV iterations without reordering the matrix.
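
    For orientation, the sequential Cuthill-McKee idea that both HSL-RCM and the parallel version build on is a breadth-first traversal that enqueues unvisited neighbors in order of increasing degree and then reverses the resulting order. The sketch below is a generic illustration of that idea only: it starts from the lowest-numbered unvisited vertex rather than a pseudo-peripheral vertex, and it is not the HSL code or the parallel algorithm from the paper.

```c
#include <stdlib.h>

/* Adjacency structure of a symmetric sparse matrix, CSR-style. */
typedef struct { int n; const int *xadj, *adj; } graph;

static const int *g_xadj;            /* used by the qsort comparator (not reentrant) */

static int by_degree(const void *a, const void *b) {
    int u = *(const int *)a, v = *(const int *)b;
    return (g_xadj[u + 1] - g_xadj[u]) - (g_xadj[v + 1] - g_xadj[v]);
}

/* Fill perm[] with a Reverse Cuthill-McKee ordering.  For simplicity the
 * traversal starts from the lowest-numbered unvisited vertex; real
 * implementations first locate a pseudo-peripheral starting vertex. */
void rcm(const graph *g, int *perm) {
    int n = g->n, head = 0, tail = 0;
    char *visited = calloc(n, 1);
    int *order = malloc(n * sizeof *order);
    g_xadj = g->xadj;
    for (int s = 0; s < n; ++s) {
        if (visited[s]) continue;
        visited[s] = 1;
        order[tail++] = s;
        while (head < tail) {
            int u = order[head++];
            int begin = tail;
            for (int p = g->xadj[u]; p < g->xadj[u + 1]; ++p) {
                int v = g->adj[p];
                if (!visited[v]) { visited[v] = 1; order[tail++] = v; }
            }
            /* Cuthill-McKee: enqueue unvisited neighbors by ascending degree. */
            qsort(order + begin, tail - begin, sizeof(int), by_degree);
        }
    }
    for (int i = 0; i < n; ++i)      /* reverse to obtain RCM */
        perm[i] = order[n - 1 - i];
    free(visited);
    free(order);
}
```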

    On the implementation of a robust and efficient finite element-based parallel solver for the compressible Navier-Stokes equations

    This paper describes in detail the implementation of a finite element technique for solving the compressible Navier-Stokes equations that is provably robust and demonstrates excellent performance on modern computer hardware. The method is second-order accurate in time and space. Robustness here means that the method is proved to be invariant domain preserving under the hyperbolic CFL time step restriction, and that it delivers results that are reproducible. The proposed technique is shown to be accurate on challenging 2D and 3D realistic benchmarks.
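
    The hyperbolic CFL restriction mentioned above bounds the admissible time step by the cell size divided by the fastest local wave speed. The helper below shows the standard acoustic estimate for a compressible-flow cell; it is a generic textbook bound, not the specific invariant-domain-preserving estimate proved in the paper.

```c
#include <math.h>

/* Generic hyperbolic CFL time-step estimate for one cell:
 *   dt <= CFL * h / (|u| + c),  with sound speed c = sqrt(gamma * p / rho).
 * All arguments are local cell quantities; the bound is the usual textbook
 * one, not the paper's proven estimate. */
double hyperbolic_dt(double h, double rho, double u, double v, double w,
                     double p, double gamma, double cfl) {
    double speed = sqrt(u * u + v * v + w * w);
    double c = sqrt(gamma * p / rho);        /* local speed of sound */
    return cfl * h / (speed + c);
}
```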

    Hybrid multigrid methods for high-order discontinuous Galerkin discretizations

    The present work develops hybrid multigrid methods for high-order discontinuous Galerkin discretizations of elliptic problems. Fast matrix-free operator evaluation on tensor product elements is used to devise a computationally efficient PDE solver. The multigrid hierarchy exploits all possibilities of geometric, polynomial, and algebraic coarsening, targeting engineering applications on complex geometries. Additionally, a transfer from discontinuous to continuous function spaces is performed within the multigrid hierarchy. This not only further reduces the size of the coarse-grid problem, but also leads to a discretization most suitable for state-of-the-art algebraic multigrid methods applied as the coarse-grid solver. The relevant design choices regarding the selection of optimal multigrid coarsening strategies among the various possibilities are discussed, with computational cost as the driving metric for algorithmic selections. We find that a transfer to a continuous function space at the highest polynomial degree (or on the finest mesh), followed by polynomial and geometric coarsening, shows the best overall performance. The success of this particular multigrid strategy is due to a significant reduction in iteration counts compared to a transfer from discontinuous to continuous function spaces at the lowest polynomial degree (or on the coarsest mesh). The coarsening strategy with transfer to a continuous function space on the finest level also leads to a multigrid algorithm that is robust with respect to the penalty parameter of the SIPG method. Detailed numerical investigations are conducted for a series of examples ranging from academic test cases to more complex, practically relevant geometries. Performance comparisons with state-of-the-art methods from the literature demonstrate the versatility and computational efficiency of the proposed multigrid algorithms.
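
    To make the recursive structure that any such hierarchy shares concrete, the following self-contained toy shows a geometric multigrid V-cycle for a 1-D Poisson problem with a weighted Jacobi smoother. It illustrates only the V-cycle skeleton: the solver described above additionally mixes DG-to-CG transfer, polynomial and algebraic coarsening, matrix-free operator evaluation, and an AMG coarse-grid solver, none of which appears in this sketch.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Residual r = f - A*u for the 1-D Poisson operator A = (1/h^2) tridiag(-1,2,-1)
 * on n interior points with homogeneous Dirichlet boundary values. */
static void residual(int n, double h, const double *u, const double *f, double *r) {
    for (int i = 0; i < n; ++i) {
        double left  = (i > 0)     ? u[i - 1] : 0.0;
        double right = (i < n - 1) ? u[i + 1] : 0.0;
        r[i] = f[i] - (2.0 * u[i] - left - right) / (h * h);
    }
}

/* Weighted Jacobi smoother (omega = 2/3). */
static void smooth(int n, double h, double *u, const double *f, int sweeps) {
    double *r = malloc(n * sizeof *r);
    for (int s = 0; s < sweeps; ++s) {
        residual(n, h, u, f, r);
        for (int i = 0; i < n; ++i)
            u[i] += (2.0 / 3.0) * (h * h / 2.0) * r[i];
    }
    free(r);
}

/* One V-cycle; n must be of the form 2^k - 1. */
static void v_cycle(int n, double h, double *u, const double *f) {
    if (n <= 3) {                        /* coarsest level: smooth to convergence */
        smooth(n, h, u, f, 50);
        return;
    }
    smooth(n, h, u, f, 3);               /* pre-smoothing */
    int nc = (n - 1) / 2;
    double *r  = malloc(n * sizeof *r);
    double *rc = calloc(nc, sizeof *rc);
    double *ec = calloc(nc, sizeof *ec); /* coarse correction, starts at zero */
    residual(n, h, u, f, r);
    for (int i = 0; i < nc; ++i)         /* full-weighting restriction */
        rc[i] = 0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2];
    v_cycle(nc, 2.0 * h, ec, rc);        /* recursive coarse-grid correction */
    for (int i = 0; i < nc; ++i) {       /* linear-interpolation prolongation */
        u[2 * i]     += 0.5 * ec[i];
        u[2 * i + 1] += ec[i];
        u[2 * i + 2] += 0.5 * ec[i];
    }
    smooth(n, h, u, f, 3);               /* post-smoothing */
    free(r); free(rc); free(ec);
}

int main(void) {
    int n = 127;                         /* 2^7 - 1 interior points */
    double h = 1.0 / (n + 1), pi = acos(-1.0);
    double *u = calloc(n, sizeof *u), *f = malloc(n * sizeof *f), *r = malloc(n * sizeof *r);
    for (int i = 0; i < n; ++i)          /* f = pi^2 sin(pi x), exact u = sin(pi x) */
        f[i] = pi * pi * sin(pi * (i + 1) * h);
    for (int cycle = 0; cycle < 10; ++cycle) {
        v_cycle(n, h, u, f);
        residual(n, h, u, f, r);
        double norm = 0.0;
        for (int i = 0; i < n; ++i) norm += r[i] * r[i];
        printf("cycle %d  residual %.3e\n", cycle + 1, sqrt(norm));
    }
    free(u); free(f); free(r);
    return 0;
}
```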

    Performance Modeling and Prediction for the Scalable Solution of Partial Differential Equations on Unstructured Grids

    This dissertation studies the sources of poor performance in scientific computing codes based on partial differential equations (PDEs), which typically perform at a computational rate well below other scientific simulations (e.g., those with dense linear algebra or N-body kernels) on modern architectures with deep memory hierarchies. We identify the primary factors responsible for this relatively poor performance as insufficient available memory bandwidth, a low ratio of work to data size (a consequence of good algorithmic efficiency), and the nonscaling cost of synchronization and gather/scatter operations (for fixed-size problems). This dissertation also illustrates how to reuse legacy scientific and engineering software within a library framework. Specifically, a three-dimensional unstructured grid incompressible Euler code from NASA has been parallelized with the Portable Extensible Toolkit for Scientific Computing (PETSc) library for distributed memory architectures. Using this newly instrumented code (called PETSc-FUN3D) as an example of a typical PDE solver, we demonstrate some strategies that are effective in tolerating the latencies arising from the hierarchical memory system and the network. Even on a single processor from each of the major contemporary architectural families, the PETSc-FUN3D code runs from 2.5 to 7.5 times faster than the legacy code on a medium-sized data set (with approximately 10^5 degrees of freedom). The major source of performance improvement is the increased locality in data reference patterns achieved through blocking, interlacing, and edge reordering. To explain these performance gains, we provide simple performance models based on memory bandwidth and instruction issue rates. Experimental evidence, in terms of translation lookaside buffer (TLB) and data cache miss rates, achieved memory bandwidth, and graduated floating-point instructions per memory reference, is provided through accurate measurements with hardware counters. The performance models and experimental results motivate algorithmic and software practices that lead to improvements in both parallel scalability and per-node performance. We identify the bottlenecks to scalability (algorithmic as well as implementation) for a fixed-size problem when the number of processors grows to several thousand (the expected level of concurrency on terascale architectures). We also evaluate the hybrid (mixed distributed/shared) programming model from a performance standpoint.
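
    A caricature of the memory-bandwidth-based performance models mentioned above: the sustainable flop rate of a sparse, memory-bound kernel is capped by the achievable bandwidth divided by the bytes it must move per flop, as well as by the processor's peak issue rate. The machine numbers and the bytes-per-flop figure in the sketch below are illustrative placeholders, not measurements or parameters from the dissertation.

```c
#include <stdio.h>

/* Bandwidth-bound performance estimate: the predicted rate is the smaller
 * of the peak flop rate and (memory bandwidth) / (bytes moved per flop). */
static double predicted_gflops(double peak_gflops, double stream_bw_gbs,
                               double bytes_per_flop) {
    double bw_bound = stream_bw_gbs / bytes_per_flop;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

int main(void) {
    /* Illustrative figures only: a CSR sparse matrix-vector product streams
     * roughly one 8-byte value and one 4-byte index per nonzero (plus source
     * vector traffic) for 2 flops, i.e. on the order of 6 bytes per flop. */
    double est = predicted_gflops(4.0 /* peak Gflop/s */,
                                  6.0 /* GB/s sustained bandwidth */,
                                  6.0 /* bytes per flop */);
    printf("predicted sustainable rate: %.2f Gflop/s\n", est);
    return 0;
}
```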