87 research outputs found

    Petascale turbulence simulation using a highly parallel fast multipole method on GPUs

    Full text link
    This paper reports large-scale direct numerical simulations of homogeneous-isotropic fluid turbulence, achieving sustained performance of 1.08 petaflop/s on gpu hardware using single precision. The simulations use a vortex particle method to solve the Navier-Stokes equations, with a highly parallel fast multipole method (FMM) as numerical engine, and match the current record in mesh size for this application, a cube of 4096^3 computational points solved with a spectral method. The standard numerical approach used in this field is the pseudo-spectral method, relying on the FFT algorithm as numerical engine. The particle-based simulations presented in this paper quantitatively match the kinetic energy spectrum obtained with a pseudo-spectral method, using a trusted code. In terms of parallel performance, weak scaling results show the fmm-based vortex method achieving 74% parallel efficiency on 4096 processes (one gpu per mpi process, 3 gpus per node of the TSUBAME-2.0 system). The FFT-based spectral method is able to achieve just 14% parallel efficiency on the same number of mpi processes (using only cpu cores), due to the all-to-all communication pattern of the FFT algorithm. The calculation time for one time step was 108 seconds for the vortex method and 154 seconds for the spectral method, under these conditions. Computing with 69 billion particles, this work exceeds by an order of magnitude the largest vortex method calculations to date

    Stable multilevel splittings of boundary edge element spaces

    Get PDF
    We establish the stability of nodal multilevel decompositions of lowest-order conforming boundary element subspaces of the trace space H12(divΓ,Γ){\boldsymbol{H}}^{-\frac {1}{2}}(\operatorname {div}_{\varGamma },{\varGamma }) of H(curl,Ω){\boldsymbol{H}}(\operatorname {\bf curl},{\varOmega }) on boundaries of triangulated Lipschitz polyhedra. The decompositions are based on nested triangular meshes created by uniform refinement and the stability bounds are uniform in the number of refinement levels. The main tool is the general theory of P.Oswald (Interface preconditioners and multilevel extension operators, in Proc. 11th Intern. Conf. on Domain Decomposition Methods, London, 1998, pp.96-103) that teaches, when stability of decompositions of boundary element spaces with respect to trace norms can be inferred from corresponding stability results for finite element spaces. H(curl,Ω){\boldsymbol{H}}(\operatorname {\bf curl},{\varOmega }) -stable discrete extension operators are instrumental in this. Stable multilevel decompositions immediately spawn subspace correction preconditioners whose performance will not degrade on very fine surface meshes. Thus, the results of this article demonstrate how to construct optimal iterative solvers for the linear systems of equations arising from the Galerkin edge element discretization of boundary integral equations for eddy current problem

    Supporting general data structures and execution models in runtime environments

    Get PDF
    Para aprovechar las plataformas paralelas, se necesitan herramientas de programación para poder representar apropiadamente los algoritmos paralelos. Además, los entornos paralelos requieren sistemas en tiempo de ejecución que ofrezcan diferentes paradigmas de computación. Existen diferentes áreas a estudiar con el fin de construir un sistema en tiempo de ejecución completo para un entorno paralelo. Esta Tesis aborda dos problemas comunes: el soporte unificado de datos densos y dispersos, y la integración de paralelismo orientado a mapeo de datos y paralelismo orientado a flujo de datos. Esta Tesis propone una solución que desacopla la representación, partición y reparto de datos, del algoritmo y de la estrategia de diseño paralelo para integrar manejo para datos densos y dispersos. Además, se presenta un nuevo modelo de programación basado en el paradigma de flujo de datos, donde diferentes actividades pueden ser arbitrariamente enlazadas para formar redes genéricas pero estructuradas que representan el cómputo globalDepartamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos

    Resource-aware Data Parallel Array Processing

    Get PDF

    Semiannual report

    Get PDF
    This report summarizes research conducted at the Institute for Computer Applications in Science and Engineering in applied mathematics, fluid mechanics, and computer science during the period 1 Oct. 1994 - 31 Mar. 1995

    木を用いた構造化並列プログラミング

    Get PDF
    High-level abstractions for parallel programming are still immature. Computations on complicated data structures such as pointer structures are considered as irregular algorithms. General graph structures, which irregular algorithms generally deal with, are difficult to divide and conquer. Because the divide-and-conquer paradigm is essential for load balancing in parallel algorithms and a key to parallel programming, general graphs are reasonably difficult. However, trees lead to divide-and-conquer computations by definition and are sufficiently general and powerful as a tool of programming. We therefore deal with abstractions of tree-based computations. Our study has started from Matsuzaki’s work on tree skeletons. We have improved the usability of tree skeletons by enriching their implementation aspect. Specifically, we have dealt with two issues. We first have implemented the loose coupling between skeletons and data structures and developed a flexible tree skeleton library. We secondly have implemented a parallelizer that transforms sequential recursive functions in C into parallel programs that use tree skeletons implicitly. This parallelizer hides the complicated API of tree skeletons and makes programmers to use tree skeletons with no burden. Unfortunately, the practicality of tree skeletons, however, has not been improved. On the basis of the observations from the practice of tree skeletons, we deal with two application domains: program analysis and neighborhood computation. In the domain of program analysis, compilers treat input programs as control-flow graphs (CFGs) and perform analysis on CFGs. Program analysis is therefore difficult to divide and conquer. To resolve this problem, we have developed divide-and-conquer methods for program analysis in a syntax-directed manner on the basis of Rosen’s high-level approach. Specifically, we have dealt with data-flow analysis based on Tarjan’s formalization and value-graph construction based on a functional formalization. In the domain of neighborhood computations, a primary issue is locality. A naive parallel neighborhood computation without locality enhancement causes a lot of cache misses. The divide-and-conquer paradigm is known to be useful also for locality enhancement. We therefore have applied algebraic formalizations and a tree-segmenting technique derived from tree skeletons to the locality enhancement of neighborhood computations.電気通信大学201
    corecore