
    Parallel efficiency of a boundary integral equation method for nonlinear water waves

    We describe the application of domain decomposition to a boundary integral method for the study of nonlinear surface waves on water, in a test case for which the domain decomposition approach is an important tool for reducing the computational effort. An important aspect is the determination of the optimum number of domains for a given parallel architecture. Previous work on heterogeneous clusters of workstations is extended to (dedicated) parallel platforms. For these systems a better indication of the parallel performance of the domain decomposition method is obtained because the speed of the processing elements does not vary.
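    The trade-off behind the optimum number of domains can be illustrated with a toy cost model (the function names and constants below are hypothetical, not from the paper): per-domain computation shrinks as work/p while inter-domain communication grows with p, so the modeled runtime has an interior minimum.

```python
import math

def total_time(work, comm_cost, p):
    """Modeled runtime on p domains: computation scales as work/p,
    inter-domain communication grows roughly linearly in p."""
    return work / p + comm_cost * p

def optimal_domains(work, comm_cost, max_p):
    """Domain count in 1..max_p that minimizes the modeled runtime."""
    return min(range(1, max_p + 1),
               key=lambda p: total_time(work, comm_cost, p))

# The continuous optimum of work/p + c*p is p* = sqrt(work/c),
# so this search should land on (or next to) that value.
p_best = optimal_domains(work=1600.0, comm_cost=1.0, max_p=64)
```

    On a real machine the communication constant would be measured per architecture; the paper's point is precisely that this optimum shifts between heterogeneous workstation clusters and dedicated parallel platforms.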

    Parallel TREE code for two-component ultracold plasma analysis

    The TREE method has been widely used for long-range interaction N-body problems. We have developed a parallel TREE code for two-component classical plasmas with open boundary conditions and highly non-uniform charge distributions. The program efficiently handles millions of particles evolved over long relaxation times requiring millions of time steps. Appropriate domain decomposition and dynamic data management were employed, and large-scale parallel processing was achieved using an intermediate level of granularity of domain decomposition and ghost TREE communication. Even though the computational load is not fully distributed at fine granularity, high parallel efficiency was achieved for ultracold plasma systems of charged particles. As an application, we performed simulations of an ultracold neutral plasma with half a million particles and half a million time steps. With such long temporal trajectories of relaxation between heavy ions and light electrons, large configurations of ultracold plasmas can now be investigated, which was not possible in past studies.
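    A minimal serial sketch of the tree idea (a Barnes-Hut-style monopole acceptance test, reduced to 1D for brevity; the paper's code is a 3D two-component implementation with domain decomposition and ghost-tree exchange, all omitted here):

```python
class Node:
    """1D tree node: stores total charge and a |q|-weighted centre,
    the monopole data used when the node is accepted as 'far'."""
    def __init__(self, lo, hi, particles):
        self.size = hi - lo
        self.charge = sum(q for _, q in particles)
        wsum = sum(abs(q) for _, q in particles)
        self.center = sum(x * abs(q) for x, q in particles) / wsum
        self.children = []
        if len(particles) > 1:
            mid = 0.5 * (lo + hi)
            left = [p for p in particles if p[0] < mid]
            right = [p for p in particles if p[0] >= mid]
            if left and right:
                self.children = [Node(lo, mid, left), Node(mid, hi, right)]

def field_at(node, x, theta):
    """1/r^2 field at x: take a node's monopole when size/dist < theta,
    otherwise open the node and recurse into its children."""
    d = x - node.center
    if not node.children or (d != 0 and node.size / abs(d) < theta):
        return node.charge * d / abs(d) ** 3 if d != 0 else 0.0
    return sum(field_at(c, x, theta) for c in node.children)

# Two-component toy set: mixed +1 and -1 charges in the unit interval.
pts = [(0.10, 1.0), (0.30, -1.0), (0.55, 1.0), (0.72, 1.0), (0.90, -1.0)]
root = Node(0.0, 1.0, pts)
exact = sum(q * (2.0 - x) / abs(2.0 - x) ** 3 for x, q in pts)
approx = field_at(root, 2.0, theta=0.0)  # theta=0 opens every node: exact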

    A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence

    A hybrid scheme that uses MPI for distributed-memory parallelism and OpenMP for shared-memory parallelism is presented. The work is motivated by the desire to achieve exceptionally high Reynolds numbers in pseudospectral computations of fluid turbulence on emerging petascale, high-core-count, massively parallel processing systems. The hybrid implementation derives from, and augments, a well-tested scalable MPI-parallelized pseudospectral code. The hybrid paradigm leads to a new picture for the domain decomposition of the pseudospectral grids, which is helpful in understanding, among other things, the 3D transpose of the global data that is necessary for the parallel fast Fourier transforms that are the central component of the numerical discretization. Details of the hybrid implementation are provided, and performance tests illustrate the utility of the method. It is shown that the hybrid scheme achieves near-ideal scalability up to ~20,000 compute cores with a maximum mean efficiency of 83%. Data are presented that demonstrate how to choose the optimal number of MPI processes and OpenMP threads in order to optimize code performance on two different platforms. Comment: Submitted to Parallel Computing.
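    The role of the global 3D transpose can be seen in a serial numpy emulation (no MPI or OpenMP here; `nproc` merely slices the array the way a slab decomposition would): each "rank" transforms its two local axes, a transpose makes the third axis local, and the remaining 1D transform completes the 3D FFT.

```python
import numpy as np

def fft3_by_slabs(u, nproc):
    """Serial emulation of a slab-decomposed parallel 3D FFT: each
    'rank' owns a slab of x-planes and transforms its two local axes;
    a global transpose (an MPI all-to-all in the real code) then makes
    the x axis local so the remaining 1D transform can be applied."""
    n = u.shape[0]
    assert n % nproc == 0
    chunk = n // nproc
    slabs = [np.fft.fftn(u[r * chunk:(r + 1) * chunk], axes=(1, 2))
             for r in range(nproc)]
    mid = np.concatenate(slabs, axis=0)  # stands in for the transpose
    return np.fft.fft(mid, axis=0)

rng = np.random.default_rng(0)
u = rng.standard_normal((8, 8, 8))
result = fft3_by_slabs(u, nproc=4)
```

    Because the 3D DFT is separable, the slab-by-slab result matches a monolithic `np.fft.fftn`; in the hybrid scheme the per-slab work is further threaded with OpenMP.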

    Parallel HOP: A Scalable Halo Finder for Massive Cosmological Data Sets

    Modern N-body cosmological simulations contain billions (10^9) of dark matter particles. These simulations require hundreds to thousands of gigabytes of memory and employ hundreds to tens of thousands of processing cores on many compute nodes. In order to study the distribution of dark matter in a cosmological simulation, the dark matter halos must be identified using a halo finder, which establishes the halo membership of every particle in the simulation. The resources required for halo finding are similar to the requirements for the simulation itself. In particular, simulations have become too extensive for commonly employed halo finders, such that the computational requirements to identify halos must now be spread across multiple nodes and cores. Here we present a scalable parallel halo-finding method called Parallel HOP for large-scale cosmological simulation data. Based on the halo finder HOP, it utilizes MPI and domain decomposition to distribute the halo-finding workload across multiple compute nodes, enabling analysis of much larger datasets than is possible with the strictly serial or previous parallel implementations of HOP. We provide a reference implementation of this method as a part of the toolkit yt, an analysis toolkit for Adaptive Mesh Refinement (AMR) data that includes complementary analysis modules. Additionally, we discuss a suite of benchmarks that demonstrate that this method scales well up to several hundred tasks and datasets in excess of 2000^3 particles. The Parallel HOP method and our implementation can be readily applied to any kind of N-body simulation data and are therefore widely applicable. Comment: 29 pages, 11 figures, 2 tables.
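    Stripped of MPI, domain decomposition, and kd-tree density estimation, the underlying HOP grouping step can be sketched as follows (the densities and neighbour lists are toy inputs; in the real finder they come from a smoothing-kernel estimate over nearest neighbours):

```python
def hop_groups(density, neighbors):
    """Toy HOP-style grouping: each particle hops to its densest
    neighbour (itself included); hop chains terminate at local density
    maxima, and all particles draining to the same maximum form one halo."""
    link = [max(neighbors[i] + [i], key=lambda j: density[j])
            for i in range(len(density))]

    def peak(i):
        # Follow the hop chain until it reaches a self-linked maximum.
        while link[i] != i:
            i = link[i]
        return i

    return [peak(i) for i in range(len(density))]

# 1D toy data: two density peaks (indices 2 and 6) -> two halos.
density = [1.0, 2.0, 3.0, 2.0, 1.5, 2.5, 5.0, 2.0]
neighbors = [[j for j in (i - 1, i + 1) if 0 <= j < len(density)]
             for i in range(len(density))]
groups = hop_groups(density, neighbors)
```

    The parallel version partitions particles spatially across MPI tasks and stitches groups that straddle sub-domain boundaries; the chain-following step above is the serial core.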

    Predicting Flows of Rarefied Gases

    DSMC Analysis Code (DAC) is a flexible, highly automated, easy-to-use computer program for predicting flows of rarefied gases -- especially flows of upper-atmospheric, propulsion, and vented gases impinging on spacecraft surfaces. DAC implements the direct simulation Monte Carlo (DSMC) method, which is widely recognized as the standard for simulating flows at densities so low that the continuum-based equations of computational fluid dynamics are invalid. DAC enables users to model complex surface shapes and boundary conditions quickly and easily. The discretization of a flow field into computational grids is automated, thereby relieving the user of a traditionally time-consuming task while ensuring (1) appropriate refinement of grids throughout the computational domain, (2) determination of optimal settings for temporal discretization and other simulation parameters, and (3) satisfaction of the fundamental constraints of the method. In so doing, DAC ensures an accurate and efficient simulation. In addition, DAC can utilize parallel processing to reduce computation time. The domain decomposition needed for parallel processing is completely automated, and the software employs a dynamic load-balancing mechanism to ensure optimal parallel efficiency throughout the simulation.
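    DAC's actual load-balancing mechanism is not detailed in the abstract, but the flavour of dynamic rebalancing -- reassigning grid cells so that no processor waits on the most heavily loaded one -- can be illustrated with a generic greedy scheme (all names and costs below are hypothetical):

```python
import heapq

def rebalance(cell_costs, nranks):
    """Greedy longest-processing-time assignment: hand out the most
    expensive cells first, each to the currently least-loaded rank.
    Returns the resulting per-rank load."""
    heap = [(0.0, r, []) for r in range(nranks)]
    heapq.heapify(heap)
    for cost, cell in sorted(((c, i) for i, c in enumerate(cell_costs)),
                             reverse=True):
        load, r, cells = heapq.heappop(heap)  # least-loaded rank
        cells.append(cell)
        heapq.heappush(heap, (load + cost, r, cells))
    return {r: load for load, r, cells in heap}

# Seven cells of unequal cost spread over two ranks.
loads = rebalance([5.0, 4.0, 3.0, 3.0, 2.0, 2.0, 1.0], nranks=2)
```

    In a running DSMC simulation the per-cell costs change as particles migrate, so such a rebalancing step would be repeated periodically.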

    A transient FETI methodology for large-scale parallel implicit computations in structural mechanics, part 2

    Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low-frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because explicit schemes are also easier to parallelize than implicit ones. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet, and perhaps will never, be offset by the speed of parallel hardware. Therefore, it is essential to develop efficient and robust alternatives to direct methods that are also amenable to massively parallel processing, because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient than explicit codes when simulating low-frequency dynamics. Here we present a domain decomposition method for implicit schemes that requires significantly less storage than factorization algorithms, that is several times faster than other popular direct and iterative methods, that can be easily implemented on both shared- and local-memory parallel processors, and that is efficient in both computation and communication. The proposed transient domain decomposition method is an extension of the method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat and Roux for the solution of static problems. Serial and parallel performance results on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for realistic structural dynamics problems. These results establish the superiority of the FETI method over both the serial/parallel conjugate gradient algorithm with diagonal scaling and the serial/parallel direct method, and contrast the computational power of the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system.
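    The "tearing and interconnecting" idea can be seen on a toy problem. This is not the FETI solver itself -- FETI iterates on a dual interface problem and treats floating subdomains via pseudo-inverses and rigid-body modes -- but a direct solve of the torn system with its Lagrange multiplier, which shares the same structure:

```python
import numpy as np

# Unit-stiffness bar of four springs (nodes 0..4, node 0 clamped,
# unit load at node 4), torn at node 2 into two subdomains.
K1 = np.array([[2.0, -1.0], [-1.0, 1.0]])  # dofs: u1, u2 (left copy)
f1 = np.zeros(2)
K2 = np.array([[1.0, -1.0, 0.0],           # dofs: u2 (right copy), u3, u4
               [-1.0, 2.0, -1.0],
               [0.0, -1.0, 1.0]])          # floating subdomain: K2 is singular
f2 = np.array([0.0, 0.0, 1.0])
# One Lagrange multiplier "interconnects" the two copies of node 2.
B1 = np.array([[0.0, 1.0]])
B2 = np.array([[-1.0, 0.0, 0.0]])

# Assemble and solve the saddle-point system of the torn problem.
Z = np.zeros
A = np.block([[K1, Z((2, 3)), B1.T],
              [Z((3, 2)), K2, B2.T],
              [B1, B2, Z((1, 1))]])
sol = np.linalg.solve(A, np.concatenate([f1, f2, [0.0]]))
u1, u2, lam = sol[:2], sol[2:5], sol[5]
```

    The multiplier `lam` is the interface force transmitted through the torn node; the subdomain displacements agree with the un-torn global solution (u_i = i for this bar).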

    Unstructured parallel grid generation.

    The ultimate goal of this study is to develop a 'tool' by which large-scale unstructured grids for realistic engineering problems can be generated efficiently on any parallel computer platform. The adopted strategy is based upon a geometrical partitioning concept, where the computational domain is sub-divided into a number of sub-domains, which are then gridded independently in parallel. This study focuses on three-dimensional applications only, and it implements a Delaunay-triangulation-based generator to generate the sub-domain grids. Two different approaches have been investigated, where the variations between them are limited to (i) the domain decomposition and (ii) the inter-domain boundary gridding algorithms only. In order to carry out the domain decomposition task, the first approach requires an initial tetrahedral grid to be constructed, whilst the second approach operates directly on the boundary triangular grid. Hence, this thesis will refer to the first approach as the 'indirect decomposition method' and to the second as the 'direct decomposition method'. Work presented in this thesis also concerns the development of a framework in which all the different sub-algorithms are integrated in combination with a specially designed parallel processing technique, termed Dynamic Parallel Processing (DPP). The framework adopts the Message Passing Library (MPL) programming model and implements a Single Program Multiple Data (SPMD) structure with a Manager/Workers mechanism. The DPP provides great flexibility and efficiency in exploiting the available computing resources. The framework has proved to be a very effective tool for generating large-scale grids. Grids for realistic engineering problems, of the order of 115 million elements, generated using one processor of an SGI Challenge machine with 512 MBytes of shared memory, will be presented.
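    The geometrical partitioning concept -- splitting the domain into balanced spatial sub-domains that can then be gridded independently -- can be sketched with generic recursive coordinate bisection (this is an illustrative stand-in, not the thesis's indirect or direct decomposition algorithms):

```python
def rcb(points, ndomains):
    """Recursive coordinate bisection: split the point set at the
    median of its widest coordinate, recursing until `ndomains`
    (assumed a power of two) equally sized sub-domains remain."""
    if ndomains == 1:
        return [points]
    dim = len(points[0])
    spans = [max(p[d] for p in points) - min(p[d] for p in points)
             for d in range(dim)]
    axis = spans.index(max(spans))        # cut the widest direction
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                   # median split: balanced halves
    return rcb(pts[:mid], ndomains // 2) + rcb(pts[mid:], ndomains // 2)

# A 4x4 planar point cloud split into four balanced sub-domains.
points = [(float(x), float(y)) for x in range(4) for y in range(4)]
parts = rcb(points, ndomains=4)
```

    In the thesis the partitioning additionally has to produce griddable inter-domain boundary surfaces, which is what distinguishes its two decomposition methods.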

    Analog Signal Processing Elements for Energy-Constrained Platforms

    Energy-constrained processing poses a number of challenges that have resulted in tremendous innovations over the past decade. Shrinking supply voltages and limited clock speeds have placed an emphasis on processing efficiency over the raw throughput of a processor. One approach to increasing processing efficiency is to use parallel processing with slower, lower-resolution processing elements. By utilizing this parallel approach, power consumption can be decreased while maintaining data throughput relative to other, more power-hungry architectures.

    This low-resolution, parallel architecture has direct application in the analog as well as the digital domain. Indeed, research shows that as the resolution of a signal processor falls below a system-dependent threshold, it is almost always more efficient to perform the processing in the analog domain. These continuous-time circuits have long been used in the most energy-constrained applications, ranging from pacemakers and cochlear implants to wireless sensor motes designed to run autonomously for months in the field.

    Most audio processing techniques utilize spectral decomposition as the first step of their algorithms, whether by an FFT/DFT in the digital domain or a bank of bandpass filters in the analog domain. The work presented here is designed to function within the parallel, array-based environment of a bank of bandpass filters. Work to improve the simulation of programmable analog storage elements (Floating-Gate transistors) in typical SPICE-based simulators is presented, along with a novel method of harnessing the unique properties of these Floating-Gate (FG) transistors to extend the linear range of a differential pair. These improvements in simulation and linearity are demonstrated in a Variable-Gain Amplifier (VGA) that compresses large differential inputs into small single-ended outputs suitable for processing by other analog elements. Finally, a novel circuit composed of only six transistors is proposed to compute the continuous-time derivative of a signal within the sub-banded architecture of the bandpass filter bank.
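    The bandpass-filter-bank front end described in this abstract can be mimicked digitally (an illustrative stand-in, not the dissertation's analog circuits): each channel is a resonator centred on one band, and a pure tone deposits most of its energy in the channel whose centre frequency matches.

```python
import math

def band_energy(samples, f0, fs, r=0.95):
    """Two-pole digital resonator centred at f0 (pole radius r):
    a minimal stand-in for one channel of an analog band-pass bank.
    Returns the output energy of the channel."""
    w0 = 2.0 * math.pi * f0 / fs
    a1, a2 = -2.0 * r * math.cos(w0), r * r
    y1 = y2 = 0.0
    energy = 0.0
    for x in samples:
        y = x - a1 * y1 - a2 * y2   # direct-form recursion
        energy += y * y
        y1, y2 = y, y1
    return energy

# A 500 Hz tone fed to four hypothetical channel centres.
fs = 8000
tone = [math.sin(2.0 * math.pi * 500.0 * n / fs) for n in range(2000)]
centres = [250.0, 500.0, 1000.0, 2000.0]
energies = [band_energy(tone, f, fs) for f in centres]
loudest = centres[energies.index(max(energies))]
```

    A continuous-time analog bank performs the same decomposition with sub-threshold circuits at a fraction of the power, which is the motivation for the analog elements developed in this work.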