5,913 research outputs found
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some applications paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica-
tions using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and o er some suggested areas for language development and integration
between coarse-grained and ne-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas including: partial di erential equations; graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scienti c applications developers
Advanced Message Routing for Scalable Distributed Simulations
The Joint Forces Command (JFCOM) Experimentation Directorate (J9)'s recent Joint Urban Operations (JUO)
experiments have demonstrated the viability of Forces Modeling and Simulation in a distributed environment. The
JSAF application suite, combined with the RTI-s communications system, provides the ability to run distributed
simulations with sites located across the United States, from Norfolk, Virginia to Maui, Hawaii. Interest-aware
routers are essential for communications in the large, distributed environments, and the current RTI-s framework
provides such routers connected in a straightforward tree topology. This approach is successful for small to medium
sized simulations, but faces a number of significant limitations for very large simulations over high-latency, wide
area networks. In particular, traffic is forced through a single site, drastically increasing distances messages must
travel to sites not near the top of the tree. Aggregate bandwidth is limited to the bandwidth of the site hosting the
top router, and failures in the upper levels of the router tree can result in widespread communications losses
throughout the system.
To resolve these issues, this work extends the RTI-s software router infrastructure to accommodate more
sophisticated, general router topologies, including both the existing tree framework and a new generalization of the
fully connected mesh topologies used in the SF Express ModSAF simulations of 100K fully interacting vehicles.
The new software router objects incorporate the scalable features of the SF Express design, while optionally using
low-level RTI-s objects to perform actual site-to-site communications. The (substantial) limitations of the original
mesh router formalism have been eliminated, allowing fully dynamic operations. The mesh topology capabilities
allow aggregate bandwidth and site-to-site latencies to match actual network performance. The heavy resource load at
the root node can now be distributed across routers at the participating sites
Group implicit concurrent algorithms in nonlinear structural dynamics
During the 70's and 80's, considerable effort was devoted to developing efficient and reliable time stepping procedures for transient structural analysis. Mathematically, the equations governing this type of problems are generally stiff, i.e., they exhibit a wide spectrum in the linear range. The algorithms best suited to this type of applications are those which accurately integrate the low frequency content of the response without necessitating the resolution of the high frequency modes. This means that the algorithms must be unconditionally stable, which in turn rules out explicit integration. The most exciting possibility in the algorithms development area in recent years has been the advent of parallel computers with multiprocessing capabilities. So, this work is mainly concerned with the development of parallel algorithms in the area of structural dynamics. A primary objective is to devise unconditionally stable and accurate time stepping procedures which lend themselves to an efficient implementation in concurrent machines. Some features of the new computer architecture are summarized. A brief survey of current efforts in the area is presented. A new class of concurrent procedures, or Group Implicit algorithms is introduced and analyzed. The numerical simulation shows that GI algorithms hold considerable promise for application in coarse grain as well as medium grain parallel computers
Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph
partitioning together with applications and future research directions
High-resolution simulations of planetesimal formation in turbulent protoplanetary discs
We present high-resolution computer simulations of dust dynamics and
planetesimal formation in turbulence generated by the magnetorotational
instability. We show that the turbulent viscosity associated with
magnetorotational turbulence in a non-stratified shearing box increases when
going from 256^3 to 512^3 grid points in the presence of a weak imposed
magnetic field, yielding a turbulent viscosity of at high
resolution. Particles representing approximately meter-sized boulders
concentrate in large-scale high-pressure regions in the simulation box. The
appearance of zonal flows and particle concentration in pressure bumps is
relatively similar at moderate (256^3) and high (512^3) resolution. In the
moderate-resolution simulation we activate particle self-gravity at a time when
there is little particle concentration, in contrast with previous simulations
where particle self-gravity was activated during a concentration event. We
observe that bound clumps form over the next ten orbits, with initial birth
masses of a few times the dwarf planet Ceres. At high resolution we activate
self-gravity during a particle concentration event, leading to a burst of
planetesimal formation, with clump masses ranging from a significant fraction
of to several times the mass of Ceres. We present a new domain decomposition
algorithm for particle-mesh schemes. Particles are spread evenly among the
processors and the local gas velocity field and assigned drag forces are
exchanged between a domain-decomposed mesh and discrete blocks of particles. We
obtain good load balancing on up to 4096 cores even in simulations where
particles sediment to the mid-plane and concentrate in pressure bumps.Comment: Accepted for publication in Astronomy & Astrophysics, with some
changes in response to referee repor
- …