A sparse octree gravitational N-body code that runs entirely on the GPU processor
We present parallel algorithms for constructing and traversing sparse octrees
on graphics processing units (GPUs). The algorithms are based on parallel-scan
and sort methods. To test the performance and feasibility, we implemented them
in CUDA in the form of a gravitational tree-code that runs entirely on the
GPU. (The code is publicly available at:
http://castle.strw.leidenuniv.nl/software.html) The tree-construction and
traversal algorithms are portable to many-core devices that support the CUDA
or OpenCL programming languages. The gravitational tree-code outperforms tuned
CPU code during tree construction and shows an overall performance improvement
of more than a factor of 20, resulting in a processing rate of more than 2.8
million particles per second.
Comment: Accepted version. Published in Journal of Computational Physics. 35
pages, 12 figures, single column.
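As an illustration of the sort-based construction this abstract describes, the sketch below groups particles into octree cells by computing Morton keys, sorting them, and finding cell boundaries from key prefixes. It is a minimal NumPy toy, not the authors' CUDA implementation; the helper names, grid resolution, and level parameter are all illustrative.

```python
import numpy as np

def morton_key(ix, iy, iz, bits=10):
    """Interleave the bits of three integer grid coordinates into one Morton key."""
    key = np.zeros_like(ix, dtype=np.uint64)
    for b in range(bits):
        key |= ((ix >> b) & 1).astype(np.uint64) << np.uint64(3 * b)
        key |= ((iy >> b) & 1).astype(np.uint64) << np.uint64(3 * b + 1)
        key |= ((iz >> b) & 1).astype(np.uint64) << np.uint64(3 * b + 2)
    return key

def build_level_cells(pos, level, bits=10):
    """Group particles into octree cells of a given level by sorting Morton keys."""
    # Quantise positions (assumed to lie in the unit cube) onto a 2^bits grid.
    grid = (np.clip(pos, 0.0, 1.0 - 1e-9) * (1 << bits)).astype(np.uint32)
    keys = morton_key(grid[:, 0], grid[:, 1], grid[:, 2], bits)
    order = np.argsort(keys, kind="stable")   # a GPU code would use a parallel radix sort
    sorted_keys = keys[order]
    # Particles sharing the top 3*level key bits fall into the same cell at this level.
    prefix = sorted_keys >> np.uint64(3 * (bits - level))
    # Cell boundaries are where the prefix changes (a parallel scan on the GPU).
    starts = np.flatnonzero(np.r_[True, prefix[1:] != prefix[:-1]])
    counts = np.diff(np.r_[starts, len(prefix)])
    return order, starts, counts

# Usage: 10,000 random particles, cells at octree level 3.
pos = np.random.rand(10_000, 3)
order, starts, counts = build_level_cells(pos, level=3)
```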
QYMSYM: A GPU-Accelerated Hybrid Symplectic Integrator That Permits Close Encounters
We describe a parallel hybrid symplectic integrator for planetary system
integration that runs on a graphics processing unit (GPU). The integrator
identifies close approaches between particles and switches from symplectic to
Hermite algorithms for particles that require higher resolution integrations.
The integrator is approximately as accurate as other hybrid symplectic
integrators but is GPU-accelerated.
Comment: 17 pages, 2 figures.
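The switching structure the abstract describes, in which bodies involved in a close approach are handed to a higher-resolution integrator while the rest take a single symplectic step, can be sketched as follows. This is an illustrative NumPy toy, not QYMSYM's scheme: the leapfrog step stands in for the symplectic map, many small sub-steps stand in for the Hermite part, and the r_crit threshold and other parameters are invented.

```python
import numpy as np

def accelerations(pos, mass, eps=1e-6):
    """Pairwise Newtonian accelerations (G = 1), softened by eps."""
    d = pos[None, :, :] - pos[:, None, :]                # d[i, j] = r_j - r_i
    r2 = (d ** 2).sum(-1) + eps ** 2
    np.fill_diagonal(r2, np.inf)                         # no self-force
    return (mass[None, :, None] * d / r2[..., None] ** 1.5).sum(axis=1)

def hybrid_step(pos, vel, mass, dt, r_crit=0.05, n_sub=64):
    """Advance one step: far bodies take one kick-drift-kick step, bodies in a
    close encounter take n_sub smaller sub-steps (standing in for Hermite)."""
    d = np.linalg.norm(pos[None, :, :] - pos[:, None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    close = np.unique(np.where(d < r_crit)[0])           # indices needing fine steps
    far = np.setdiff1d(np.arange(len(mass)), close)

    def leapfrog(idx, h):
        vel[idx] += 0.5 * h * accelerations(pos, mass)[idx]
        pos[idx] += h * vel[idx]
        vel[idx] += 0.5 * h * accelerations(pos, mass)[idx]

    leapfrog(far, dt)                                    # one coarse step
    for _ in range(n_sub):                               # many fine steps
        leapfrog(close, dt / n_sub)
    return pos, vel

# Usage: a made-up 50-body system in a unit box.
rng = np.random.default_rng(1)
pos = rng.random((50, 3))
vel = np.zeros((50, 3))
mass = np.full(50, 1.0 / 50)
for _ in range(100):
    pos, vel = hybrid_step(pos, vel, mass, dt=1e-3)
```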
Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor
With its ease of programming, flexibility, and efficiency, MapReduce has
become one of the most popular frameworks for building big-data applications.
MapReduce was originally designed for distributed computing and has since been
extended to various architectures, e.g., multi-core CPUs, GPUs, and FPGAs. In
this work, we focus on optimizing the MapReduce framework on Xeon Phi, which is
the latest product released by Intel based on the Many Integrated Core
Architecture. To the best of our knowledge, this is the first work to optimize
the MapReduce framework on the Xeon Phi.
In our work, we utilize advanced features of the Xeon Phi to achieve high
performance. In order to take advantage of the SIMD vector processing units, we
propose a vectorization-friendly technique for the map phase to assist
auto-vectorization, and we develop SIMD hash computation algorithms.
Furthermore, we utilize MIMD hyper-threading to pipeline the map and reduce
phases and thereby improve resource utilization. For some applications, we also
eliminate multiple local arrays in favor of low-cost atomic operations on a
single global array, which improves thread scalability and data locality thanks
to the coherent L2 caches. Finally, for a given application, our framework can
either automatically detect suitable techniques to apply or provide guidelines for
users at compilation time. We conduct comprehensive experiments to benchmark
the Xeon Phi and compare our optimized MapReduce framework with a
state-of-the-art multi-core based MapReduce framework (Phoenix++). By
evaluating six real-world applications, the experimental results show that our
optimized framework is 1.2X to 38X faster than Phoenix++ for various
applications on the Xeon Phi.
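Two of the ideas above, a vectorization-friendly map phase (hashing a whole batch of keys at once so SIMD lanes stay busy) and replacing per-thread local arrays with atomic updates on one global array, can be imitated in a few lines of NumPy. This is only an illustration of the pattern, not the paper's code: the FNV-style constants, the fixed 8-byte key layout, the table size, and the use of np.add.at to emulate atomic adds are all assumptions made here.

```python
import numpy as np

def batched_hash(keys, table_size):
    """Hash a batch of fixed-length byte keys at once (FNV-1a style, vectorised).
    keys: uint8 array of shape (batch, key_len)."""
    h = np.full(keys.shape[0], 0x811C9DC5, dtype=np.uint64)   # FNV-1a 32-bit offset basis
    for col in range(keys.shape[1]):                          # loop over bytes, not over keys
        h = (h ^ keys[:, col]) * np.uint64(0x01000193)        # whole batch per iteration
    return (h & np.uint64(0xFFFFFFFF)) % np.uint64(table_size)

# Map phase sketch: words -> bucket indices, then counts accumulated on one
# shared global array instead of per-thread local tables.
words = np.array([list(w.ljust(8)[:8].encode()) for w in
                  ["map", "reduce", "phi", "simd", "map"]], dtype=np.uint8)
buckets = batched_hash(words, table_size=1024)
counts = np.zeros(1024, dtype=np.int64)
np.add.at(counts, buckets, 1)        # emulates atomic adds on the shared array
```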
Parallel Algorithm and Dynamic Exponent for Diffusion-limited Aggregation
A parallel algorithm for "diffusion-limited aggregation" (DLA) is described
and analyzed from the perspective of computational complexity. The dynamic
exponent z of the algorithm is defined with respect to the probabilistic
parallel random-access machine (PRAM) model of parallel computation according
to T ~ L^z, where L is the cluster size, T is the running time, and the
algorithm uses a number of processors polynomial in L. It is argued that
z=D-D_2/2, where D is the fractal dimension and D_2 is the second generalized
dimension. Simulations of DLA are carried out to measure D_2 and to test
scaling assumptions employed in the complexity analysis of the parallel
algorithm. It is plausible that the parallel algorithm attains the minimum
possible value of the dynamic exponent, in which case z characterizes the
intrinsic history dependence of DLA.
Comment: 24 pages, RevTeX, and 2 figures. A major improvement to the algorithm
and a smaller dynamic exponent in this version.
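The complexity claim above concerns how the PRAM running time T scales with cluster size L, T ~ L^z. A minimal sketch of how such an exponent could be extracted from timing measurements is a log-log fit; the sizes and times below are made up purely for illustration.

```python
import numpy as np

# Hypothetical (made-up) cluster sizes and measured parallel running times.
L = np.array([64, 128, 256, 512, 1024], dtype=float)
T = np.array([210.0, 520.0, 1290.0, 3180.0, 7900.0])

# T ~ L^z  =>  log T = z * log L + const; fit z by least squares on the logs.
z, intercept = np.polyfit(np.log(L), np.log(T), 1)
print(f"estimated dynamic exponent z = {z:.2f}")
```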
Entropy-based parametric estimation of spike train statistics
We consider the evolution of a network of neurons, focusing on the asymptotic
behavior of spike dynamics instead of membrane potential dynamics. In this
context, the spike response is sought not as a deterministic response but as a
conditional probability: "Reading out the code" consists of inferring such a
probability. This probability is computed from empirical raster plots using
the framework of thermodynamic formalism in ergodic theory. This gives us a
parametric statistical model where the probability has the form of a Gibbs
distribution. In this respect, this approach generalizes the seminal and
profound work of Schneidman and collaborators. A minimal presentation of the
formalism is reviewed here, while a general algorithmic estimation method is
proposed, yielding fast convergent implementations. It is also made explicit how
several spike observables (entropy, rate, synchronizations, correlations) are
obtained in closed form from the parametric estimation. This paradigm not only
allows us to estimate the spike statistics, given a design choice, but also to
compare different models, thus answering comparative questions about the neural
code such as: "Are correlations (or time synchrony, or a given set of spike
patterns, ...) significant with respect to rate coding only?" A numerical
validation of the method is proposed and the perspectives regarding spike-train
code analysis are also discussed.
Comment: 37 pages, 8 figures, submitted.
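To make the Gibbs-distribution estimation concrete in its simplest (memoryless, pairwise) special case, which is the setting of Schneidman et al. that the abstract says it generalizes, the toy below fits P(s) proportional to exp(sum_i h_i s_i + sum_{i<j} J_ij s_i s_j) to binary spike patterns by matching rates and pairwise co-activations. It is a didactic sketch only: the exact-enumeration normalization limits it to a handful of neurons, the raster is random, and the full thermodynamic-formalism machinery with temporal dependence is not captured.

```python
import numpy as np
from itertools import product

def fit_ising_gibbs(raster, n_iter=2000, lr=0.1):
    """Fit a pairwise Gibbs (maximum-entropy) model to binary spike patterns by
    matching first and second moments; exact enumeration, so small N only."""
    n_bins, n_cells = raster.shape
    emp_mean = raster.mean(0)                          # empirical rates <s_i>
    emp_corr = (raster.T @ raster) / n_bins            # empirical co-activations <s_i s_j>
    h = np.zeros(n_cells)
    J = np.zeros((n_cells, n_cells))
    states = np.array(list(product([0, 1], repeat=n_cells)), dtype=float)
    for _ in range(n_iter):
        energy = states @ h + 0.5 * np.einsum("ki,ij,kj->k", states, J, states)
        p = np.exp(energy - energy.max())
        p /= p.sum()                                   # Gibbs distribution over all 2^N states
        model_mean = p @ states
        model_corr = (states * p[:, None]).T @ states
        h += lr * (emp_mean - model_mean)              # gradient ascent on log-likelihood
        J += lr * (emp_corr - model_corr)
        np.fill_diagonal(J, 0.0)                       # diagonal is carried by h
    return h, J

# Usage on a made-up raster: 5 neurons, 10,000 time bins.
rng = np.random.default_rng(0)
raster = (rng.random((10_000, 5)) < 0.1).astype(float)
h, J = fit_ising_gibbs(raster)
```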
Implicit Decomposition for Write-Efficient Connectivity Algorithms
The future of main memory appears to lie in the direction of new technologies
that provide strong capacity-to-performance ratios, but have write operations
that are much more expensive than reads in terms of latency, bandwidth, and
energy. Motivated by this trend, we propose sequential and parallel algorithms
to solve graph connectivity problems using significantly fewer writes than
conventional algorithms. Our primary algorithmic tool is the construction of an
o(n)-sized "implicit decomposition" of a bounded-degree graph G on n nodes,
which, combined with read-only access to G, enables fast answers to
connectivity and biconnectivity queries on G. The construction breaks the
linear-write "barrier", resulting in costs that are asymptotically lower than
conventional algorithms while adding only a modest cost to querying time. For
general non-sparse graphs on m edges, we also provide the first parallel
algorithms for connectivity and biconnectivity that use o(m) writes and O(m)
operations.
These algorithms provide insight into how applications can efficiently process
computations on large graphs in systems with read-write asymmetry.
Real-Time Lattice Boltzmann Shallow Waters Method for Breaking Wave Simulations
We present a new approach for the simulation of surface-based fluids based on a hybrid formulation of the Lattice Boltzmann Method for Shallow Waters and particle systems. The modified LBM can handle arbitrary underlying terrain conditions and arbitrary fluid depth. It also introduces a novel method for tracking dry-wet regions and moving boundaries. Dynamic rigid bodies are also included in our simulations using a two-way coupling. Certain features of the simulation that the LBM cannot handle because of its heightfield nature, such as breaking waves, are detected and automatically turned into splash particles. Here we use a ballistic particle system, but our hybrid method can accommodate more complex systems such as SPH. Both the LBM and the particle systems are implemented in CUDA, although dynamic rigid bodies are simulated on the CPU. We show the effectiveness of our method with various examples, which achieve real-time performance on consumer-level hardware.
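The hybrid heightfield-plus-particles coupling described above can be sketched very schematically: cells where the surface becomes too steep are flagged as breaking, some of their volume is emitted as ballistic splash particles, and particles are re-deposited onto the heightfield when they fall back below it. The sketch below is a heavily simplified 1D toy, not the paper's method: the heightfield solver itself (the shallow-water LBM) is replaced by a stand-in, and the steepness threshold, emitted volume fraction, and launch velocities are invented.

```python
import numpy as np

G = 9.81
STEEP_MAX = 0.8        # illustrative breaking-wave steepness threshold (dh/dx)

def splash_step(h, particles, dx, dt):
    """One coupling step: flag steep cells, emit ballistic particles there,
    advect existing particles, and re-deposit those that land."""
    # The shallow-water LBM update of h would go here; this sketch only
    # inspects the current height array.
    slope = np.abs(np.gradient(h, dx))
    for i in np.flatnonzero(slope > STEEP_MAX):
        dv = 0.1 * h[i] * dx                         # take a small volume off the crest
        h[i] -= dv / dx
        particles.append([i * dx, h[i], 1.0, 2.0, dv])   # [x, y, vx, vy, volume]
    kept = []
    for x, y, vx, vy, dv in particles:               # ballistic update and re-deposition
        vy -= G * dt
        x += vx * dt
        y += vy * dt
        j = int(np.clip(round(x / dx), 0, len(h) - 1))
        if y <= h[j]:
            h[j] += dv / dx                          # particle lands: return its volume
        else:
            kept.append([x, y, vx, vy, dv])
    return h, kept

# Usage on a made-up 1D heightfield with a steep bump.
h = 1.0 + 2.0 * np.exp(-np.linspace(-3, 3, 200) ** 2)
particles = []
for _ in range(100):
    h, particles = splash_step(h, particles, dx=0.05, dt=0.01)
```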
A Parallel Monte Carlo Code for Simulating Collisional N-body Systems
We present a new parallel code for computing the dynamical evolution of
collisional N-body systems with up to N~10^7 particles. Our code is based on
the Hénon Monte Carlo method for solving the Fokker-Planck equation, and
makes assumptions of spherical symmetry and dynamical equilibrium. The
principal algorithmic developments involve optimizing data structures and
introducing a parallel random number generation scheme as well as a parallel
sorting algorithm, both required to find nearest neighbors for interactions
and to compute the gravitational potential. The new algorithms we introduce
along with our choice of decomposition scheme minimize communication costs and
ensure optimal distribution of data and workload among the processing units.
The implementation uses the Message Passing Interface (MPI) library for
communication, which makes it portable to many different supercomputing
architectures. We validate the code by calculating the evolution of clusters
with initial Plummer distribution functions up to core collapse with the number
of stars, N, spanning three orders of magnitude, from 10^5 to 10^7. We find
that our results are in good agreement with self-similar core-collapse
solutions, and the core collapse times generally agree with expectations from
the literature. Also, we observe good total energy conservation, within less
than 0.04% throughout all simulations. We analyze the performance of the code,
and demonstrate near-linear scaling of the runtime with the number of
processors up to 64 processors for N=10^5, 128 for N=10^6 and 256 for N=10^7.
The runtime saturates as more processors are added beyond these limits, which
is a characteristic of the parallel sorting algorithm. The
resulting maximum speedups we achieve are approximately 60x, 100x, and 220x,
respectively.
Comment: 53 pages, 13 figures, accepted for publication in ApJ Supplement.
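Two of the building blocks mentioned above, per-process independent random streams and a sort of a distributed array, follow a common MPI pattern that can be sketched as below. This is not the paper's implementation: it assumes mpi4py and NumPy are available, spawns statistically independent streams with SeedSequence, and replaces the genuinely parallel sort (the part whose cost causes the scaling saturation noted in the abstract) with a simple gather-and-sort stand-in on rank 0.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Independent per-rank random streams: rank 0 spawns child seeds and scatters them.
if rank == 0:
    children = np.random.SeedSequence(12345).spawn(size)
else:
    children = None
seed = comm.scatter(children, root=0)
rng = np.random.default_rng(seed)

# Each rank owns a block of stars; radii are sampled from its own stream.
local_r = rng.random(1000)

# Stand-in for the parallel sort: gather everything to rank 0 and sort there.
all_r = comm.gather(local_r, root=0)
if rank == 0:
    ordered = np.sort(np.concatenate(all_r))
    print("global min/max radius:", ordered[0], ordered[-1])
```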