3 research outputs found
Achieving Extreme Resolution in Numerical Cosmology Using Adaptive Mesh Refinement: Resolving Primordial Star Formation
As an entry for the 2001 Gordon Bell Award in the "special" category, we
describe our 3-d, hybrid, adaptive mesh refinement (AMR) code, Enzo, designed
for high-resolution, multiphysics, cosmological structure formation
simulations. Our parallel implementation places no limit on the depth or
complexity of the adaptive grid hierarchy, allowing us to achieve unprecedented
spatial and temporal dynamic range. We report on a simulation of primordial
star formation which develops over 8000 subgrids at 34 levels of refinement to
achieve a local refinement of a factor of 10^12 in space and time. This allows
us to resolve the properties of the first stars which form in the universe
assuming standard physics and a standard cosmological model. Achieving extreme
resolution requires the use of 128-bit extended precision arithmetic (EPA) to
accurately specify the subgrid positions. We describe our EPA AMR
implementation on the IBM SP2 Blue Horizon system at the San Diego
Supercomputer Center.Comment: 23 pages, 5 figures. Peer reviewed technical paper accepted to the
proceedings of Supercomputing 2001. This entry was a Gordon Bell Prize
finalist. For more information visit http://www.TomAbel.com/GB
4.45 Pflops Astrophysical N-Body Simulation on K computer -- The Gravitational Trillion-Body Problem
As an entry for the 2012 Gordon-Bell performance prize, we report performance
results of astrophysical N-body simulations of one trillion particles performed
on the full system of K computer. This is the first gravitational trillion-body
simulation in the world. We describe the scientific motivation, the numerical
algorithm, the parallelization strategy, and the performance analysis. Unlike
many previous Gordon-Bell prize winners that used the tree algorithm for
astrophysical N-body simulations, we used the hybrid TreePM method, for similar
level of accuracy in which the short-range force is calculated by the tree
algorithm, and the long-range force is solved by the particle-mesh algorithm.
We developed a highly-tuned gravity kernel for short-range forces, and a novel
communication algorithm for long-range forces. The average performance on 24576
and 82944 nodes of K computer are 1.53 and 4.45 Pflops, which correspond to 49%
and 42% of the peak speed.Comment: 10 pages, 6 figures, Proceedings of Supercomputing 2012
(http://sc12.supercomputing.org/), Gordon Bell Prize Winner. Additional
information is http://www.ccs.tsukuba.ac.jp/CCS/eng/gbp201
A Scalable Parallel Architecture with FPGA-Based Network Processor for Scientific Computing
This thesis discuss the design and the implementation of an FPGA-Based
Network Processor for scientific computing, like Lattice Quantum ChromoDinamycs
(LQCD) and fluid-dynamics applications based on Lattice Boltzmann
Methods (LBM). State-of-the-art programs in this (and other similar)
applications have a large degree of available parallelism, that can be easily
exploited on massively parallel systems, provided the underlying communication
network has not only high-bandwidth but also low-latency.
I have designed in details, built and tested in hardware, firmware and
software an implementation of a Network Processor, tailored for the most
recent families of multi-core processors. The implementation has been developed
on an FPGA device to easily interface the logic of NWP with the CPU
I/O sub-system.
In this work I have assessed several ways to move data between the main
memory of the CPU and the I/O sub-system to exploit high data throughput
and low latency, enabling the use of “Programmed Input Output” (PIO),
“Direct Memory Access” (DMA) and “Write Combining” memory-settings.
On the software side, I developed and test a device driver for the Linux
operating system to access the NWP device, as well as a system library to
efficiently access the network device from user-applications.
This thesis demonstrates the feasibility of a network infrastructure that
saturates the maximum bandwidth of the I/O sub-systems available on recent
CPUs, and reduces communication latencies to values very close to those
needed by the processor to move data across the chip boundary