Gauge Field Generation on Large-Scale GPU-Enabled Systems
Over the past years GPUs have been successfully applied to the task of
inverting the fermion matrix in lattice QCD calculations. Even strong scaling
to capability-level supercomputers, corresponding to O(100) GPUs or more, has
been achieved. However, strong scaling a whole gauge field generation algorithm
to this regime requires significantly more functionality than just having the
matrix inverter utilize the GPUs, and this has not yet been accomplished. This
contribution extends QDP-JIT, the migration of SciDAC QDP++ to GPU-enabled
parallel systems, to help strong scale the whole Hybrid Monte Carlo algorithm to this
regime. Initial results are shown for gauge field generation with Chroma
simulating pure Wilson fermions on OLCF TitanDev.
Comment: The 30th International Symposium on Lattice Field Theory, June 24-29,
2012, Cairns, Australia (Acknowledgment and Citation added)
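For orientation, the structure of a Hybrid Monte Carlo update can be sketched on a toy problem. This is not Chroma or QDP-JIT code: it assumes a single real variable with the Gaussian action S(x) = x^2/2 in place of a lattice gauge action, and the names `action`, `force`, `leapfrog`, and `hmc_step` are hypothetical. In a real gauge field generation run, the force evaluation is where the fermion matrix inversions occur.

```python
import math
import random


def action(x):
    # Toy action S(x) = x^2 / 2 (an assumption standing in for the lattice action).
    return 0.5 * x * x


def force(x):
    # Molecular-dynamics force, F = -dS/dx.
    return -x


def leapfrog(x, p, step, n_steps):
    # Leapfrog integration of Hamilton's equations for H = p^2/2 + S(x):
    # half step in p, alternating full steps in x and p, final half step in p.
    p += 0.5 * step * force(x)
    for i in range(n_steps):
        x += step * p
        if i < n_steps - 1:
            p += step * force(x)
    p += 0.5 * step * force(x)
    return x, p


def hmc_step(x, step, n_steps, rng):
    # Refresh the momentum, integrate a trajectory, then accept or reject.
    p = rng.gauss(0.0, 1.0)
    h_old = 0.5 * p * p + action(x)
    x_new, p_new = leapfrog(x, p, step, n_steps)
    h_new = 0.5 * p_new * p_new + action(x_new)
    # Metropolis test with probability min(1, exp(-(h_new - h_old))).
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return x_new, True
    return x, False
```

Calling `hmc_step` repeatedly with a seeded `random.Random` instance yields a Markov chain for the toy distribution. The accept/reject step corrects the integration error of the leapfrog scheme, which is why the step size can be tuned for acceptance rate rather than exactness.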
Parallel Tempering Simulation of the three-dimensional Edwards-Anderson Model with Compact Asynchronous Multispin Coding on GPU
Monte Carlo simulations of the Ising model play an important role in the
field of computational statistical physics, and they have revealed many
properties of the model over the past few decades. However, the effect of
frustration due to random disorder, in particular the possible spin glass
phase, remains a crucial but poorly understood problem. One of the obstacles in
the Monte Carlo simulation of random frustrated systems is their long
relaxation time making an efficient parallel implementation on state-of-the-art
computation platforms highly desirable. The Graphics Processing Unit (GPU) is
such a platform that provides an opportunity to significantly enhance the
computational performance and thus gain new insight into this problem. In this
paper, we present optimization and tuning approaches for the CUDA
implementation of the spin glass simulation on GPUs. We discuss the integration
of various design alternatives, such as GPU kernel construction with minimal
communication, memory tiling, and look-up tables. We present a binary data
format, Compact Asynchronous Multispin Coding (CAMSC), which provides an
additional speedup compared with the traditionally used Asynchronous
Multispin Coding (AMSC). Our overall design sustains a performance of 33.5
picoseconds per spin flip attempt for simulating the three-dimensional
Edwards-Anderson model with parallel tempering, which significantly improves
the performance over existing GPU implementations.
Comment: 15 pages, 18 figures
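The general idea behind asynchronous multispin coding can be illustrated with a minimal sketch. This is not the CAMSC format from the paper, whose layout is not described in the abstract. It assumes a one-dimensional periodic chain, with bit r of each word holding the spin (or bond sign) of independent replica r, so that one XOR evaluates a bond in all replicas at once. The names `WORD_BITS`, `unsatisfied_bonds`, and `replica_energies` are hypothetical.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1


def unsatisfied_bonds(spins, couplings):
    # spins[i]: bit r is the spin at site i in replica r (1 = up, 0 = down).
    # couplings[i]: bit r is 1 if bond (i, i+1) in replica r is antiferromagnetic.
    # A set bit in the result marks an unsatisfied bond in that replica.
    n = len(spins)
    return [(spins[i] ^ spins[(i + 1) % n] ^ couplings[i]) & MASK
            for i in range(n)]


def replica_energies(spins, couplings):
    # Energy per replica in units of |J|: each unsatisfied bond contributes +1
    # and each satisfied bond contributes -1.
    n = len(spins)
    words = unsatisfied_bonds(spins, couplings)
    energies = []
    for r in range(WORD_BITS):
        unsat = sum((w >> r) & 1 for w in words)
        energies.append(2 * unsat - n)
    return energies
```

With four sites, all spins aligned, and ferromagnetic bonds, every replica has energy -4. Flipping one spin in a single replica changes only that replica's bit, which is what lets 64 independent systems share the same bitwise instructions.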
Morphological diagram of diffusion driven aggregate growth in plane: competition of anisotropy and adhesion
Two-dimensional structures grown with the Witten and Sander algorithm are
investigated. We analyze clusters grown off-lattice and clusters grown with
the antenna method with and 8 allowed growth directions. With
the help of the variable probe particles technique we measure the fractal dimension of
such clusters as a function of their size . We propose that in the
thermodynamic limit of infinite cluster size the aggregates grown with a high
degree of anisotropy () tend to have fractal dimension equal
to 3/2, while off-lattice aggregates and aggregates with lower anisotropy
() have . The noise-reduction procedure results in a
change of universality class for DLA. For a high enough noise-reduction value,
clusters with have fractal dimension going to when
.
Comment: 6 pages, 8 figures, conference CCP201
A Verified Information-Flow Architecture
SAFE is a clean-slate design for a highly secure computer system, with
pervasive mechanisms for tracking and limiting information flows. At the lowest
level, the SAFE hardware supports fine-grained programmable tags, with
efficient and flexible propagation and combination of tags as instructions are
executed. The operating system virtualizes these generic facilities to present
an information-flow abstract machine that allows user programs to label
sensitive data with rich confidentiality policies. We present a formal,
machine-checked model of the key hardware and software mechanisms used to
dynamically control information flow in SAFE and an end-to-end proof of
noninterference for this model.
We use a refinement proof methodology to propagate the noninterference
property of the abstract machine down to the concrete machine level. We use an
intermediate layer in the refinement chain that factors out the details of the
information-flow control policy and devise a code generator for compiling such
information-flow policies into low-level monitor code. Finally, we verify the
correctness of this generator using a dedicated Hoare logic that abstracts from
low-level machine instructions into a reusable set of verified structured code
generators.
Matched filters for coalescing binaries detection on massively parallel computers
We discuss some computational problems associated with matched filtering of
experimental signals from gravitational wave interferometric detectors in a
parallel-processing environment. We then specialize our discussion to the use
of the APEmille and apeNEXT processors for this task. Finally, we accurately
estimate the performance of an APEmille system on a computational load
appropriate for the LIGO and VIRGO experiments, and extrapolate our results to
apeNEXT.
Comment: 19 pages, 6 figures
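As a reminder of what matched filtering computes, the following minimal sketch correlates a data stream with a known template at every time offset and looks for the peak. It is a time-domain illustration under stated assumptions, not the APEmille implementation: production searches typically work in the frequency domain with noise weighting, and the toy linear chirp stands in for a real inspiral waveform. The names `chirp` and `matched_filter` are hypothetical.

```python
import math


def chirp(n, f0, f1):
    # Toy linear chirp with n samples, sweeping from f0 to f1 cycles over the
    # template duration (an assumption; not a physical inspiral waveform).
    return [math.sin(2.0 * math.pi * (f0 + (f1 - f0) * k / (2.0 * n)) * k / n)
            for k in range(n)]


def matched_filter(data, template):
    # Sliding correlation of the data with the template, divided by the
    # template norm so that a perfect match returns the template norm.
    norm = math.sqrt(sum(t * t for t in template))
    m = len(template)
    output = []
    for lag in range(len(data) - m + 1):
        acc = sum(data[lag + k] * template[k] for k in range(m))
        output.append(acc / norm)
    return output
```

Injecting the template into an otherwise empty data stream at a known offset and taking the index of the largest output recovers that offset. The cost of the direct sum grows as the product of data length and template length for each template, which is the computational load that motivates running many templates in parallel.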
Towards Lattice Quantum Chromodynamics on FPGA devices
In this paper we describe a single-node, double precision Field Programmable
Gate Array (FPGA) implementation of the Conjugate Gradient algorithm in the
context of Lattice Quantum Chromodynamics. As a benchmark of our proposal we
invert numerically the Dirac-Wilson operator on a 4-dimensional grid on three
Xilinx hardware solutions: the Zynq UltraScale+ evaluation board, the Alveo U250
accelerator, and the largest device available on the market, the VU13P device.
In our implementation we separate the software and hardware parts in such a way that
the entire multiplication by the Dirac operator is performed in hardware, and
the rest of the algorithm runs on the host. We find that the FPGA
implementation can offer performance comparable with that obtained using
current CPUs or Intel's many-core Xeon Phi accelerators. A possible multi-node
FPGA-based system is discussed, and we argue that power-efficient High
Performance Computing (HPC) systems can be implemented using FPGA devices only.
Comment: 17 pages, 4 figures
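The split described in the abstract, with the operator application offloaded and the rest of the solver on the host, can be sketched with a matrix-free Conjugate Gradient. This is an illustration under stated assumptions, not the authors' code: `apply_operator` is a hypothetical stand-in that applies a symmetric positive-definite one-dimensional lattice operator on the host, whereas in the paper that role is played by the Dirac-Wilson operator in hardware. Because CG requires a Hermitian positive-definite operator, lattice codes typically apply it to the normal operator built from the Dirac matrix.

```python
def apply_operator(v, mass=0.5):
    # Hypothetical stand-in for the offloaded kernel: (2 + mass^2) on the
    # diagonal minus nearest-neighbour hopping on a periodic 1D lattice.
    n = len(v)
    return [(2.0 + mass * mass) * v[i] - v[(i - 1) % n] - v[(i + 1) % n]
            for i in range(n)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    # Solve A x = b, using only calls to apply_a. Everything in this function
    # corresponds to the "host" side; apply_a is the part that would be offloaded.
    x = [0.0] * len(b)
    r = list(b)           # residual b - A x for the initial guess x = 0
    p = list(r)           # search direction
    rr = dot(r, r)
    b_norm2 = dot(b, b)
    for iteration in range(max_iter):
        if rr <= tol * tol * b_norm2:
            return x, iteration
        ap = apply_a(p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter
```

Each iteration needs exactly one operator application plus a few vector updates and dot products, so the operator call dominates the cost and is the natural candidate for acceleration.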
Investigating the Dirac operator evaluation with FPGAs
In recent years the computational capacity of single Field Programmable Gate
Array (FPGA) devices, as well as their versatility, has increased significantly.
In addition, High Level Synthesis frameworks, which allow such
processors to be programmed in a high-level language like C++, make modern FPGA devices
serious candidates as building blocks of a general purpose High Performance
Computing solution. In this contribution we describe benchmarks which we
performed using a Lattice QCD code, a highly compute-demanding HPC academic
code for elementary particle simulations. We benchmark the performance of a
single FPGA device running in two modes: using the external or embedded memory.
We discuss both approaches in detail using the Xilinx U250 device and provide
estimates for the necessary memory throughput and the minimal amount of
resources needed to deliver optimal performance depending on the available
hardware platform.
Comment: 8 pages, 5 figures