14,392 research outputs found

    Gauge Field Generation on Large-Scale GPU-Enabled Systems

    Full text link
    Over the past years GPUs have been successfully applied to the task of inverting the fermion matrix in lattice QCD calculations. Even strong scaling to capability-level supercomputers, corresponding to O(100) GPUs or more has been achieved. However strong scaling a whole gauge field generation algorithm to this regim requires significantly more functionality than just having the matrix inverter utilizing the GPUs and has not yet been accomplished. This contribution extends QDP-JIT, the migration of SciDAC QDP++ to GPU-enabled parallel systems, to help to strong scale the whole Hybrid Monte-Carlo to this regime. Initial results are shown for gauge field generation with Chroma simulating pure Wilson fermions on OLCF TitanDev.Comment: The 30th International Symposium on Lattice Field Theory, June 24-29, 2012, Cairns, Australia (Acknowledgment and Citation added

    Parallel Tempering Simulation of the three-dimensional Edwards-Anderson Model with Compact Asynchronous Multispin Coding on GPU

    Get PDF
    Monte Carlo simulations of the Ising model play an important role in the field of computational statistical physics, and they have revealed many properties of the model over the past few decades. However, the effect of frustration due to random disorder, in particular the possible spin glass phase, remains a crucial but poorly understood problem. One of the obstacles in the Monte Carlo simulation of random frustrated systems is their long relaxation time making an efficient parallel implementation on state-of-the-art computation platforms highly desirable. The Graphics Processing Unit (GPU) is such a platform that provides an opportunity to significantly enhance the computational performance and thus gain new insight into this problem. In this paper, we present optimization and tuning approaches for the CUDA implementation of the spin glass simulation on GPUs. We discuss the integration of various design alternatives, such as GPU kernel construction with minimal communication, memory tiling, and look-up tables. We present a binary data format, Compact Asynchronous Multispin Coding (CAMSC), which provides an additional 28.4%28.4\% speedup compared with the traditionally used Asynchronous Multispin Coding (AMSC). Our overall design sustains a performance of 33.5 picoseconds per spin flip attempt for simulating the three-dimensional Edwards-Anderson model with parallel tempering, which significantly improves the performance over existing GPU implementations.Comment: 15 pages, 18 figure

    Morphological diagram of diffusion driven aggregate growth in plane: competition of anisotropy and adhesion

    Full text link
    Two-dimensional structures grown with Witten and Sander algorithm are investigated. We analyze clusters grown off-lattice and clusters grown with antenna method with Nfp=3,4,5,6,7N_{fp}=3,4,5,6,7 and 8 allowed growth directions. With the help of variable probe particles technique we measure fractal dimension of such clusters D(N)D(N) as a function of their size NN. We propose that in the thermodynamic limit of infinite cluster size the aggregates grown with high degree of anisotropy (Nfp=3,4,5N_{fp}=3,4,5) tend to have fractal dimension DD equal to 3/2, while off-lattice aggregates and aggregates with lower anisotropy (Nfp>6N_{fp}>6) have D1.710D \approx 1.710. Noise-reduction procedure results in the change of universality class for DLA. For high enough noise-reduction value clusters with Nfp6N_{fp} \ge 6 have fractal dimension going to 3/23/2 when NN\rightarrow\infty.Comment: 6 pages, 8 figures, conference CCP201

    A Verified Information-Flow Architecture

    Get PDF
    SAFE is a clean-slate design for a highly secure computer system, with pervasive mechanisms for tracking and limiting information flows. At the lowest level, the SAFE hardware supports fine-grained programmable tags, with efficient and flexible propagation and combination of tags as instructions are executed. The operating system virtualizes these generic facilities to present an information-flow abstract machine that allows user programs to label sensitive data with rich confidentiality policies. We present a formal, machine-checked model of the key hardware and software mechanisms used to dynamically control information flow in SAFE and an end-to-end proof of noninterference for this model. We use a refinement proof methodology to propagate the noninterference property of the abstract machine down to the concrete machine level. We use an intermediate layer in the refinement chain that factors out the details of the information-flow control policy and devise a code generator for compiling such information-flow policies into low-level monitor code. Finally, we verify the correctness of this generator using a dedicated Hoare logic that abstracts from low-level machine instructions into a reusable set of verified structured code generators

    Matched filters for coalescing binaries detection on massively parallel computers

    Get PDF
    We discuss some computational problems associated to matched filtering of experimental signals from gravitational wave interferometric detectors in a parallel-processing environment. We then specialize our discussion to the use of the APEmille and apeNEXT processors for this task. Finally, we accurately estimate the performance of an APEmille system on a computational load appropriate for the LIGO and VIRGO experiments, and extrapolate our results to apeNEXT.Comment: 19 pages, 6 figure

    Towards Lattice Quantum Chromodynamics on FPGA devices

    Get PDF
    In this paper we describe a single-node, double precision Field Programmable Gate Array (FPGA) implementation of the Conjugate Gradient algorithm in the context of Lattice Quantum Chromodynamics. As a benchmark of our proposal we invert numerically the Dirac-Wilson operator on a 4-dimensional grid on three Xilinx hardware solutions: Zynq Ultrascale+ evaluation board, the Alveo U250 accelerator and the largest device available on the market, the VU13P device. In our implementation we separate software/hardware parts in such a way that the entire multiplication by the Dirac operator is performed in hardware, and the rest of the algorithm runs on the host. We find out that the FPGA implementation can offer a performance comparable with that obtained using current CPU or Intel's many core Xeon Phi accelerators. A possible multiple node FPGA-based system is discussed and we argue that power-efficient High Performance Computing (HPC) systems can be implemented using FPGA devices only.Comment: 17 pages, 4 figure

    Investigating the Dirac operator evaluation with FPGAs

    Get PDF
    In recent years the computational capacity of single Field Programmable Gate Arrays (FPGA) devices as well as their versatility has increased significantly. Adding to that the High Level Synthesis frameworks allowing to program such processors in a high level language like C++, makes modern FPGA devices a serious candidate as building blocks of a general purpose High Performance Computing solution. In this contribution we describe benchmarks which we performed using a Lattice QCD code, a highly compute-demanding HPC academic code for elementary particle simulations. We benchmark the performance of a single FPGA device running in two modes: using the external or embedded memory. We discuss both approaches in detail using the Xilinx U250 device and provide estimates for the necessary memory throughput and the minimal amount of resources needed to deliver optimal performance depending on the available hardware platform.Comment: 8 pages, 5 figure
    corecore