532 research outputs found
QPACE 2 and Domain Decomposition on the Intel Xeon Phi
We give an overview of QPACE 2, which is a custom-designed supercomputer
based on Intel Xeon Phi processors, developed in a collaboration of Regensburg
University and Eurotech. We give some general recommendations for how to write
high-performance code for the Xeon Phi and then discuss our implementation of a
domain-decomposition-based solver and present a number of benchmarks.Comment: plenary talk at Lattice 2014, to appear in the conference proceedings
PoS(LATTICE2014), 15 pages, 9 figure
Knowledge is power: Quantum chemistry on novel computer architectures
In the first chapter of this thesis, a background of fundamental quantum chemistry concepts is provided. Chapter two contains an analysis of the performance and energy efficiency of various modern computer processor architectures while performing computational chemistry
calculations. In chapter three, the processor architectural study is expanded to include parallel computational chemistry algorithms executed across multiple-node computer clusters. Chapter four describes a novel computational implementation of the fundamental Hartree-Fock method which significantly reduces computer memory requirements. In chapter five, a case study of quantum chemistry two-electron integral code interoperability is described. The final chapters of this work discuss applications of quantum chemistry. In chapter six, an investigation of the esterification of acetic acid on acid-functionalized silica is presented. In chapter seven, the application of ab initio molecular dynamics to study the photoisomerization and photocyclization of stilbene is discussed. Final concluding remarks are noted in chapter eight
Janus II: a new generation application-driven computer for spin-system simulations
This paper describes the architecture, the development and the implementation
of Janus II, a new generation application-driven number cruncher optimized for
Monte Carlo simulations of spin systems (mainly spin glasses). This domain of
computational physics is a recognized grand challenge of high-performance
computing: the resources necessary to study in detail theoretical models that
can make contact with experimental data are by far beyond those available using
commodity computer systems. On the other hand, several specific features of the
associated algorithms suggest that unconventional computer architectures, which
can be implemented with available electronics technologies, may lead to order
of magnitude increases in performance, reducing to acceptable values on human
scales the time needed to carry out simulation campaigns that would take
centuries on commercially available machines. Janus II is one such machine,
recently developed and commissioned, that builds upon and improves on the
successful JANUS machine, which has been used for physics since 2008 and is
still in operation today. This paper describes in detail the motivations behind
the project, the computational requirements, the architecture and the
implementation of this new machine and compares its expected performances with
those of currently available commercial systems.Comment: 28 pages, 6 figure
Acceleration of Large-Scale Electronic Structure Simulations with Heterogeneous Parallel Computing
Large-scale electronic structure simulations coupled to an empirical modeling approach are critical as they present a robust way to predict various quantum phenomena in realistically sized nanoscale structures that are hard to be handled with density functional theory. For tight-binding (TB) simulations of electronic structures that normally involve multimillion atomic systems for a direct comparison to experimentally realizable nanoscale materials and devices, we show that graphical processing unit (GPU) devices help in saving computing costs in terms of time and energy consumption. With a short introduction of the major numerical method adopted for TB simulations of electronic structures, this work presents a detailed description for the strategies to drive performance enhancement with GPU devices against traditional clusters of multicore processors. While this work only uses TB electronic structure simulations for benchmark tests, it can be also utilized as a practical guideline to enhance performance of numerical operations that involve large-scale sparse matrices
Using accelerators to speed up scientific and engineering codes: perspectives and problems
Accelerators are quickly emerging as the leading technology to further boost
computing performances; their main feature is a massively parallel on-chip architecture. NVIDIA
and AMD GPUs and the Intel Xeon-Phi are examples of accelerators available today. Accelerators
are power-efficient and deliver up to one order of magnitude more peak performance than
traditional CPUs. However, existing codes for traditional CPUs require substantial changes to
run efficiently on accelerators, including rewriting with specific programming languages.
In this contribution we present our experience in porting large codes to NVIDIA GPU and Intel
Xeon-Phi accelerators. Our reference application is a CFD code based on the Lattice
Boltzmann (LB) method. The regular structure of LB algorithms makes them suitable for
processor architectures with a large degree of parallelism. However, the challenge of
exploiting a large fraction of the theoretically available performance is not easy to
met. We consider a state-of-the-art two-dimensional LB model based on 37 populations (a
D2Q37 model), that accurately reproduces the thermo-hydrodynamics of a 2D-fluid obeying the
equation-of-state of a perfect gas.
We describe in details how we implement and optimize our LB code for Xeon-Phi and
GPUs, and then analyze performances on single- and multi-accelerator systems. We
finally compare results with those available on recent traditional multi-core CPUs
Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures
A new solver featuring time-space adaptation and error control has been
recently introduced to tackle the numerical solution of stiff
reaction-diffusion systems. Based on operator splitting, finite volume adaptive
multiresolution and high order time integrators with specific stability
properties for each operator, this strategy yields high computational
efficiency for large multidimensional computations on standard architectures
such as powerful workstations. However, the data structure of the original
implementation, based on trees of pointers, provides limited opportunities for
efficiency enhancements, while posing serious challenges in terms of parallel
programming and load balancing. The present contribution proposes a new
implementation of the whole set of numerical methods including Radau5 and
ROCK4, relying on a fully different data structure together with the use of a
specific library, TBB, for shared-memory, task-based parallelism with
work-stealing. The performance of our implementation is assessed in a series of
test-cases of increasing difficulty in two and three dimensions on multi-core
and many-core architectures, demonstrating high scalability
- …