Benchmarking the Intel® Xeon® Platinum 8160 Processor
This report presents a set of results for different microbenchmarks and applications on the Intel
Xeon Platinum 8160 Processor, code-named Skylake. For simplicity, we will use both Skylake
and SKX to refer to this processor. We use the Skylake nodes that will be available in Stampede2.
This system will provide Intel Knights Landing and Skylake chips interconnected by a 100 Gb/sec
Intel Omni-Path (OPA) network with a fat-tree topology. The peak performance of the system will
be 18 PF.
Texas Advanced Computing Center (TACC)
Blocked All-Pairs Shortest Paths Algorithm on Intel Xeon Phi KNL Processor: A Case Study
Manycore processors are consolidating in the HPC community as a way of improving
performance while keeping power efficiency. Knights Landing is the recently
released second generation of the Intel Xeon Phi architecture. While optimizing
applications on CPUs, GPUs, and first-generation Xeon Phis has been studied
extensively in recent years, the new features of Knights Landing processors require the
revision of programming and optimization techniques for these devices. In this work, we
selected the Floyd-Warshall algorithm as a representative case study of graph
and memory-bound applications. Starting from the default serial version, we
show how data-level, thread-level, and compiler-level optimizations help the parallel
implementation reach 338 GFLOPS.
Comment: Computer Science - CACIC 2017. Springer Communications in Computer
and Information Science, vol 79
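The abstract names blocked (tiled) Floyd-Warshall but includes no code. A minimal sketch of the standard blocked scheme it builds on might look as follows; the function names and block size are illustrative, not taken from the paper, and the paper's actual optimizations (vectorization, threading) are omitted:

```python
# Blocked (tiled) Floyd-Warshall all-pairs shortest paths.
# Tiling keeps each working set in cache, which is the starting point
# for the data-level optimizations the paper describes on KNL.
INF = float("inf")

def fw_block(dist, ko, io, jo, B):
    """Relax tile (io, jo) through intermediate vertices ko .. ko+B-1."""
    for k in range(ko, ko + B):
        row_k = dist[k]
        for i in range(io, io + B):
            row_i = dist[i]
            dik = row_i[k]  # invariant within this k iteration
            for j in range(jo, jo + B):
                alt = dik + row_k[j]
                if alt < row_i[j]:
                    row_i[j] = alt

def blocked_floyd_warshall(dist, B):
    """In-place blocked FW on an n x n distance matrix; B must divide n."""
    n = len(dist)
    assert n % B == 0, "n must be a multiple of the block size"
    nb = n // B
    for kb in range(nb):
        o = kb * B
        fw_block(dist, o, o, o, B)              # 1. diagonal tile
        for m in range(nb):                     # 2. row and column panels
            if m != kb:
                fw_block(dist, o, o, m * B, B)
                fw_block(dist, o, m * B, o, B)
        for r in range(nb):                     # 3. all remaining tiles
            for c in range(nb):
                if r != kb and c != kb:
                    fw_block(dist, o, r * B, c * B, B)
    return dist
```

The three-phase ordering (diagonal tile, then its row/column panels, then the rest) preserves the dependencies of the classic triple loop while making every tile update cache-friendly and independent within a phase, which is what enables threading.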
Wilson and Domainwall Kernels on Oakforest-PACS
We report the performance of Wilson and Domainwall kernels on a new Intel
Xeon Phi Knights Landing based machine named Oakforest-PACS, which is co-hosted
by the University of Tokyo and the University of Tsukuba and is currently the fastest in
Japan. This machine uses Intel Omni-Path for the inter-node network. We compare
performance across several types of implementation, including one that makes use of
the Grid library. The code is incorporated into the code set Bridge++.
Comment: 8 pages, 9 figures, Proceedings of the 35th International Symposium
on Lattice Field Theory (Lattice 2017)
DD-AMG on QPACE 3
We describe our experience porting the Regensburg implementation of the
DD-AMG solver from QPACE 2 to QPACE 3. We first review how the code was
ported from the first-generation Intel Xeon Phi processor (Knights Corner) to
its successor (Knights Landing). We then describe the modifications in the
communication library necessitated by the switch from InfiniBand to Omni-Path.
Finally, we present the performance of the code on a single processor as well
as the scaling on many nodes, where in both cases the speedup factor is close
to the theoretical expectations.
Comment: 12 pages, 6 figures, Proceedings of Lattice 201
Landau Collision Integral Solver with Adaptive Mesh Refinement on Emerging Architectures
The Landau collision integral is an accurate model for the small-angle
dominated Coulomb collisions in fusion plasmas. We investigate a high order
accurate, fully conservative, finite element discretization of the nonlinear
multi-species Landau integral with adaptive mesh refinement using the PETSc
library (www.mcs.anl.gov/petsc). We develop algorithms and techniques to
efficiently utilize emerging architectures with an approach that minimizes
memory usage and movement and is suitable for vector processing. The Landau
collision integral is vectorized with Intel AVX-512 intrinsics and the solver
sustains as much as 22% of the theoretical peak flop rate of the
second-generation Intel Xeon Phi (Knights Landing) processor.
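The 22%-of-peak figure can be put in context with a back-of-envelope calculation. The SKU parameters below (a 68-core Xeon Phi 7250 at 1.4 GHz base clock) are an assumption for illustration; the abstract does not name the exact part:

```python
# Back-of-envelope double-precision peak for a KNL chip.
# Assumed SKU: Xeon Phi 7250 (68 cores, 1.4 GHz base) -- illustrative only,
# the abstract does not state which KNL part was used.
cores = 68
ghz = 1.4
vpus_per_core = 2       # two AVX-512 vector units per core
dp_lanes = 8            # 512 bits / 64-bit doubles
flops_per_fma = 2       # a fused multiply-add counts as two flops

peak_gflops = cores * ghz * vpus_per_core * dp_lanes * flops_per_fma
sustained_gflops = 0.22 * peak_gflops  # the paper's reported fraction

print(f"peak ~ {peak_gflops:.0f} GF/s, 22% ~ {sustained_gflops:.0f} GF/s")
```

Under these assumptions the chip peaks at roughly 3 TF/s in double precision, so 22% corresponds to several hundred GF/s sustained, a high fraction for a memory-intensive kernel.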
MILC Code Performance on High End CPU and GPU Supercomputer Clusters
With recent developments in parallel supercomputing architecture, many core,
multi-core, and GPU processors are now commonplace, resulting in more levels of
parallelism, memory hierarchy, and programming complexity. It has been
necessary to adapt the MILC code to these new processors starting with NVIDIA
GPUs, and more recently, the Intel Xeon Phi processors. We report on our
efforts to port and optimize our code for the Intel Knights Landing
architecture. We consider performance of the MILC code with MPI and OpenMP, and
optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on
the staggered conjugate gradient and gauge force. We also consider performance
on recent NVIDIA GPUs using the QUDA library.
MILC staggered conjugate gradient performance on Intel KNL
We review our work done to optimize the staggered conjugate gradient (CG)
algorithm in the MILC code for use with the Intel Knights Landing (KNL)
architecture. KNL is the second-generation Intel Xeon Phi processor. It is
capable of massive thread parallelism, data parallelism, and high on-board
memory bandwidth and is being adopted in supercomputing centers for scientific
research. The CG solver consumes the majority of time in production running, so
we have spent most of our effort on it. We compare performance of an MPI+OpenMP
baseline version of the MILC code with a version incorporating the QPhiX
staggered CG solver, for both one-node and multi-node runs.
Comment: 8 pages, 4 figures
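The kernel being optimized in this and the preceding MILC abstract is a conjugate-gradient solve. The sketch below is the textbook CG iteration on a generic symmetric positive-definite system, not MILC's staggered Dirac operator or the QPhiX implementation; a dense matrix stands in for the operator purely for illustration:

```python
# Textbook conjugate gradient for a symmetric positive-definite system
# A x = b. In MILC the operator A would be the staggered Dirac operator;
# here a dense SPD matrix stands in for it, so this is a sketch of the
# iteration being optimized, not MILC's implementation.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rr = r @ r
    for _ in range(max_iter):
        Ap = A @ p           # the matrix-vector product dominates the cost,
                             # hence the vectorization/threading effort
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```

Each iteration needs one operator application plus a handful of vector updates and dot products, which is why the papers focus their KNL optimization on the matrix-vector kernel.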
Performance analysis and optimization of the JOREK code for many-core CPUs
This report investigates the performance of the JOREK code on the Intel
Knights Landing and Skylake processor architectures. The OpenMP scaling of the
matrix construction part of the code was analyzed and improved synchronization
methods were implemented. A new switch was implemented to control the number of
threads used for the linear equation solver independently from other parts of
the code. The matrix construction subroutine was vectorized, and the data
locality was also improved. These steps led to a factor of two speedup for the
matrix construction.
Performance analysis on the Intel Knights Landing architecture
One of the emerging architectures in HPC systems is Intel’s Knights Landing (KNL) many-core chip, which will also be part of BSC’s next HPC installation, MareNostrum 4. KNL is the code name of the second generation of Intel Xeon Phi, a Many Integrated Core (MIC) architecture with up to 72 cores and four-way hyper-threading. It includes up to 384 GB of DDR4 RAM and 8-16 GB of stacked MCDRAM, a version of high-bandwidth memory. In addition, each core will have two 512-bit vector units and will support AVX-512 SIMD instructions.