Benchmarking the Intel® Xeon® Platinum 8160 Processor
This report presents a set of results for different microbenchmarks and applications on the Intel
Xeon Platinum 8160 Processor, code-named Skylake. For simplicity, we will use both Skylake
and SKX to refer to this processor. We use the Skylake nodes that will be available in Stampede2.
This system will provide Intel Knights Landing and Skylake chips interconnected by a 100 Gb/sec
Intel Omni-Path (OPA) network with a fat-tree topology. The peak performance of the system will
be 18 PF.
Texas Advanced Computing Center (TACC)
Blocked All-Pairs Shortest Paths Algorithm on Intel Xeon Phi KNL Processor: A Case Study
Manycore processors are consolidating in the HPC community as a way of improving
performance while keeping power efficiency. Knights Landing is the recently
released second generation of the Intel Xeon Phi architecture. While optimizing
applications on CPUs, GPUs, and first-generation Xeon Phis has been studied
extensively in recent years, the new features of Knights Landing processors require the
revision of programming and optimization techniques for these devices. In this work, we
selected the Floyd-Warshall algorithm as a representative case study of graph
and memory-bound applications. Starting from the default serial version, we
show how data-level, thread-level, and compiler-level optimizations help the parallel
implementation reach 338 GFLOPS.
Comment: Computer Science - CACIC 2017. Springer Communications in Computer
and Information Science, vol 79
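The abstract names blocked (tiled) Floyd-Warshall but includes no code. A minimal sketch of the standard blocked scheme it builds on might look as follows; the function names and block size are illustrative, not taken from the paper, and the paper's actual optimizations (vectorization, threading) are omitted:

```python
# Blocked (tiled) Floyd-Warshall all-pairs shortest paths.
# Tiling keeps each working set in cache, which is the starting point
# for the data-level optimizations the paper describes on KNL.
INF = float("inf")

def fw_block(dist, ko, io, jo, B):
    """Relax tile (io, jo) through intermediate vertices ko .. ko+B-1."""
    for k in range(ko, ko + B):
        row_k = dist[k]
        for i in range(io, io + B):
            row_i = dist[i]
            dik = row_i[k]  # invariant within this k iteration
            for j in range(jo, jo + B):
                alt = dik + row_k[j]
                if alt < row_i[j]:
                    row_i[j] = alt

def blocked_floyd_warshall(dist, B):
    """In-place blocked FW on an n x n distance matrix; B must divide n."""
    n = len(dist)
    assert n % B == 0, "n must be a multiple of the block size"
    nb = n // B
    for kb in range(nb):
        o = kb * B
        fw_block(dist, o, o, o, B)              # 1. diagonal tile
        for m in range(nb):                     # 2. row and column panels
            if m != kb:
                fw_block(dist, o, o, m * B, B)
                fw_block(dist, o, m * B, o, B)
        for r in range(nb):                     # 3. all remaining tiles
            for c in range(nb):
                if r != kb and c != kb:
                    fw_block(dist, o, r * B, c * B, B)
    return dist
```

The three-phase ordering (diagonal tile, then its row/column panels, then the rest) preserves the dependencies of the classic triple loop while making every tile update cache-friendly and independent within a phase, which is what enables threading.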
Wilson and Domainwall Kernels on Oakforest-PACS
We report the performance of Wilson and Domainwall kernels on a new Intel
Xeon Phi Knights Landing based machine named Oakforest-PACS, which is co-hosted
by the University of Tokyo and the University of Tsukuba and is currently the fastest in
Japan. This machine uses Intel Omni-Path for the inter-node network. We compare
performance across several types of implementation, including one that makes use of
the Grid library. The code is incorporated into the code set Bridge++.
Comment: 8 pages, 9 figures, Proceedings of the 35th International Symposium
on Lattice Field Theory (Lattice 2017)
DD-AMG on QPACE 3
We describe our experience porting the Regensburg implementation of the
DD-AMG solver from QPACE 2 to QPACE 3. We first review how the code was
ported from the first-generation Intel Xeon Phi processor (Knights Corner) to
its successor (Knights Landing). We then describe the modifications in the
communication library necessitated by the switch from InfiniBand to Omni-Path.
Finally, we present the performance of the code on a single processor as well
as the scaling on many nodes, where in both cases the speedup factor is close
to the theoretical expectations.
Comment: 12 pages, 6 figures, Proceedings of Lattice 201
Landau Collision Integral Solver with Adaptive Mesh Refinement on Emerging Architectures
The Landau collision integral is an accurate model for the small-angle
dominated Coulomb collisions in fusion plasmas. We investigate a high order
accurate, fully conservative, finite element discretization of the nonlinear
multi-species Landau integral with adaptive mesh refinement using the PETSc
library (www.mcs.anl.gov/petsc). We develop algorithms and techniques to
efficiently utilize emerging architectures with an approach that minimizes
memory usage and movement and is suitable for vector processing. The Landau
collision integral is vectorized with Intel AVX-512 intrinsics and the solver
sustains as much as 22% of the theoretical peak flop rate of the
second-generation Intel Xeon Phi (Knights Landing) processor.
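The 22%-of-peak figure can be put in context with a back-of-envelope calculation. The SKU parameters below (a 68-core Xeon Phi 7250 at 1.4 GHz base clock) are an assumption for illustration; the abstract does not name the exact part:

```python
# Back-of-envelope double-precision peak for a KNL chip.
# Assumed SKU: Xeon Phi 7250 (68 cores, 1.4 GHz base) -- illustrative only,
# the abstract does not state which KNL part was used.
cores = 68
ghz = 1.4
vpus_per_core = 2       # two AVX-512 vector units per core
dp_lanes = 8            # 512 bits / 64-bit doubles
flops_per_fma = 2       # a fused multiply-add counts as two flops

peak_gflops = cores * ghz * vpus_per_core * dp_lanes * flops_per_fma
sustained_gflops = 0.22 * peak_gflops  # the paper's reported fraction

print(f"peak ~ {peak_gflops:.0f} GF/s, 22% ~ {sustained_gflops:.0f} GF/s")
```

Under these assumptions the chip peaks at roughly 3 TF/s in double precision, so 22% corresponds to several hundred GF/s sustained, a high fraction for a memory-intensive kernel.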
MILC Code Performance on High End CPU and GPU Supercomputer Clusters
With recent developments in parallel supercomputing architecture, many core,
multi-core, and GPU processors are now commonplace, resulting in more levels of
parallelism, memory hierarchy, and programming complexity. It has been
necessary to adapt the MILC code to these new processors starting with NVIDIA
GPUs, and more recently, the Intel Xeon Phi processors. We report on our
efforts to port and optimize our code for the Intel Knights Landing
architecture. We consider performance of the MILC code with MPI and OpenMP, and
optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on
the staggered conjugate gradient and gauge force. We also consider performance
on recent NVIDIA GPUs using the QUDA library.
MILC staggered conjugate gradient performance on Intel KNL
We review our work done to optimize the staggered conjugate gradient (CG)
algorithm in the MILC code for use with the Intel Knights Landing (KNL)
architecture. KNL is the second-generation Intel Xeon Phi processor. It is
capable of massive thread parallelism, data parallelism, and high on-board
memory bandwidth and is being adopted in supercomputing centers for scientific
research. The CG solver consumes the majority of time in production running, so
we have spent most of our effort on it. We compare performance of an MPI+OpenMP
baseline version of the MILC code with a version incorporating the QPhiX
staggered CG solver, for both one-node and multi-node runs.
Comment: 8 pages, 4 figures
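The kernel being optimized in this and the preceding MILC abstract is a conjugate-gradient solve. The sketch below is the textbook CG iteration on a generic symmetric positive-definite system, not MILC's staggered Dirac operator or the QPhiX implementation; a dense matrix stands in for the operator purely for illustration:

```python
# Textbook conjugate gradient for a symmetric positive-definite system
# A x = b. In MILC the operator A would be the staggered Dirac operator;
# here a dense SPD matrix stands in for it, so this is a sketch of the
# iteration being optimized, not MILC's implementation.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x            # initial residual
    p = r.copy()             # initial search direction
    rr = r @ r
    for _ in range(max_iter):
        Ap = A @ p           # the matrix-vector product dominates the cost,
                             # hence the vectorization/threading effort
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```

Each iteration needs one operator application plus a handful of vector updates and dot products, which is why the papers focus their KNL optimization on the matrix-vector kernel.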
Performance analysis and optimization of the JOREK code for many-core CPUs
This report investigates the performance of the JOREK code on the Intel
Knights Landing and Skylake processor architectures. The OpenMP scaling of the
matrix construction part of the code was analyzed and improved synchronization
methods were implemented. A new switch was implemented to control the number of
threads used for the linear equation solver independently from other parts of
the code. The matrix construction subroutine was vectorized, and the data
locality was also improved. These steps led to a factor of two speedup for the
matrix construction.
Performance analysis on the Intel Knights Landing architecture
One of the emerging architectures in HPC systems is Intel’s Knights Landing (KNL) many-core chip, which will also be part of BSC’s next HPC installation, MareNostrum 4. KNL is the code name of the second generation of Intel Xeon Phi, a Many Integrated Core (MIC) architecture with up to 72 cores and four-way hyper-threading. It includes up to 384 GB of DDR4 RAM and 8-16 GB of stacked MCDRAM, a version of high-bandwidth memory. In addition, each core will have two 512-bit vector units and will support AVX-512 SIMD instructions.