276 research outputs found
First Evaluation of the CPU, GPGPU and MIC Architectures for Real Time Particle Tracking based on Hough Transform at the LHC
Recent innovations focused around {\em parallel} processing, either through
systems containing multiple processors or processors containing multiple cores,
hold great promise for enhancing the performance of the trigger at the LHC and
extending its physics program. The flexibility of the CMS/ATLAS trigger system
allows for easy integration of computational accelerators, such as NVIDIA's
Tesla Graphics Processing Unit (GPU) or Intel's \xphi, in the High Level
Trigger. These accelerators have the potential to provide faster or more energy
efficient event selection, thus opening up possibilities for new complex
triggers that were not previously feasible. At the same time, it is crucial to
explore the performance limits achievable on the latest generation multicore
CPUs with the use of the best software optimization methods. In this article, a
new tracking algorithm based on the Hough transform will be evaluated for the
first time on a multi-core Intel Xeon E5-2697v2 CPU, an NVIDIA Tesla K20c GPU,
and an Intel \xphi\ 7120 coprocessor. Preliminary time performance will be
presented.Comment: 13 pages, 4 figures, Accepted to JINS
Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing
We investigate implementation of lattice Quantum Chromodynamics (QCD) code on
the Intel Xeon Phi Knights Landing (KNL). The most time consuming part of the
numerical simulations of lattice QCD is a solver of linear equation for a large
sparse matrix that represents the strong interaction among quarks. To establish
widely applicable prescriptions, we examine rather general methods for the SIMD
architecture of KNL, such as using intrinsics and manual prefetching, to the
matrix multiplication and iterative solver algorithms. Based on the performance
measured on the Oakforest-PACS system, we discuss the performance tuning on KNL
as well as the code design for facilitating such tuning on SIMD architecture
and massively parallel machines.Comment: 8 pages, 12 figures. Talk given at LHAM'17 "5th International
Workshop on Legacy HPC Application Migration" in CANDAR'17 "The Fifth
International Symposium on Computing and Networking" and to appear in the
proceeding
Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi
Intel Xeon Phi is a recently released high-performance coprocessor which
features 61 cores each supporting 4 hardware threads with 512-bit wide SIMD
registers achieving a peak theoretical performance of 1Tflop/s in double
precision. Many scientific applications involve operations on large sparse
matrices such as linear solvers, eigensolver, and graph mining algorithms. The
core of most of these applications involves the multiplication of a large,
sparse matrix with a dense vector (SpMV). In this paper, we investigate the
performance of the Xeon Phi coprocessor for SpMV. We first provide a
comprehensive introduction to this new architecture and analyze its peak
performance with a number of micro benchmarks. Although the design of a Xeon
Phi core is not much different than those of the cores in modern processors,
its large number of cores and hyperthreading capability allow many application
to saturate the available memory bandwidth, which is not the case for many
cutting-edge processors. Yet, our performance studies show that it is the
memory latency not the bandwidth which creates a bottleneck for SpMV on this
architecture. Finally, our experiments show that Xeon Phi's sparse kernel
performance is very promising and even better than that of cutting-edge general
purpose processors and GPUs
Massively Parallel Computing at the Large Hadron Collider up to the HL-LHC
As the Large Hadron Collider (LHC) continues its upward progression in energy
and luminosity towards the planned High-Luminosity LHC (HL-LHC) in 2025, the
challenges of the experiments in processing increasingly complex events will
also continue to increase. Improvements in computing technologies and
algorithms will be a key part of the advances necessary to meet this challenge.
Parallel computing techniques, especially those using massively parallel
computing (MPC), promise to be a significant part of this effort. In these
proceedings, we discuss these algorithms in the specific context of a
particularly important problem: the reconstruction of charged particle tracks
in the trigger algorithms in an experiment, in which high computing performance
is critical for executing the track reconstruction in the available time. We
discuss some areas where parallel computing has already shown benefits to the
LHC experiments, and also demonstrate how a MPC-based trigger at the CMS
experiment could not only improve performance, but also extend the reach of the
CMS trigger system to capture events which are currently not practical to
reconstruct at the trigger level.Comment: 14 pages, 6 figures. Proceedings of 2nd International Summer School
on Intelligent Signal Processing for Frontier Research and Industry
(INFIERI2014), to appear in JINST. Revised version in response to referee
comment
Seismic Wave Propagation Simulations on Low-power and Performance-centric Manycores
International audienceThe large processing requirements of seismic wave propagation simulations make High Performance Computing (HPC) architectures a natural choice for their execution. However, to keep both the current pace of performance improvements and the power consumption under a strict power budget, HPC systems must be more energy e than ever. As a response to this need, energy-e and low-power processors began to make their way into the market. In this paper we employ a novel low-power processor, the MPPA-256 manycore, to perform seismic wave propagation simulations. It has 256 cores connected by a NoC, no cache-coherence and only a limited amount of on-chip memory. We describe how its particular architectural characteristics influenced our solution for an energy-e implementation. As a counterpoint to the low-power MPPA-256 architecture, we employ Xeon Phi, a performance-centric manycore. Although both processors share some architectural similarities, the challenges to implement an e seismic wave propagation kernel on these platforms are very di↵erent. In this work we compare the performance and energy e of our implementations for these processors to proven and optimized solutions for other hardware platforms such as general-purpose processors and a GPU. Our experimental results show that MPPA-256 has the best energy e consuming at least 77 % less energy than the other evaluated platforms, whereas the performance of our solution for the Xeon Phi is on par with a state-of-the-art solution for GPUs
Optimisation of computational fluid dynamics applications on multicore and manycore architectures
This thesis presents a number of optimisations used for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact is demonstrated in a block-structured and an unstructured code of representative size to industrial applications and across a variety of processor architectures that make up contemporary high-performance computing systems.
The importance of vectorization and the ways through which this can be achieved is demonstrated in both structured and unstructured solvers together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated and used for selecting optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, their exploitation, whether explicitly or implicitly, is shown to be imperative for best performance.
The implementation of all of these optimisations led to application speed-ups ranging between 2.7X and 3X on the multicore CPUs and 5.7X to 24X on the manycore processors.Open Acces
- …