786 research outputs found

Recommended from our members

Software Prefetching for Indirect Memory Accesses

Author: Ainsworth Sam
Jones Timothy M
Publication venue: CGO'17: PROCEEDINGS OF THE 2017 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION
Publication date: 01/01/2017
Field of study

[…]× and 1.1× for an Intel Haswell processor and an ARM Cortex-A57, both out-of-order cores, and performance improvements of 2.1× and 3.7× for the in-order ARM Cortex-A53 and Intel Xeon Phi

Apollo (Cambridge)

Recommended from our members

Software prefetching for indirect memory accesses: A microarchitectural perspective

Author: Ainsworth S
Jones TM
Publication venue: ACM Transactions on Computer Systems
Publication date: 01/01/2019
Field of study

Many modern data processing and HPC workloads are heavily memory-latency bound. A tempting proposition to solve this is software prefetching, where special non-blocking loads are used to bring data into the cache hierarchy just before it is required. However, these are difficult to insert in a way that effectively improves performance, and techniques for automatic insertion are currently limited. This article develops a novel compiler pass to automatically generate software prefetches for indirect memory accesses, a special class of irregular memory accesses often seen in high-performance workloads. We evaluate this across a wide set of systems, all of which benefit from the technique. We then evaluate the extent to which good prefetch instructions are architecture dependent and the class of programs that are particularly amenable. Across a set of memory-bound benchmarks, our automated pass achieves average speedups of 1.3× for an Intel Haswell processor, 1.1× for both an ARM Cortex-A57 and Qualcomm Kryo, 1.2× for a Cortex-A72 and an Intel Kaby Lake, and 1.35× for an Intel Xeon Phi Knight’s Landing, each of which is an out-of-order core, and performance improvements of 2.1× and 2.7× for the in-order ARM Cortex-A53 and first generation Intel Xeon Phi.
EPSRC [EP/K026399/1, EP/M506485/1], ARM Ltd
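As a rough illustration of the kind of prefetch the abstract describes, the following C sketch sums the targets of an indirect access and issues a non-blocking prefetch a fixed number of iterations ahead. It is a hand-written sketch, not the output of the authors' compiler pass: the function name `sum_indirect` and the `PREFETCH_DISTANCE` value are hypothetical, and `__builtin_prefetch` is the GCC/Clang builtin rather than anything the abstract names.

```c
#include <stddef.h>

/* Hypothetical look-ahead distance, in loop iterations. The best value
   is machine dependent and would need tuning per architecture. */
#define PREFETCH_DISTANCE 16

/* Sums data[index[i]] for i in [0, n). The loads from data[] are the
   indirect accesses; index[] itself is read sequentially. */
long sum_indirect(const long *data, const size_t *index, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Non-blocking hint: start fetching the element that will be
               needed PREFETCH_DISTANCE iterations from now.
               Arguments: address, 0 = read, 3 = high temporal locality. */
            __builtin_prefetch(&data[index[i + PREFETCH_DISTANCE]], 0, 3);
        }
        total += data[index[i]];
    }
    return total;
}
```

The bounds check keeps the look-ahead read of `index[]` inside the array. A prefetch is only a hint, so the result is the same with or without it; only the timing changes.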

Apollo (Cambridge)

Recommended from our members

Performance impact of programmer-inserted Data Prefetches for irregular access patterns with a case study of FMM VList algorithm

Author: Tondon Abhishek
Publication venue
Publication date: 22/04/2014
Field of study

Data prefetching is a well-known technique for speeding up applications, in which hardware prefetchers or compilers speculatively prefetch data into caches closer to the processor so that it is readily available when the processor demands it. Since incorrect speculation leads to prefetching useless data, which in turn wastes memory bandwidth and pollutes caches, prefetch mechanisms are usually conservative and prefetch only on spotting fairly regular access patterns. This gives a programmer who knows the application an opportunity to insert fine-grained software prefetches in the code to precisely prefetch data that is certain to be demanded but whose access pattern is not obvious enough for hardware prefetchers or the compiler to detect. In this study, the author demonstrates the performance improvement obtained by such programmer-inserted prefetches with the case study of an FMM (Fast Multipole Method) VList application kernel run with several different configurations. The VList computation requires computing the Hadamard product of matrices. However, the way each node of the octree is stored in memory leads to indirect accesses, where the memory accesses themselves are not sequential but the pointers to those memory locations are still stored sequentially. Since compilers do not insert prefetches for indirect accesses, and the access pattern appears random to hardware, programmer-inserted prefetching is the only solution in such a case. The author demonstrates the performance gain obtained with different prefetching choices, in terms of which structures in the code to prefetch and which level of cache to prefetch them to, and also presents an analysis of the impact of different configuration parameters on performance gain.
The author shows that several prefetching combinations always bring a performance gain without ever hurting performance, and identifies prefetching to the L1 cache and prefetching all data structures in question as the best prefetching recommendation for this application kernel. It is shown that this one combination gets the highest performance gain for most run configurations and an average performance gain of 10.14% across all run configurations.
Electrical and Computer Engineerin
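The access pattern described in this abstract (pointers stored sequentially, targets scattered in memory) can be sketched in C as follows. This is an illustrative sketch only, not the thesis code: the function name `hadamard_accumulate`, the block layout, the 64-byte cache-line assumption, and the `LOCALITY_HINT` macro are all hypothetical, and `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <stddef.h>

/* Locality hint for __builtin_prefetch: 3 asks for the data to be kept
   in all cache levels, including L1; lower values target outer levels.
   How a hint maps to cache levels is hardware dependent. */
#define LOCALITY_HINT 3

/* Assumed cache-line size of 64 bytes, i.e. 8 doubles per line. */
#define DOUBLES_PER_LINE 8

/* Accumulates the element-wise (Hadamard) product of blocks a[j] and
   b[j] into out. The pointer arrays a[] and b[] are sequential, but the
   blocks they point to may lie anywhere in memory. */
void hadamard_accumulate(double *out,
                         const double *const *a,
                         const double *const *b,
                         size_t nblocks, size_t len)
{
    for (size_t j = 0; j < nblocks; j++) {
        if (j + 1 < nblocks) {
            /* Prefetch every cache line of the next pair of blocks
               while the current pair is being processed. */
            for (size_t k = 0; k < len; k += DOUBLES_PER_LINE) {
                __builtin_prefetch(&a[j + 1][k], 0, LOCALITY_HINT);
                __builtin_prefetch(&b[j + 1][k], 0, LOCALITY_HINT);
            }
        }
        for (size_t k = 0; k < len; k++) {
            out[k] += a[j][k] * b[j][k];
        }
    }
}
```

Changing `LOCALITY_HINT`, or prefetching only one of the two block arrays, corresponds loosely to the choices of cache level and data structure that the abstract says were compared.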

Texas ScholarWorks

IMP: Indirect Memory Prefetcher

Author: Devadas Srinivas
Hughes Christopher J.
Satish Nadathur
Yu Xiangyao
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/12/2015
Field of study

Machine learning, graph analytics and sparse linear algebra-based applications are dominated by irregular memory accesses resulting from following edges in a graph or non-zero elements in a sparse matrix. These accesses have little temporal or spatial locality, and thus incur long memory stalls and large bandwidth requirements. A traditional streaming or striding prefetcher cannot capture these irregular access patterns. A majority of these irregular accesses come from indirect patterns of the form A[B[i]]. We propose an efficient hardware indirect memory prefetcher (IMP) to capture this access pattern and hide latency. We also propose a partial cacheline accessing mechanism for these prefetches to reduce the network and DRAM bandwidth pressure from the lack of spatial locality. Evaluated on 7 applications, IMP shows 56% speedup on average (up to 2.3×) compared to a baseline 64 core system with streaming prefetchers. This is within 23% of an idealized system. With partial cacheline accessing, we see another 9.4% speedup on average (up to 46.6%).
Intel Science and Technology Center for Big Dat

DSpace@MIT

Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

Author: Kaiser Hartmut
Khatami Zahra
Ramanujam J.
Publication venue
Publication date: 27/03/2017
Field of study

Maximizing the level of parallelism in applications can be achieved by minimizing overheads due to load imbalances and waiting time due to memory latencies. Compiler optimization is one of the most effective solutions to tackle this problem. The compiler is able to detect the data dependencies in an application and to analyze specific sections of code for parallelization potential. However, these techniques are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and producing the desired application scalability. One solution to address this challenge is the use of runtime methods. This strategy can be implemented by delaying a certain amount of code analysis until runtime. In this research, we improve the performance of parallel applications generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The results of the research were evaluated using an Airfoil application, which showed a 40-50% improvement in parallel performance.
Comment: 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017

arXiv.org e-Print Archive

Crossref

First Evaluation of the CPU, GPGPU and MIC Architectures for Real Time Particle Tracking based on Hough Transform at the LHC

Author: Halyo V.
Karpusenko V.
LeGresley P.
Lujan P.
Vladimirov A.
Publication venue: 'IOP Publishing'
Publication date: 28/10/2013
Field of study

Recent innovations focused around parallel processing, either through systems containing multiple processors or processors containing multiple cores, hold great promise for enhancing the performance of the trigger at the LHC and extending its physics program. The flexibility of the CMS/ATLAS trigger system allows for easy integration of computational accelerators, such as NVIDIA's Tesla Graphics Processing Unit (GPU) or Intel's Xeon Phi, in the High Level Trigger. These accelerators have the potential to provide faster or more energy efficient event selection, thus opening up possibilities for new complex triggers that were not previously feasible. At the same time, it is crucial to explore the performance limits achievable on the latest generation multicore CPUs with the use of the best software optimization methods. In this article, a new tracking algorithm based on the Hough transform will be evaluated for the first time on a multi-core Intel Xeon E5-2697v2 CPU, an NVIDIA Tesla K20c GPU, and an Intel Xeon Phi 7120 coprocessor. Preliminary time performance will be presented.
Comment: 13 pages, 4 figures, Accepted to JINS

arXiv.org e-Print Archive

CERN Document Server