3 research outputs found
Evaluating the performance of legacy applications on emerging parallel architectures
The gap between a supercomputer's theoretical maximum ("peak")
floating-point
performance and that actually achieved by applications has grown wider
over time. Today, a typical scientific application achieves only 5-20% of any
given machine's peak processing capability, and this gap leaves room for significant
improvements in execution times.
This problem is most pronounced for modern "accelerator" architectures:
collections of hundreds of simple, low-clocked cores capable of executing the
same instruction on dozens of pieces of data simultaneously. This is a significant
change from the low number of high-clocked cores found in traditional CPUs,
and effective utilisation of accelerators typically requires extensive code and
algorithmic changes. In many cases, the best way in which to map a parallel
workload to these new architectures is unclear.
The principal focus of the work presented in this thesis is the evaluation
of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel
MIC) for two benchmark codes, the LU benchmark from the NAS Parallel
Benchmark Suite and Sandia's miniMD benchmark, which exhibit complex
parallel behaviours that are representative of many scientific applications. Using
combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we
demonstrate performance improvements of up to 7x for these workloads.
We also detail a code development methodology that permits application developers
to target multiple architecture types without maintaining completely
separate implementations for each platform. Using OpenCL, we develop performance
portable implementations of the LU and miniMD benchmarks that are
faster than the original codes, and at most 2x slower than versions highly tuned
for particular hardware.
Finally, we demonstrate the importance of evaluating architectures at scale
(as opposed to on single nodes) through performance modelling techniques,
highlighting the problems associated with strong-scaling on emerging accelerator
architectures.
WMTools - assessing parallel application memory utilisation at scale
The divergence between processor and memory performance has been a well-discussed aspect of computer architecture literature for some years. The recent use of multi-core processor designs has, however, brought new problems to the design of memory architectures - as more cores are added to each successive generation of processor, equivalent improvements in memory capacity and memory sub-systems must be made if the compute components of the processor are to remain sufficiently supplied with data. These issues, combined with the traditional problem of designing cache-efficient code, help to ensure that memory remains an ongoing challenge for application and machine designers. In this paper we present a comprehensive discussion of WMTools - a trace-based toolkit designed to support the analysis of memory allocation for parallel applications. This paper features an extended discussion of the WMTrace tracing tool presented in previous work, including a revised discussion of trace compression and several refinements to the tracing methodology to reduce overheads and improve tool scalability. The second half of this paper features a case study in which we apply WMTools to five parallel scientific applications and benchmarks, demonstrating its effectiveness at recording high-water mark memory consumption as well as per-function memory use over time. An in-depth analysis is provided for an unstructured mesh benchmark, which reveals significant memory allocation imbalance across its participating processes. This study demonstrates the use of WMTools in elucidating memory allocation issues in high-performance scientific codes.
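As a rough illustration of what "high-water mark memory consumption" means for a trace-based tool, the sketch below replays a list of allocation events and reports the largest number of bytes live at any one time. It is not WMTools or WMTrace code: the event format `(op, address, size)`, the function name `high_water_mark` and the sample trace are all assumptions made for this example.

```python
def high_water_mark(events):
    """Return the peak number of live bytes over a sequence of events.

    Each event is a hypothetical (op, address, size) tuple, where op is
    "malloc" or "free". The size field is ignored for "free" events; the
    size recorded at allocation time is used instead.
    """
    live = {}      # address -> size of the allocation currently at that address
    current = 0    # bytes live right now
    peak = 0       # largest value of `current` seen so far
    for op, address, size in events:
        if op == "malloc":
            live[address] = size
            current += size
        elif op == "free":
            # Frees of unknown addresses are treated as no-ops in this sketch.
            current -= live.pop(address, 0)
        peak = max(peak, current)
    return peak


# Illustrative trace for a single process (made-up addresses and sizes).
trace = [
    ("malloc", 0x1000, 256),
    ("malloc", 0x2000, 1024),
    ("free",   0x1000, 0),
    ("malloc", 0x3000, 512),
    ("free",   0x2000, 0),
    ("free",   0x3000, 0),
]

print(high_water_mark(trace))
```

Running the same replay on one trace per process and comparing the resulting peaks is one simple way to expose the kind of allocation imbalance across processes that the abstract describes.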