8,693 research outputs found
Efficient multicore-aware parallelization strategies for iterative stencil computations
Stencil computations consume a major part of runtime in many scientific
simulation codes. As prototypes for this class of algorithms we consider the
iterative Jacobi and Gauss-Seidel smoothers and aim at highly efficient
parallel implementations for cache-based multicore architectures. Temporal
cache blocking is a known advanced optimization technique, which can reduce the
pressure on the memory bus significantly. We apply and refine this optimization
for a recently presented temporal blocking strategy designed to explicitly
utilize multicore characteristics. Especially for the case of Gauss-Seidel
smoothers we show that simultaneous multi-threading (SMT) can yield substantial
performance improvements for our optimized algorithm.Comment: 15 pages, 10 figure
Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of
terascale integration. Among emerging killer applications, parallel graph
processing has been a critical technique to analyze connected data. In this
paper, we empirically evaluate various computing platforms including an Intel
Xeon E5 CPU, a Nvidia Geforce GTX1070 GPU and an Xeon Phi 7210 processor
codenamed Knights Landing (KNL) in the domain of parallel graph processing. We
show that the KNL gains encouraging performance when processing graphs, so that
it can become a promising solution to accelerating multi-threaded graph
applications. We further characterize the impact of KNL architectural
enhancements on the performance of a state-of-the art graph framework.We have
four key observations: 1 Different graph applications require distinctive
numbers of threads to reach the peak performance. For the same application,
various datasets need even different numbers of threads to achieve the best
performance. 2 Only a few graph applications benefit from the high bandwidth
MCDRAM, while others favor the low latency DDR4 DRAM. 3 Vector processing units
executing AVX512 SIMD instructions on KNLs are underutilized when running the
state-of-the-art graph framework. 4 The sub-NUMA cache clustering mode offering
the lowest local memory access latency hurts the performance of graph
benchmarks that are lack of NUMA awareness. At last, We suggest future works
including system auto-tuning tools and graph framework optimizations to fully
exploit the potential of KNL for parallel graph processing.Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance
Characterization of Multi-threaded Graph Processing Applications on
Many-Integrated-Core Architecture," 2018 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), Belfast, United
Kingdom, 2018, pp. 199-20
OS Scheduling Algorithms for Memory Intensive Workloads in Multi-socket Multi-core servers
Major chip manufacturers have all introduced multicore microprocessors.
Multi-socket systems built from these processors are routinely used for running
various server applications. Depending on the application that is run on the
system, remote memory accesses can impact overall performance. This paper
presents a new operating system (OS) scheduling optimization to reduce the
impact of such remote memory accesses. By observing the pattern of local and
remote DRAM accesses for every thread in each scheduling quantum and applying
different algorithms, we come up with a new schedule of threads for the next
quantum. This new schedule potentially cuts down remote DRAM accesses for the
next scheduling quantum and improves overall performance. We present three such
new algorithms of varying complexity followed by an algorithm which is an
adaptation of Hungarian algorithm. We used three different synthetic workloads
to evaluate the algorithm. We also performed sensitivity analysis with respect
to varying DRAM latency. We show that these algorithms can cut down DRAM access
latency by up to 55% depending on the algorithm used. The benefit gained from
the algorithms is dependent upon their complexity. In general higher the
complexity higher is the benefit. Hungarian algorithm results in an optimal
solution. We find that two out of four algorithms provide a good trade-off
between performance and complexity for the workloads we studied
LIKWID: Lightweight Performance Tools
Exploiting the performance of today's microprocessors requires intimate
knowledge of the microarchitecture as well as an awareness of the ever-growing
complexity in thread and cache topology. LIKWID is a set of command line
utilities that addresses four key problems: Probing the thread and cache
topology of a shared-memory node, enforcing thread-core affinity on a program,
measuring performance counter metrics, and microbenchmarking for reliable upper
performance bounds. Moreover, it includes a mpirun wrapper allowing for
portable thread-core affinity in MPI and hybrid MPI/threaded applications. To
demonstrate the capabilities of the tool set we show the influence of thread
affinity on performance using the well-known OpenMP STREAM triad benchmark, use
hardware counter tools to study the performance of a stencil code, and finally
show how to detect bandwidth problems on ccNUMA-based compute nodes.Comment: 12 page
Blocked All-Pairs Shortest Paths Algorithm on Intel Xeon Phi KNL Processor: A Case Study
Manycores are consolidating in HPC community as a way of improving
performance while keeping power efficiency. Knights Landing is the recently
released second generation of Intel Xeon Phi architecture. While optimizing
applications on CPUs, GPUs and first Xeon Phi's has been largely studied in the
last years, the new features in Knights Landing processors require the revision
of programming and optimization techniques for these devices. In this work, we
selected the Floyd-Warshall algorithm as a representative case study of graph
and memory-bound applications. Starting from the default serial version, we
show how data, thread and compiler level optimizations help the parallel
implementation to reach 338 GFLOPS.Comment: Computer Science - CACIC 2017. Springer Communications in Computer
and Information Science, vol 79
Multicore-optimized wavefront diamond blocking for optimizing stencil updates
The importance of stencil-based algorithms in computational science has
focused attention on optimized parallel implementations for multilevel
cache-based processors. Temporal blocking schemes leverage the large bandwidth
and low latency of caches to accelerate stencil updates and approach
theoretical peak performance. A key ingredient is the reduction of data traffic
across slow data paths, especially the main memory interface. In this work we
combine the ideas of multi-core wavefront temporal blocking and diamond tiling
to arrive at stencil update schemes that show large reductions in memory
pressure compared to existing approaches. The resulting schemes show
performance advantages in bandwidth-starved situations, which are exacerbated
by the high bytes per lattice update case of variable coefficients. Our thread
groups concept provides a controllable trade-off between concurrency and memory
usage, shifting the pressure between the memory interface and the CPU. We
present performance results on a contemporary Intel processor
- …