Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of
terascale integration. Among emerging killer applications, parallel graph
processing has been a critical technique to analyze connected data. In this
paper, we empirically evaluate various computing platforms including an Intel
Xeon E5 CPU, an NVIDIA GeForce GTX 1070 GPU, and a Xeon Phi 7210 processor
codenamed Knights Landing (KNL) in the domain of parallel graph processing. We
show that the KNL achieves encouraging performance when processing graphs, making
it a promising solution for accelerating multi-threaded graph
applications. We further characterize the impact of KNL architectural
enhancements on the performance of a state-of-the-art graph framework. We make
four key observations: (1) Different graph applications require different
numbers of threads to reach peak performance. For the same application,
different datasets also need different numbers of threads to achieve the best
performance. (2) Only a few graph applications benefit from the high-bandwidth
MCDRAM, while others favor the low-latency DDR4 DRAM. (3) Vector processing units
executing AVX-512 SIMD instructions on KNLs are underutilized when running the
state-of-the-art graph framework. (4) The sub-NUMA cache clustering mode, which
offers the lowest local memory access latency, hurts the performance of graph
benchmarks that lack NUMA awareness. Finally, we suggest future work,
including system auto-tuning tools and graph framework optimizations, to fully
exploit the potential of KNL for parallel graph processing.
Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance
Characterization of Multi-threaded Graph Processing Applications on
Many-Integrated-Core Architecture," 2018 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), Belfast, United
Kingdom, 2018, pp. 199-20
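Observation (1) and the suggested auto-tuning tools can be illustrated with a minimal sketch of a thread-count sweep. This is not the authors' tool: the function `best_thread_count`, the `measure` callback, and the synthetic cost model in the usage lines are hypothetical, and the numbers are not measured data.

```python
def best_thread_count(measure, candidates):
    """Return the candidate thread count with the lowest measured runtime.

    `measure` is a hypothetical callback: it takes a thread count and
    returns the runtime in seconds for one application/dataset pair.
    """
    timings = {n: measure(n) for n in candidates}
    best = min(timings, key=timings.get)
    return best, timings


# Synthetic cost model (an assumption, not measured data): work shrinks
# with more threads while synchronization overhead grows.
def synthetic_runtime(n_threads):
    return 100.0 / n_threads + 0.5 * n_threads


best, timings = best_thread_count(synthetic_runtime, [1, 2, 4, 8, 16, 32, 64])
print(best)  # thread count with the lowest synthetic runtime
```

In practice the sweep would have to be repeated per application and per dataset, since the abstract reports that the best thread count differs across both.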
A Phase Field Model for Continuous Clustering on Vector Fields
A new method for the simplification of flow fields is presented. It is based on continuous clustering. A well-known physical clustering model, the Cahn-Hilliard model, which describes phase separation, is modified to reflect the properties of the data to be visualized. Clusters are defined implicitly as connected components of the positivity set of a density function. An evolution equation for this function is obtained as a suitable gradient flow of an underlying anisotropic energy functional. Here, time serves as the scale parameter. The evolution is characterized by a successive coarsening of patterns (the actual clustering), during which the underlying simulation data specifies preferable pattern boundaries. We introduce specific physical quantities in the simulation to control the shape, orientation and distribution of the clusters as a function of the underlying flow field. In addition, the model is expanded to involve elastic effects. In the early stages of the evolution, a shear-layer-type representation of the flow field can thereby be generated, whereas, for later stages, the distribution of clusters can be influenced. Furthermore, we incorporate upwind ideas to give the clusters an oriented, drop-shaped appearance. Here, we discuss the applicability of this new type of approach mainly for flow fields, where the cluster energy penalizes cross-streamline boundaries. However, the method also carries provisions for other fields. The clusters can be displayed directly as a flow texture. Alternatively, the clusters can be visualized by iconic representations, which are positioned by using a skeletonization algorithm.
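The implicit cluster definition, connected components of the positivity set of a density function, can be sketched on a discrete grid. This is only an illustration under assumptions: the grid values are made up, 4-connectivity is chosen arbitrarily, and the function name `positivity_clusters` is hypothetical. It does not implement the Cahn-Hilliard evolution itself.

```python
from collections import deque


def positivity_clusters(density):
    """Label 4-connected components of the set {density > 0} on a 2D grid.

    Returns (labels, n_clusters); cells with density <= 0 keep label 0.
    """
    rows, cols = len(density), len(density[0])
    labels = [[0] * cols for _ in range(rows)]
    n_clusters = 0
    for r in range(rows):
        for c in range(cols):
            if density[r][c] <= 0 or labels[r][c]:
                continue
            # Start a new cluster and flood-fill it breadth-first.
            n_clusters += 1
            labels[r][c] = n_clusters
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and density[ny][nx] > 0 and not labels[ny][nx]):
                        labels[ny][nx] = n_clusters
                        queue.append((ny, nx))
    return labels, n_clusters


# Illustrative density samples (made up for this sketch).
density = [
    [0.8, 0.2, -0.5, -0.9],
    [0.1, -0.3, -0.7, 0.4],
    [-0.6, -0.8, 0.3, 0.9],
]
labels, n_clusters = positivity_clusters(density)
print(n_clusters)
```

As the density evolves in time and patterns coarsen, relabeling the positivity set in this way would yield fewer, larger clusters.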
A Parallel Adaptive P3M code with Hierarchical Particle Reordering
We discuss the design and implementation of HYDRA_OMP, a parallel
implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M)
code HYDRA. The code is designed primarily for conducting cosmological
hydrodynamic simulations and is written in Fortran77+OpenMP. A number of
optimizations for RISC processors and SMP-NUMA architectures have been
implemented, the most important optimization being hierarchical reordering of
particles within chaining cells, which greatly improves data locality, thereby
removing the cache misses typically associated with linked lists. Parallel
scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes
for a variety of modern SMP architectures. We give performance data in terms of
the number of particle updates per second, which is a more useful performance
metric than raw MFlops. A basic version of the code will be made available to
the community in the near future.
Comment: 34 pages, 12 figures, accepted for publication in Computer Physics
Communications
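The locality idea behind particle reordering can be sketched as follows. This is a simplified, single-level sort by chaining-cell index, written in Python rather than the code's Fortran77+OpenMP; it is not HYDRA_OMP's hierarchical scheme, and the names `reorder_by_cell`, `box_size`, and `cells_per_side` are assumptions made for illustration.

```python
def reorder_by_cell(positions, box_size, cells_per_side):
    """Sort particles so that those in the same chaining cell are contiguous.

    Returns (order, sorted_positions), where `order` holds the original
    particle indices in their new storage order.
    """
    cell_width = box_size / cells_per_side

    def cell_index(p):
        # Clamp so a particle exactly on the upper box edge stays in range.
        ix, iy, iz = (min(int(coord / cell_width), cells_per_side - 1)
                      for coord in p)
        return (ix * cells_per_side + iy) * cells_per_side + iz

    # sorted() is stable, so particles in the same cell keep their order.
    order = sorted(range(len(positions)),
                   key=lambda i: cell_index(positions[i]))
    return order, [positions[i] for i in order]


# Made-up particle positions in a unit box with 2 cells per side.
positions = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (0.2, 0.2, 0.2)]
order, sorted_positions = reorder_by_cell(positions, 1.0, 2)
print(order)
```

After such a reordering, a loop over one cell's particles reads a contiguous block of memory instead of chasing linked-list pointers, which is the locality benefit the abstract describes.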