Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of
terascale integration. Among emerging killer applications, parallel graph
processing has been a critical technique to analyze connected data. In this
paper, we empirically evaluate various computing platforms including an Intel
Xeon E5 CPU, an NVIDIA GeForce GTX 1070 GPU, and a Xeon Phi 7210 processor
codenamed Knights Landing (KNL) in the domain of parallel graph processing. We
show that the KNL achieves encouraging performance when processing graphs, making
it a promising solution for accelerating multi-threaded graph
applications. We further characterize the impact of KNL architectural
enhancements on the performance of a state-of-the-art graph framework. We make
four key observations: (1) Different graph applications require different
numbers of threads to reach peak performance, and for the same application,
different datasets need different numbers of threads to achieve the best
performance. (2) Only a few graph applications benefit from the high-bandwidth
MCDRAM, while others favor the low-latency DDR4 DRAM. (3) Vector processing units
executing AVX-512 SIMD instructions on KNLs are underutilized when running the
state-of-the-art graph framework. (4) The sub-NUMA cache clustering mode, which offers
the lowest local memory access latency, hurts the performance of graph
benchmarks that lack NUMA awareness. Finally, we suggest future work,
including system auto-tuning tools and graph framework optimizations, to fully
exploit the potential of KNL for parallel graph processing.
Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance
Characterization of Multi-threaded Graph Processing Applications on
Many-Integrated-Core Architecture," 2018 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), Belfast, United
Kingdom, 2018, pp. 199-20
Master of Science thesis
To understand and optimize the performance of complex applications and to achieve sustained application performance across different architectures, we need performance models and tools that can quantify theoretical performance and the gap between theoretical and observed performance. This thesis proposes a benchmark-driven Roofline Model Toolkit that provides theoretical and achievable performance, and the gap between them, for multicore, manycore, and accelerated architectures. Roofline micro-benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these micro-benchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism (TLP), instruction-level parallelism (ILP), and explicit Single Instruction, Multiple Data (SIMD) parallelism, measured in the context of the compilers and runtime environment on the target architecture. We also developed benchmarks to explore detailed memory subsystem behavior and to evaluate parallelization overhead. Beyond on-chip performance, we measure sustained Peripheral Component Interconnect Express (PCIe) throughput with four Graphics Processing Unit (GPU) memory management mechanisms. By combining results from the architecture characterization with the Roofline Model based solely on architectural specification, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline Model when run on a Blue Gene/Q architecture.
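As a rough illustration of the Roofline bound that such a toolkit builds on, the Python sketch below computes attainable performance as the minimum of the compute ceiling and memory bandwidth times arithmetic intensity, then reports the gap to an observed value. The function names and all numbers are hypothetical placeholders, not values from the thesis or from any measured machine.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Classic Roofline bound.

    peak_gflops   -- compute ceiling in GFLOP/s (hypothetical input)
    bandwidth_gbs -- sustained memory bandwidth in GB/s (hypothetical input)
    intensity     -- arithmetic intensity in FLOPs per byte moved
    """
    if intensity < 0:
        raise ValueError("arithmetic intensity must be non-negative")
    # A kernel is limited either by compute or by data movement.
    return min(peak_gflops, bandwidth_gbs * intensity)


def performance_gap(observed_gflops: float, peak_gflops: float,
                    bandwidth_gbs: float, intensity: float) -> float:
    """Gap between the Roofline bound and an observed measurement."""
    return attainable_gflops(peak_gflops, bandwidth_gbs, intensity) - observed_gflops


if __name__ == "__main__":
    # Illustrative numbers only: a 200 GFLOP/s ceiling and 32 GB/s of bandwidth.
    for ai in (0.25, 1.0, 16.0):
        bound = attainable_gflops(200.0, 32.0, ai)
        print(f"intensity={ai:5.2f} FLOP/byte -> bound={bound:6.1f} GFLOP/s")
```

In a benchmark-driven setting, the two ceilings would come from measured micro-benchmark results rather than from the vendor specification, which is what separates achievable from theoretical performance.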
Power, Performance, and Energy Management of Heterogeneous Architectures
abstract: Modern many-core multiprocessor systems-on-chip offer tremendous power and performance
optimization opportunities by tuning thousands of potential voltage, frequency
and core configurations. Applications running on these architectures are becoming increasingly
complex. As the basic building blocks that make up an application change at
runtime, different configurations may become optimal with respect to power, performance
or other metrics. Identifying the optimal configuration at runtime is a daunting task due
to the large number of workloads and configurations. Therefore, there is a strong need to
evaluate the metrics of interest as a function of the supported configurations.
This thesis focuses on two different types of modern multiprocessor systems-on-chip
(SoC): mobile heterogeneous systems and the tile-based Intel Xeon Phi architecture.
For mobile heterogeneous systems, this thesis presents a novel methodology that can
accurately instrument different types of applications with specific performance monitoring
calls. These calls provide a rich set of performance statistics at a basic block level while the
application runs on the target platform. The target architecture used for this work (Odroid
XU3) is capable of running at 4940 different frequency and core combinations. Using the
instrumented applications, a vast amount of characterization data is collected that provides
details about performance, power and CPU state at every instrumented basic block
across 19 different types of applications. This data has enabled
two runtime schemes. The first provides a methodology to find optimal configurations
in heterogeneous architectures using classifiers and demonstrates an average increase
of 93%, 81% and 6% in performance per watt (PPW) compared to the interactive, ondemand and
powersave governors, respectively. The second, using the same data, presents a novel imitation
learning framework for dynamically controlling the type, number, and frequencies
of active cores to achieve an average PPW improvement of 109% compared to the default
governors.
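To make the performance-per-watt metric concrete, the Python sketch below computes PPW for a set of profiled configurations, picks the best one by exhaustive search, and expresses the improvement over a baseline as a percentage. The exhaustive search is only a stand-in for the classifier and imitation-learning schemes described above, and the configuration labels and numbers are hypothetical, not data from the thesis.

```python
from typing import Iterable, NamedTuple


class Sample(NamedTuple):
    config: str         # hypothetical label for a frequency/core combination
    performance: float  # any throughput metric, e.g. instructions per second
    power_watts: float  # average power drawn while running


def ppw(sample: Sample) -> float:
    """Performance per watt for one profiled configuration."""
    if sample.power_watts <= 0:
        raise ValueError("power must be positive")
    return sample.performance / sample.power_watts


def best_config(samples: Iterable[Sample]) -> Sample:
    """Exhaustive search for the configuration with the highest PPW."""
    return max(samples, key=ppw)


def ppw_improvement_percent(candidate: Sample, baseline: Sample) -> float:
    """Relative PPW gain of candidate over baseline, in percent."""
    return 100.0 * (ppw(candidate) / ppw(baseline) - 1.0)


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    baseline = Sample("governor default", performance=2.0, power_watts=2.0)
    candidates = [
        Sample("4 little cores, low freq", performance=1.5, power_watts=1.0),
        Sample("2 big cores, mid freq", performance=4.0, power_watts=2.0),
        Sample("4 big cores, high freq", performance=6.0, power_watts=4.0),
    ]
    winner = best_config(candidates)
    print(winner.config, ppw_improvement_percent(winner, baseline))
```

With thousands of configurations and phases that change at runtime, an exhaustive search like this is only practical offline, which is the motivation for learned policies that predict a good configuration directly.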
This work also presents how to accurately profile the tile-based Intel Xeon Phi architecture
while training different types of neural networks using an open image dataset on a deep learning
framework. The data collected allows deep exploratory analysis and shows how
different hardware parameters affect the performance of Xeon Phi.
Dissertation/Thesis
Masters Thesis Engineering 201