Developing performance-portable molecular dynamics kernels in OpenCL
This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia's miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs.
We demonstrate that the performance bottlenecks of miniMD's short-range force calculation kernel are the same across these architectures, and detail a number of platform-agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture.
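As a rough illustration of the arithmetic such a short-range force kernel performs, here is a naive Lennard-Jones pair loop in Python. This is a sketch only: miniMD's actual kernel is OpenCL C and uses neighbor lists rather than an O(N²) all-pairs loop, and the function name and parameters here are ours.

```python
def lj_forces(positions, cutoff, epsilon=1.0, sigma=1.0):
    """Naive O(N^2) Lennard-Jones short-range force loop.

    Illustrative sketch only: it shows the per-pair arithmetic a
    short-range force kernel performs inside the cutoff radius.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    cutsq = cutoff * cutoff
    for i in range(n):
        for j in range(i + 1, n):
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            rsq = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            if rsq < cutsq:
                sr2 = (sigma * sigma) / rsq
                sr6 = sr2 * sr2 * sr2
                # Pair force magnitude over r, written in powers of
                # 1/r^2 so no sqrt is needed in the inner loop.
                fpair = 24.0 * epsilon * sr6 * (2.0 * sr6 - 1.0) / rsq
                for k in range(3):
                    forces[i][k] += fpair * d[k]
                    forces[j][k] -= fpair * d[k]  # Newton's third law
    return forces
```

At the equilibrium separation r = 2^(1/6)·σ the pair force vanishes, which makes a convenient sanity check for any reimplementation.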
On the design of architecture-aware algorithms for emerging applications
This dissertation maps various kernels and applications to a spectrum of programming models and architectures and also presents architecture-aware algorithms for different systems. The kernels and applications discussed in this dissertation have widely varying computational characteristics. For example, we consider both dense numerical computations and sparse graph algorithms. This dissertation also covers emerging applications from image processing, complex network analysis, and computational biology.
We map these problems to diverse multicore processors and manycore accelerators. We also use new programming models (such as Transactional Memory, MapReduce, and Intel TBB) to address the performance and productivity challenges in these problems. Our experiences highlight the importance of mapping applications to appropriate programming models and architectures. We also identify several limitations of current system software and architectures, and directions to improve them. The discussion focuses on system software and architectural support for nested irregular parallelism, Transactional Memory, and hybrid data transfer mechanisms. We believe that the complexity of parallel programming can be significantly reduced via collaborative efforts among researchers and practitioners from different domains. This dissertation participates in these efforts by providing benchmarks and suggestions to improve system software and architectures.
Ph.D. Committee Chair: Bader, David; Committee Members: Hong, Bo; Riley, George; Vuduc, Richard; Wills, Scot
Parallelism management for memory-intensive applications on integrated CPU/GPU architectures
Thesis (Master's) -- Graduate School of Seoul National University: College of Engineering, Department of Computer Science and Engineering, 2021.8. Bernhard Egger.
Integrated architectures combine CPU and GPU cores onto a single processor die. Thanks to the shared memory architecture, costly memory copies to the GPU device memory can be avoided, and programmers can use both CPU and GPU cores to accelerate their applications.
Especially for memory-intensive applications, however, utilizing all available core resources does not always maximize performance due to congestion in the shared memory system. Tuning an application to use the optimal number of CPU and GPU compute resources is a difficult and machine-dependent task.
This thesis presents an automated system that auto-tunes an OpenCL kernel for a given integrated architecture on the fly. A light-weight compiler extracts the characteristics of a kernel as it is submitted to the OpenCL runtime. We then use a software-based technique to automatically rewrite the kernel so that it only utilizes the compute resources that are expected, based on a model, to maximize performance.
Our analysis shows that the memory access pattern and the cache utilization are the decisive factors when determining the parallelism of a kernel. To accommodate the varying properties of different integrated architectures, we employ machine learning to create a model per architecture. The models are trained with microbenchmarks that generate a large number of varying memory patterns, and evaluated with OpenCL benchmarks from different benchmark suites.
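The thesis trains per-architecture machine-learning models on microbenchmark results. As a much simpler stand-in for that idea, the sketch below maps hypothetical memory-pattern features (access stride in bytes, working-set size in MB) to a previously measured best core count via 1-nearest-neighbour lookup; both the feature choice and the 1-NN model are illustrative assumptions, not the thesis's actual model.

```python
def predict_core_count(features, measurements):
    """Predict the best number of compute cores for a kernel from its
    memory-access features via 1-nearest-neighbour lookup.

    'measurements' holds (feature_vector, best_core_count) pairs from
    prior microbenchmark runs on this architecture. The features
    (stride_bytes, footprint_mb) are illustrative placeholders.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, best_cores = min(measurements, key=lambda m: sq_dist(m[0], features))
    return best_cores

# Hypothetical microbenchmark results: (stride_bytes, footprint_mb) -> cores
runs = [((4, 0.5), 8), ((64, 4.0), 4), ((256, 32.0), 1)]
```

A real system would replace the lookup with a trained regression or classification model, but the interface is the same: kernel features in, predicted resource allocation out.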
While the presented approach shows good results for older integrated architectures such as Intel Skylake and AMD Kaveri, it does not yet achieve satisfactory results on the more recent Intel Comet Lake and AMD Picasso architectures. Nevertheless, the analysis of memory access patterns still offers useful insights.
Running applications concurrently on integrated graphics (Intel) or APU (AMD) systems, where the CPU and GPU share memory, appears efficient because it avoids the time spent copying memory between the two devices. For memory-intensive applications, however, the two devices, which have different memory access characteristics, access the same memory simultaneously and create a bottleneck: cores stall while waiting for the data their work requires, which lowers performance.
In this thesis we introduce a method that uses only a limited number of cores through software code transformation. We show that adjusting the core count with this method can yield better results than using all cores, confirm that an application's memory access pattern is related to the optimal core count, and analyze that pattern from the code.
Performance is also related to the characteristics of the system. We therefore generated microbenchmarks covering the possible memory patterns and used their measured results on each system to train a machine learning model. Because the memory-pattern features were insufficient, however, the model did not produce good predictions. Even so, the work remains meaningful as an analysis of memory access on integrated CPU-GPU architectures.
Chapter 1 Introduction 1
Chapter 2 Background 4
2.1 Integrated CPU-GPU Architecture . . . . . . . . . . . . . . . . . 4
2.2 OpenCL Programming . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 3 Implementation 8
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Kernel Modification . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 GPU Throughput Analysis . . . . . . . . . . . . . . . . . . . . . 10
3.3.1 Intra-Thread . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3.2 Inter-Thread . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . 16
3.4.2 Model Training . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 4 Experiment 19
4.1 Throughput Measurement . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Target Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Micro-benchmark Results . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Real Benchmark Results . . . . . . . . . . . . . . . . . . . . . . . 21
4.5 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Chapter 5 Conclusion 24
Bibliography 24
Abstract (in Korean) 28
Improving GPGPU Energy-Efficiency through Concurrent Kernel Execution and DVFS
Master's thesis (Master of Science)
Impact Of Thread Scheduling On Modern Gpus
The Graphics Processing Unit (GPU) has become an increasingly important component in high-performance computing systems, as it significantly accelerates data- and compute-intensive applications at lower cost and power. The GPU achieves high performance by executing a massive number of threads in parallel in an SPMD (Single Program Multiple Data) fashion. Threads are grouped into workgroups by the programmer, and workgroups are then assigned to compute cores on the GPU by hardware. Once assigned, a workgroup is further subdivided by hardware into wavefronts of a fixed number of threads, which execute in a SIMD (Single Instruction Multiple Data) fashion. In this thesis, we investigate the impact of thread scheduling (at the workgroup and wavefront level) on overall hardware utilization and performance. We implement four different thread schedulers: a two-level wavefront scheduler, a lookahead wavefront scheduler, a combined two-level + lookahead wavefront scheduler, and a block workgroup scheduler. We implement and test these schedulers on Multi2Sim, a cycle-accurate, detailed architectural simulator, targeting AMD's latest Graphics Core Next (GCN) architecture. Our extensive evaluation and analysis show that with some of these alternate mechanisms, the cache hit rate improves by an average of 30% compared to the baseline round-robin scheduler, drastically reducing the number of stalls caused by long-latency memory operations. We also observe that some of these schedulers improve overall performance by an average of 17% compared to the baseline.
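A two-level wavefront scheduler of the kind evaluated here keeps a small active set of wavefronts and demotes any wavefront that issues a long-latency memory operation, so other wavefronts' work hides the latency. The toy simulation below is our own sketch of that mechanism on symbolic 'alu'/'mem' instruction streams, not Multi2Sim code.

```python
from collections import deque

def two_level_schedule(wavefronts, active_size):
    """Toy two-level wavefront scheduler (illustrative sketch).

    Issues one op per cycle, round-robin over a small 'active' set.
    A wavefront that issues a long-latency 'mem' op is demoted to
    the 'pending' queue and a pending wavefront is promoted in its
    place. Returns the issue trace as (wavefront_id, op) tuples.
    """
    active = deque()
    pending = deque(range(len(wavefronts)))
    while pending and len(active) < active_size:
        active.append(pending.popleft())
    progress = [0] * len(wavefronts)
    trace = []
    while active or pending:
        if not active:
            active.append(pending.popleft())
        wf = active.popleft()
        if progress[wf] == len(wavefronts[wf]):
            if pending:                       # wavefront done: backfill
                active.append(pending.popleft())
            continue
        op = wavefronts[wf][progress[wf]]
        progress[wf] += 1
        trace.append((wf, op))
        if op == 'mem' and progress[wf] < len(wavefronts[wf]):
            pending.append(wf)                # demote on long-latency op
            active.append(pending.popleft())  # promote a replacement
        else:
            active.append(wf)                 # stay resident, go to back
    return trace
```

Running it on three short wavefronts shows the key behaviour: right after a wavefront issues a 'mem' op, a different wavefront gets the next issue slot.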
Exploring Application Performance on Emerging Hybrid-Memory Supercomputers
Next-generation supercomputers will feature more hierarchical and heterogeneous memory systems, with different memory technologies working side-by-side. A critical question is whether, at large scale, existing HPC applications and emerging data-analytics workloads will see performance improvements or degradation on these systems. We propose a systematic and fair methodology to identify the trend of application performance on emerging hybrid-memory systems. We model the memory system of next-generation supercomputers as a combination of "fast" and "slow" memories. We then analyze the performance and dynamic execution characteristics of a variety of workloads, from traditional scientific applications to emerging data analytics, to compare traditional and hybrid-memory systems. Our results show that data-analytics applications can clearly benefit from the new system design, especially at large scale. Moreover, hybrid-memory systems do not penalize traditional scientific applications, which may also show performance improvements.
Comment: 18th International Conference on High Performance Computing and Communications, IEEE, 201
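The "fast"/"slow" memory abstraction can be illustrated with a back-of-envelope bandwidth model (our simplification for intuition, not the paper's methodology): for a bandwidth-bound workload, streaming time is the traffic served by each memory divided by that memory's bandwidth.

```python
def stream_time(total_bytes, fast_fraction, bw_fast, bw_slow):
    """Time to stream total_bytes when fast_fraction of the traffic
    is served by the fast memory and the rest by the slow memory.
    Assumes a purely bandwidth-bound workload with non-overlapping
    transfers -- a deliberate simplification.
    """
    return (total_bytes * fast_fraction / bw_fast
            + total_bytes * (1.0 - fast_fraction) / bw_slow)

# Example: 100 GB of traffic, 400 GB/s fast memory, 100 GB/s slow memory.
slow_only = stream_time(100e9, 0.0, 400e9, 100e9)  # 1.0 s
hybrid = stream_time(100e9, 0.5, 400e9, 100e9)     # 0.625 s
speedup = slow_only / hybrid                       # 1.6x
```

Even this crude model shows why workload placement matters: the benefit of the fast memory is capped by the fraction of traffic it actually serves, an Amdahl-style limit.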
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X, where X = (InfiniBand Verbs, Message Passing Interface, or GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (number 6 on the Top500 list). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training, 2) effect of batch size on scaling efficiency, 3) impact of the MPI library used for No-gRPC approaches, and 4) type and size of DNN architectures. Based on these experiments, we present two key insights: 1) overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-review
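The Allreduce that dominates No-gRPC gradient aggregation is commonly implemented as a ring allreduce: a reduce-scatter phase followed by an allgather phase around a logical ring of ranks. The sketch below simulates that communication pattern in plain Python; it is illustrative of the general algorithm only and unrelated to the paper's CUDA-Aware MPI implementation.

```python
def ring_allreduce(ranks):
    """Simulate ring allreduce: P ranks, each holding its local
    gradient pre-split into P chunks (a list of P lists of numbers).

    Phase 1 (reduce-scatter) leaves rank r with the fully reduced
    chunk (r + 1) % P; phase 2 (allgather) circulates those complete
    chunks so every rank ends with the full element-wise sum. Each
    rank sends 2 * (P - 1) chunks in total, which is what makes the
    ring bandwidth-efficient for large messages.
    """
    P = len(ranks)
    data = [[list(chunk) for chunk in rank] for rank in ranks]
    # Phase 1: reduce-scatter, P - 1 steps around the ring.
    for step in range(P - 1):
        for r in range(P):
            idx = (r - step) % P   # chunk that rank r forwards this step
            dst = (r + 1) % P      # ring neighbour
            data[dst][idx] = [a + b for a, b in
                              zip(data[dst][idx], data[r][idx])]
    # Phase 2: allgather, P - 1 steps forwarding complete chunks.
    for step in range(P - 1):
        for r in range(P):
            idx = (r + 1 - step) % P
            dst = (r + 1) % P
            data[dst][idx] = list(data[r][idx])
    return data
```

With three ranks each holding a three-chunk gradient, every rank finishes with the element-wise sum of all three local gradients, which is exactly the semantics a distributed optimizer needs for averaged gradient updates.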