488 research outputs found

    Developing performance-portable molecular dynamics kernels in OpenCL

    Get PDF
    This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia's miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs. We demonstrate that the performance bottlenecks of miniMD's short-range force calculation kernel are the same across these architectures, and detail a number of platform-agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture.
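    The short-range force pattern at the heart of the kernel described above can be sketched in plain Python (an illustrative sketch of the kernel's structure only, not the paper's OpenCL code; all names and parameter values here are our assumptions):

```python
def lj_forces(positions, cutoff=2.5, epsilon=1.0, sigma=1.0):
    """Naive O(N^2) Lennard-Jones force accumulation with a radial cutoff,
    the computation that miniMD-style short-range force kernels perform
    per atom pair (real codes use neighbor lists to avoid the N^2 scan)."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    rc2 = cutoff * cutoff
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = xi - positions[j][0]
            dy = yi - positions[j][1]
            dz = zi - positions[j][2]
            r2 = dx * dx + dy * dy + dz * dz
            # The cutoff test is the data-dependent branch such kernels
            # must handle efficiently on both CPUs and GPUs.
            if r2 > rc2:
                continue
            sr2 = (sigma * sigma) / r2
            sr6 = sr2 ** 3
            # Force expressed as a scalar factor on the displacement vector
            f = 24.0 * epsilon * sr6 * (2.0 * sr6 - 1.0) / r2
            forces[i][0] += f * dx
            forces[i][1] += f * dy
            forces[i][2] += f * dz
    return forces
```

    Two atoms closer than the potential minimum repel each other, and forces cancel pairwise (Newton's third law), which is a quick sanity check for any reimplementation.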

    On the design of architecture-aware algorithms for emerging applications

    Get PDF
    This dissertation maps various kernels and applications to a spectrum of programming models and architectures, and also presents architecture-aware algorithms for different systems. The kernels and applications discussed in this dissertation have widely varying computational characteristics; for example, we consider both dense numerical computations and sparse graph algorithms. This dissertation also covers emerging applications from image processing, complex network analysis, and computational biology. We map these problems to diverse multicore processors and manycore accelerators. We also use new programming models (such as Transactional Memory, MapReduce, and Intel TBB) to address the performance and productivity challenges in these problems. Our experiences highlight the importance of mapping applications to appropriate programming models and architectures. We also identify several limitations of current system software and architectures, and directions for improving them. The discussion focuses on system software and architectural support for nested irregular parallelism, Transactional Memory, and hybrid data transfer mechanisms. We believe that the complexity of parallel programming can be significantly reduced via collaborative efforts among researchers and practitioners from different domains. This dissertation contributes to these efforts by providing benchmarks and suggestions to improve system software and architectures.
    Ph.D. Committee Chair: Bader, David; Committee Members: Hong, Bo; Riley, George; Vuduc, Richard; Wills, Scot

    Parallelism Management for Memory-Intensive Applications on Integrated CPU/GPU Architectures

    Get PDF
    Master's thesis -- Seoul National University Graduate School, Department of Computer Science and Engineering, College of Engineering, August 2021. Advisor: Bernhard Egger.
    Integrated architectures combine CPU and GPU cores on a single processor die. Thanks to the shared memory architecture, costly memory copies to GPU device memory can be avoided, and programmers can use both CPU and GPU cores to accelerate their applications. For memory-intensive applications, however, utilizing all available core resources does not always maximize performance, due to congestion in the shared memory system. Tuning an application to use the optimal number of CPU and GPU compute resources is a difficult and machine-dependent task. This thesis presents an automated system that auto-tunes an OpenCL kernel for a given integrated architecture on the fly. A lightweight compiler extracts the characteristics of a kernel as it is submitted to the OpenCL runtime. We then use a software-based technique to automatically rewrite the kernel so that it utilizes only the compute resources that a model predicts will maximize performance. Our analysis shows that the memory access pattern and cache utilization are the decisive factors in determining the parallelism of a kernel. To accommodate the varying properties of different integrated architectures, we employ machine learning to create a model per architecture. The models are trained with microbenchmarks that generate a large number of varying memory patterns, and evaluated with OpenCL benchmarks from different benchmark suites. While the presented approach shows good results for older integrated architectures such as Intel Skylake and AMD Kaveri, it does not yet achieve satisfactory results on the more recent Intel Comet Lake and AMD Picasso architectures; the analysis of memory access patterns nevertheless offers useful insights.
    Contents: Chapter 1, Introduction; Chapter 2, Background (2.1 Integrated CPU-GPU Architecture; 2.2 OpenCL Programming); Chapter 3, Implementation (3.1 Overview; 3.2 Kernel Modification; 3.3 GPU Throughput Analysis: 3.3.1 Intra-Thread, 3.3.2 Inter-Thread; 3.4 Prediction Model: 3.4.1 Feature Selection, 3.4.2 Model Training); Chapter 4, Experiment (4.1 Throughput Measurement; 4.2 Target Device; 4.3 Micro-benchmark Results; 4.4 Real Benchmark Results; 4.5 Result Analysis); Chapter 5, Conclusion; Bibliography; Abstract in Korean.
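    The kernel-rewriting idea the thesis describes, using fewer workers than there are data elements, can be illustrated with a persistent-worker strided loop. This is a hypothetical plain-Python sketch (the function name and the transformation details are our assumptions, not the thesis's code):

```python
def strided_map(f, data, units):
    """Apply f to every element of data using only `units` logical workers.
    Instead of one work-item per element, each persistent worker loops over
    its share with a stride equal to the worker count, which is how a kernel
    can be rewritten to occupy a limited number of compute resources."""
    out = [None] * len(data)
    for worker in range(units):          # in OpenCL these would run concurrently
        for i in range(worker, len(data), units):
            out[i] = f(data[i])
    return out
```

    The output is identical for any worker count; only the degree of parallelism (and hence pressure on the shared memory system) changes, which is the tuning knob the thesis's model tries to set.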

    Improving GPGPU Energy-Efficiency through Concurrent Kernel Execution and DVFS

    Get PDF
    Master's thesis (Master of Science).

    Impact of Thread Scheduling on Modern GPUs

    Get PDF
    The Graphics Processing Unit (GPU) has become an increasingly important component in high-performance computing systems, as it significantly accelerates data- and compute-intensive applications at lower cost and power. The GPU achieves high performance by executing a massive number of threads in parallel in an SPMD (Single Program Multiple Data) fashion. Threads are grouped into workgroups by the programmer, and workgroups are then assigned to compute cores on the GPU by hardware. Once assigned, a workgroup is further subdivided by hardware into wavefronts of a fixed number of threads, which execute in a SIMD (Single Instruction Multiple Data) fashion. In this thesis, we investigate the impact of thread scheduling (at the workgroup and wavefront level) on overall hardware utilization and performance. We implement four different thread schedulers: a two-level wavefront scheduler, a lookahead wavefront scheduler, a combined two-level + lookahead wavefront scheduler, and a block workgroup scheduler. We implement and test these schedulers on Multi2Sim, a cycle-accurate, detailed architectural simulator targeting AMD's latest Graphics Core Next (GCN) architecture. Our extensive evaluation and analysis show that some of these alternate mechanisms improve the cache hit rate by an average of 30% compared to the baseline round-robin scheduler, drastically reducing the number of stalls caused by long-latency memory operations. We also observe that some of these schedulers improve overall performance by an average of 17% compared to the baseline.
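    The two-level wavefront scheduling idea can be illustrated with a toy model, a Python sketch of our own (not Multi2Sim code): only a small active set of wavefronts issues instructions, and a wavefront that issues a long-latency memory operation is demoted to a pending queue so ready wavefronts can hide the latency.

```python
from collections import deque

def two_level_schedule(wavefronts, active_size=2):
    """Toy two-level wavefront scheduler.
    wavefronts[i] is a trace string: 'c' = compute op, 'm' = memory op.
    Returns the issue order as (wavefront_id, op) pairs."""
    pending = deque(range(len(wavefronts)))  # second level: waiting wavefronts
    active = deque()                         # first level: eligible to issue
    pcs = [0] * len(wavefronts)              # per-wavefront program counter
    issued = []
    while pending or active:
        # Promote pending wavefronts until the active set is full.
        while len(active) < active_size and pending:
            active.append(pending.popleft())
        wf = active.popleft()
        op = wavefronts[wf][pcs[wf]]
        issued.append((wf, op))
        pcs[wf] += 1
        if pcs[wf] < len(wavefronts[wf]):
            if op == 'm':
                pending.append(wf)  # demote: let others hide the memory latency
            else:
                active.append(wf)   # stay in the active set
    return issued
```

    In the trace below, wavefront 2 only gets to issue once an earlier wavefront leaves the active set, which is the mechanism that keeps a fresh supply of ready instructions in front of the execution units.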

    Exploring Application Performance on Emerging Hybrid-Memory Supercomputers

    Full text link
    Next-generation supercomputers will feature more hierarchical and heterogeneous memory systems, with different memory technologies working side by side. A critical question is whether, at large scale, existing HPC applications and emerging data-analytics workloads will see performance improvements or degradations on these systems. We propose a systematic and fair methodology to identify the trend of application performance on emerging hybrid-memory systems. We model the memory system of next-generation supercomputers as a combination of "fast" and "slow" memories. We then analyze the performance and dynamic execution characteristics of a variety of workloads, from traditional scientific applications to emerging data analytics, to compare traditional and hybrid-memory systems. Our results show that data-analytics applications can clearly benefit from the new system design, especially at large scale. Moreover, hybrid-memory systems do not penalize traditional scientific applications, which may also show performance improvement.
    Comment: 18th International Conference on High Performance Computing and Communications, IEEE, 201
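    The "fast"/"slow" memory abstraction can be made concrete with a back-of-the-envelope bandwidth model (our own simplification for illustration, not the paper's methodology; the bandwidth figures in the example are made up):

```python
def memory_time(bytes_total, frac_fast, bw_fast_gbs, bw_slow_gbs):
    """Seconds spent moving data in a two-tier memory system:
    traffic served from each tier divided by that tier's bandwidth.
    frac_fast is the fraction of traffic hitting the fast tier."""
    gb = bytes_total / 1e9
    return gb * frac_fast / bw_fast_gbs + gb * (1.0 - frac_fast) / bw_slow_gbs
```

    Even this crude model shows why placement matters: with a hypothetical 400 GB/s fast tier and 80 GB/s slow tier, moving 100 GB entirely through slow memory takes 5x longer than through fast memory, so an application's sensitivity hinges on what fraction of its traffic the fast tier can capture.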

    Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

    Full text link
    TensorFlow has been the most widely adopted machine/deep learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC); 2) gRPC+X, where X = InfiniBand Verbs, Message Passing Interface, or GPUDirect RDMA; and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (#6 on Top500). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training; 2) effect of batch size on scaling efficiency; 3) impact of the MPI library used for No-gRPC approaches; and 4) type and size of DNN architectures. Based on these experiments, we present two key insights: 1) overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.
    Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-review
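    The Allreduce gradient aggregation that dominates No-gRPC performance can be illustrated with a minimal simulation of ring allreduce (our own sketch, not the MVAPICH, Horovod, or NCCL implementation): each of p ranks ends up with the elementwise sum of all inputs after 2*(p-1) neighbor exchanges, each moving 1/p of the data.

```python
def ring_allreduce(data):
    """Simulate ring allreduce on p ranks, one scalar per chunk.
    data[r] is rank r's input, split into p chunks. Each simulated step
    snapshots the state, since in a real ring all sends happen concurrently."""
    p = len(data)
    chunks = [list(d) for d in data]
    # Phase 1: reduce-scatter. After p-1 steps, rank r holds the full sum
    # of chunk (r+1) % p.
    for s in range(p - 1):
        snap = [row[:] for row in chunks]
        for r in range(p):
            left = (r - 1) % p
            c = (left - s) % p          # chunk the left neighbor sends this step
            chunks[r][c] += snap[left][c]
    # Phase 2: allgather. The fully reduced chunks travel around the ring
    # until every rank holds all of them.
    for s in range(p - 1):
        snap = [row[:] for row in chunks]
        for r in range(p):
            left = (r - 1) % p
            c = (left + 1 - s) % p      # fully reduced chunk being propagated
            chunks[r][c] = snap[left][c]
    return chunks
```

    Because each rank only ever exchanges 1/p of the gradient with a neighbor per step, total bytes sent per rank stay near-constant as p grows, which is what makes Allreduce-centric designs like Horovod scale better than centralized parameter aggregation.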