Developing performance-portable molecular dynamics kernels in OpenCL
This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia's miniMD benchmark that achieves good levels of performance across a wide range of hardware: CPUs, discrete GPUs and integrated GPUs.
We demonstrate that the performance bottlenecks of miniMD's short-range force calculation kernel are the same across these architectures, and detail a number of platform-agnostic optimisations that improve its performance by at least 2x on all hardware considered. Our complete code is shown to be 1.7x faster than the original miniMD, and at most 2x slower than implementations individually hand-tuned for a specific architecture.
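As a rough illustration of the arithmetic such a short-range force kernel performs, here is a naive Lennard-Jones pair loop in Python. This is a sketch only: miniMD's actual kernel is OpenCL C and uses neighbor lists rather than an O(N²) all-pairs loop, and the function name and parameters here are ours.

```python
def lj_forces(positions, cutoff, epsilon=1.0, sigma=1.0):
    """Naive O(N^2) Lennard-Jones short-range force loop.

    Illustrative sketch only: it shows the per-pair arithmetic a
    short-range force kernel performs inside the cutoff radius.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    cutsq = cutoff * cutoff
    for i in range(n):
        for j in range(i + 1, n):
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            rsq = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
            if rsq < cutsq:
                sr2 = (sigma * sigma) / rsq
                sr6 = sr2 * sr2 * sr2
                # Pair force magnitude over r, written in powers of
                # 1/r^2 so no sqrt is needed in the inner loop.
                fpair = 24.0 * epsilon * sr6 * (2.0 * sr6 - 1.0) / rsq
                for k in range(3):
                    forces[i][k] += fpair * d[k]
                    forces[j][k] -= fpair * d[k]  # Newton's third law
    return forces
```

At the equilibrium separation r = 2^(1/6)·σ the pair force vanishes, which makes a convenient sanity check for any reimplementation.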
On the design of architecture-aware algorithms for emerging applications
This dissertation maps various kernels and applications to a spectrum of programming models and architectures and also presents architecture-aware algorithms for different systems. The kernels and applications discussed in this dissertation have widely varying computational characteristics. For example, we consider both dense numerical computations and sparse graph algorithms. This dissertation also covers emerging applications from image processing, complex network analysis, and computational biology.
We map these problems to diverse multicore processors and manycore accelerators. We also use new programming models (such as Transactional Memory, MapReduce, and Intel TBB) to address the performance and productivity challenges in these problems. Our experiences highlight the importance of mapping applications to appropriate programming models and architectures. We also identify several limitations of current system software and architectures, and directions to improve them. The discussion focuses on system software and architectural support for nested irregular parallelism, Transactional Memory, and hybrid data transfer mechanisms. We believe that the complexity of parallel programming can be significantly reduced via collaborative efforts among researchers and practitioners from different domains. This dissertation participates in these efforts by providing benchmarks and suggestions to improve system software and architectures.
Ph.D. Committee Chair: Bader, David; Committee Members: Hong, Bo; Riley, George; Vuduc, Richard; Wills, Scot
Parallelism management for memory-intensive applications on integrated CPU/GPU architectures
Thesis (Master's) -- Graduate School of Seoul National University: College of Engineering, Department of Computer Science and Engineering, 2021.8. Bernhard Egger.
Integrated architectures combine CPU and GPU cores onto a single processor die. Thanks to the shared memory architecture, costly memory copies to the GPU device memory can be avoided, and programmers can use both CPU and GPU cores to accelerate their applications.
Especially for memory-intensive applications, however, utilizing all available core resources does not always maximize performance due to congestion in the shared memory system. Tuning an application to use the optimal number of CPU and GPU compute resources is a difficult and machine-dependent task.
This thesis presents an automated system that auto-tunes an OpenCL kernel for a given integrated architecture on the fly. A light-weight compiler extracts the characteristics of a kernel as it is submitted to the OpenCL runtime. We then use a software-based technique to automatically rewrite the kernel so that it only utilizes the compute resources that are expected, based on a model, to maximize performance.
Our analysis shows that the memory access pattern and the cache utilization are the decisive factors when determining the parallelism of a kernel. To accommodate the varying properties of different integrated architectures, we employ machine learning to create a model per architecture. The models are trained with microbenchmarks that generate a large number of varying memory patterns, and evaluated with OpenCL benchmarks from different benchmark suites.
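The thesis trains per-architecture machine-learning models on microbenchmark results. As a much simpler stand-in for that idea, the sketch below maps hypothetical memory-pattern features (access stride in bytes, working-set size in MB) to a previously measured best core count via 1-nearest-neighbour lookup; both the feature choice and the 1-NN model are illustrative assumptions, not the thesis's actual model.

```python
def predict_core_count(features, measurements):
    """Predict the best number of compute cores for a kernel from its
    memory-access features via 1-nearest-neighbour lookup.

    'measurements' holds (feature_vector, best_core_count) pairs from
    prior microbenchmark runs on this architecture. The features
    (stride_bytes, footprint_mb) are illustrative placeholders.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, best_cores = min(measurements, key=lambda m: sq_dist(m[0], features))
    return best_cores

# Hypothetical microbenchmark results: (stride_bytes, footprint_mb) -> cores
runs = [((4, 0.5), 8), ((64, 4.0), 4), ((256, 32.0), 1)]
```

A real system would replace the lookup with a trained regression or classification model, but the interface is the same: kernel features in, predicted resource allocation out.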
While the presented approach shows good results for older integrated architectures such as Intel Skylake and AMD Kaveri, it does not yet achieve satisfactory results on the more recent Intel Comet Lake and AMD Picasso architectures. Nevertheless, the analysis of memory access patterns still offers useful insights.
Running applications concurrently on integrated graphics (Intel) or APU (AMD) systems, where the CPU and GPU share memory, appears efficient because it avoids the time spent copying memory between the two devices. For memory-intensive applications, however, the two devices, which have different memory access characteristics, access the same memory simultaneously and create a bottleneck: cores stall while waiting for the data their work requires, which lowers performance.
In this thesis we introduce a method that uses only a limited number of cores through software code transformation. We show that adjusting the core count with this method can yield better results than using all cores, confirm that an application's memory access pattern is related to the optimal core count, and analyze that pattern from the code.
Performance is also related to the characteristics of the system. We therefore generated microbenchmarks covering the possible memory patterns and used their measured results on each system to train a machine learning model. Because the memory-pattern features were insufficient, however, the model did not produce good predictions. Even so, the work remains meaningful as an analysis of memory access on integrated CPU-GPU architectures.
Chapter 1 Introduction 1
Chapter 2 Background 4
2.1 Integrated CPU-GPU Architecture . . . . . . . . . . . . . . . . . 4
2.2 OpenCL Programming . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 3 Implementation 8
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Kernel Modification . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 GPU Throughput Analysis . . . . . . . . . . . . . . . . . . . . . 10
3.3.1 Intra-Thread . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3.2 Inter-Thread . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Prediction Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4.1 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . 16
3.4.2 Model Training . . . . . . . . . . . . . . . . . . . . . . . . 17
Chapter 4 Experiment 19
4.1 Throughput Measurement . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Target Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Micro-benchmark Results . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Real Benchmark Results . . . . . . . . . . . . . . . . . . . . . . . 21
4.5 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Chapter 5 Conclusion 24
Bibliography 24
Abstract (in Korean) 28
Improving GPGPU Energy-Efficiency through Concurrent Kernel Execution and DVFS
Master's thesis (Master of Science)
Impact Of Thread Scheduling On Modern Gpus
The Graphics Processing Unit (GPU) has become an increasingly important component in high-performance computing systems, as it significantly accelerates data- and compute-intensive applications at lower cost and power. The GPU achieves high performance by executing a massive number of threads in parallel in an SPMD (Single Program Multiple Data) fashion. Threads are grouped into workgroups by the programmer, and workgroups are then assigned to compute cores on the GPU by hardware. Once assigned, a workgroup is further subdivided by hardware into wavefronts of a fixed number of threads, which execute in a SIMD (Single Instruction Multiple Data) fashion. In this thesis, we investigate the impact of thread scheduling (at the workgroup and wavefront level) on overall hardware utilization and performance. We implement four different thread schedulers: a two-level wavefront scheduler, a lookahead wavefront scheduler, a combined two-level + lookahead wavefront scheduler, and a block workgroup scheduler. We implement and test these schedulers on Multi2Sim, a cycle-accurate, detailed architectural simulator, targeting AMD's latest Graphics Core Next (GCN) architecture. Our extensive evaluation and analysis show that with some of these alternate mechanisms, the cache hit rate improves by an average of 30% compared to the baseline round-robin scheduler, drastically reducing the number of stalls caused by long-latency memory operations. We also observe that some of these schedulers improve overall performance by an average of 17% compared to the baseline.
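A two-level wavefront scheduler of the kind evaluated here keeps a small active set of wavefronts and demotes any wavefront that issues a long-latency memory operation, so other wavefronts' work hides the latency. The toy simulation below is our own sketch of that mechanism on symbolic 'alu'/'mem' instruction streams, not Multi2Sim code.

```python
from collections import deque

def two_level_schedule(wavefronts, active_size):
    """Toy two-level wavefront scheduler (illustrative sketch).

    Issues one op per cycle, round-robin over a small 'active' set.
    A wavefront that issues a long-latency 'mem' op is demoted to
    the 'pending' queue and a pending wavefront is promoted in its
    place. Returns the issue trace as (wavefront_id, op) tuples.
    """
    active = deque()
    pending = deque(range(len(wavefronts)))
    while pending and len(active) < active_size:
        active.append(pending.popleft())
    progress = [0] * len(wavefronts)
    trace = []
    while active or pending:
        if not active:
            active.append(pending.popleft())
        wf = active.popleft()
        if progress[wf] == len(wavefronts[wf]):
            if pending:                       # wavefront done: backfill
                active.append(pending.popleft())
            continue
        op = wavefronts[wf][progress[wf]]
        progress[wf] += 1
        trace.append((wf, op))
        if op == 'mem' and progress[wf] < len(wavefronts[wf]):
            pending.append(wf)                # demote on long-latency op
            active.append(pending.popleft())  # promote a replacement
        else:
            active.append(wf)                 # stay resident, go to back
    return trace
```

Running it on three short wavefronts shows the key behaviour: right after a wavefront issues a 'mem' op, a different wavefront gets the next issue slot.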
Exploring Application Performance on Emerging Hybrid-Memory Supercomputers
Next-generation supercomputers will feature more hierarchical and heterogeneous memory systems, with different memory technologies working side-by-side. A critical question is whether, at large scale, existing HPC applications and emerging data-analytics workloads will see performance improvements or degradation on these systems. We propose a systematic and fair methodology to identify the trend of application performance on emerging hybrid-memory systems. We model the memory system of next-generation supercomputers as a combination of "fast" and "slow" memories. We then analyze the performance and dynamic execution characteristics of a variety of workloads, from traditional scientific applications to emerging data analytics, to compare traditional and hybrid-memory systems. Our results show that data-analytics applications can clearly benefit from the new system design, especially at large scale. Moreover, hybrid-memory systems do not penalize traditional scientific applications, which may also show performance improvements.
Comment: 18th International Conference on High Performance Computing and Communications, IEEE, 201
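The "fast"/"slow" memory abstraction can be illustrated with a back-of-envelope bandwidth model (our simplification for intuition, not the paper's methodology): for a bandwidth-bound workload, streaming time is the traffic served by each memory divided by that memory's bandwidth.

```python
def stream_time(total_bytes, fast_fraction, bw_fast, bw_slow):
    """Time to stream total_bytes when fast_fraction of the traffic
    is served by the fast memory and the rest by the slow memory.
    Assumes a purely bandwidth-bound workload with non-overlapping
    transfers -- a deliberate simplification.
    """
    return (total_bytes * fast_fraction / bw_fast
            + total_bytes * (1.0 - fast_fraction) / bw_slow)

# Example: 100 GB of traffic, 400 GB/s fast memory, 100 GB/s slow memory.
slow_only = stream_time(100e9, 0.0, 400e9, 100e9)  # 1.0 s
hybrid = stream_time(100e9, 0.5, 400e9, 100e9)     # 0.625 s
speedup = slow_only / hybrid                       # 1.6x
```

Even this crude model shows why workload placement matters: the benefit of the fast memory is capped by the fraction of traffic it actually serves, an Amdahl-style limit.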
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X, where X = (InfiniBand Verbs, Message Passing Interface, or GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (number 6 on the Top500 list). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training, 2) effect of batch size on scaling efficiency, 3) impact of the MPI library used for No-gRPC approaches, and 4) type and size of DNN architectures. Based on these experiments, we present two key insights: 1) overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-review
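The Allreduce that dominates No-gRPC gradient aggregation is commonly implemented as a ring allreduce: a reduce-scatter phase followed by an allgather phase around a logical ring of ranks. The sketch below simulates that communication pattern in plain Python; it is illustrative of the general algorithm only and unrelated to the paper's CUDA-Aware MPI implementation.

```python
def ring_allreduce(ranks):
    """Simulate ring allreduce: P ranks, each holding its local
    gradient pre-split into P chunks (a list of P lists of numbers).

    Phase 1 (reduce-scatter) leaves rank r with the fully reduced
    chunk (r + 1) % P; phase 2 (allgather) circulates those complete
    chunks so every rank ends with the full element-wise sum. Each
    rank sends 2 * (P - 1) chunks in total, which is what makes the
    ring bandwidth-efficient for large messages.
    """
    P = len(ranks)
    data = [[list(chunk) for chunk in rank] for rank in ranks]
    # Phase 1: reduce-scatter, P - 1 steps around the ring.
    for step in range(P - 1):
        for r in range(P):
            idx = (r - step) % P   # chunk that rank r forwards this step
            dst = (r + 1) % P      # ring neighbour
            data[dst][idx] = [a + b for a, b in
                              zip(data[dst][idx], data[r][idx])]
    # Phase 2: allgather, P - 1 steps forwarding complete chunks.
    for step in range(P - 1):
        for r in range(P):
            idx = (r + 1 - step) % P
            dst = (r + 1) % P
            data[dst][idx] = list(data[r][idx])
    return data
```

With three ranks each holding a three-chunk gradient, every rank finishes with the element-wise sum of all three local gradients, which is exactly the semantics a distributed optimizer needs for averaged gradient updates.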