
    A Survey of Techniques for Improving Security of GPUs

    The graphics processing unit (GPU), although a powerful performance booster, also has many security vulnerabilities. Due to these, the GPU can act as a safe haven for stealthy malware and as the weakest 'link' in the security 'chain'. In this paper, we present a survey of techniques for analyzing and improving GPU security. We classify the works on key attributes to highlight their similarities and differences. More than informing users and researchers about GPU security techniques, this survey aims to increase their awareness of GPU security vulnerabilities and potential countermeasures.

    Energy efficient run-time mapping and thread partitioning of concurrent OpenCL applications on CPU-GPU MPSoCs

    Heterogeneous Multi-Processor Systems-on-Chips (MPSoCs) containing CPU and GPU cores are typically required to execute applications concurrently. However, as will be shown in this paper, existing approaches are not well suited to concurrent applications, as they are developed either by considering only a single application or without exploiting both CPU and GPU cores at the same time. In this paper, we propose an energy-efficient run-time mapping and thread-partitioning approach for executing concurrent OpenCL applications on both CPU and GPU cores while satisfying performance requirements. Depending upon the performance requirements, for each concurrently executing application, the mapping process finds the appropriate number of CPU cores and the operating frequencies of the CPU and GPU cores, and the partitioning process identifies an efficient partitioning of the application's threads between CPU and GPU cores. We validate the proposed approach experimentally on the Odroid-XU3 hardware platform with various mixes of applications from the Polybench benchmark suite. Additionally, a case study is performed with the real-world application SLAMBench. Results show an average energy saving of 32% compared to existing approaches while still satisfying the performance requirements.
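
    The abstract does not include code; purely as an illustration of the thread-partitioning idea (not the authors' implementation), the sketch below splits a one-dimensional OpenCL NDRange between a CPU queue and a GPU queue according to a tuned ratio. The function and parameter names are ours, and it assumes both devices share one OpenCL context and that global_size is a multiple of local_size.

        // Hypothetical sketch: splitting one OpenCL NDRange between CPU and GPU
        // queues by a ratio chosen at run time (e.g., by a mapping/partitioning
        // step). Assumes cpu_q, gpu_q, and kernel were created from one context.
        #include <CL/cl.h>
        #include <cstddef>

        void partitioned_launch(cl_command_queue cpu_q, cl_command_queue gpu_q,
                                cl_kernel kernel, size_t global_size,
                                size_t local_size, double gpu_ratio) {
            // Round the GPU share down to a multiple of the work-group size so
            // both sub-ranges stay divisible by local_size.
            size_t gpu_items = static_cast<size_t>(global_size * gpu_ratio);
            gpu_items -= gpu_items % local_size;
            size_t cpu_items = global_size - gpu_items;

            size_t gpu_offset = 0;
            size_t cpu_offset = gpu_items; // CPU range starts where the GPU's ends

            if (gpu_items > 0)
                clEnqueueNDRangeKernel(gpu_q, kernel, 1, &gpu_offset, &gpu_items,
                                       &local_size, 0, nullptr, nullptr);
            if (cpu_items > 0)
                clEnqueueNDRangeKernel(cpu_q, kernel, 1, &cpu_offset, &cpu_items,
                                       &local_size, 0, nullptr, nullptr);

            // Both devices execute concurrently; synchronize before gathering
            // results.
            clFinish(gpu_q);
            clFinish(cpu_q);
        }

    In the paper's setting, gpu_ratio and the CPU/GPU operating frequencies would be chosen per application by the run-time mapping step so that the performance requirement is met at minimal energy.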

    Mapping parallel programs to heterogeneous multi-core systems

    Heterogeneous computer systems are ubiquitous in all areas of computing, from mobile to high-performance computing. They promise to deliver increased performance at lower energy cost than purely homogeneous, CPU-based systems. In recent years GPU-based heterogeneous systems have become increasingly popular. They combine a programmable GPU with a multi-core CPU. GPUs have become flexible enough to not only handle graphics workloads but also various kinds of general-purpose algorithms. They are thus used as a coprocessor or accelerator alongside the CPU. Developing applications for GPU-based heterogeneous systems involves several challenges. Firstly, not all algorithms are equally suited to GPU computing. It is thus important to carefully map the tasks of an application to the most suitable processor in a system. Secondly, current frameworks for heterogeneous computing, such as OpenCL, are low-level, requiring a thorough understanding of the hardware by the programmer. This high barrier to entry could be lowered by automatically generating and tuning this code from a high-level, more user-friendly programming language. Both challenges are addressed in this thesis. For the task-mapping problem, a machine-learning-based approach is presented. It combines static features of the program code with runtime information on input sizes to predict the optimal mapping of OpenCL kernels. This approach is further extended to also take contention on the GPU into account. Both methods outperform competing mapping approaches by a significant margin. Furthermore, this thesis develops a method for targeting GPU-based heterogeneous systems from OpenMP, a directive-based framework for parallel computing. OpenMP programs are translated to OpenCL and optimized for GPU performance. At runtime a predictive model decides whether to execute the original OpenMP code on the CPU or the generated OpenCL code on the GPU. This approach is shown to outperform both a competing approach and hand-tuned code.
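
    The thesis's actual predictors are not reproduced in the abstract; as a minimal sketch of the general approach, the snippet below combines static code features with runtime input size in a logistic-regression-style decision function. The feature set and weights are illustrative assumptions, not values from the thesis.

        // Minimal sketch of a learned CPU/GPU mapping predictor: static code
        // features combined with runtime information. Feature names and weights
        // are invented for illustration.
        #include <array>
        #include <cmath>

        struct KernelFeatures {
            double mem_ops_per_compute;  // static: memory-to-compute ratio
            double branch_density;       // static: fraction of branch instructions
            double data_transfer_bytes;  // runtime: bytes copied to/from the GPU
            double input_size;           // runtime: problem size
        };

        enum class Device { CPU, GPU };

        Device predict_mapping(const KernelFeatures& f) {
            // In practice the weights would be trained offline on profiled runs
            // of many OpenCL kernels.
            const std::array<double, 4> w = {-1.8, -0.6, -0.9, 1.4}; // illustrative
            const double bias = 0.2;
            double z = bias + w[0] * f.mem_ops_per_compute
                            + w[1] * f.branch_density
                            + w[2] * std::log1p(f.data_transfer_bytes)
                            + w[3] * std::log1p(f.input_size);
            double p_gpu = 1.0 / (1.0 + std::exp(-z)); // probability the GPU wins
            return p_gpu > 0.5 ? Device::GPU : Device::CPU;
        }

    The thesis's contention-aware extension could take the same shape, with an extra feature describing the current load on the GPU.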

    Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures

    With the massive multithreading execution feature, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPU). However, using GPUs to accelerate computation does not always yield good performance improvements. This is mainly due to three inefficiencies in modern GPU and system architectures. First, not all parallel threads have a uniform amount of workload to fully utilize the GPU's computation ability, leading to a sub-optimal performance problem called warp criticality. To mitigate the degree of warp criticality, I propose a Criticality-Aware Warp Acceleration mechanism, called CAWA. CAWA predicts and accelerates critical warp execution by allocating larger execution time slices and additional cache resources to the critical warp. The evaluation results show that with CAWA, GPUs can achieve an average 1.23x speedup. Second, the shared cache storage in GPUs is often insufficient to accommodate the demands of the large number of concurrent threads. As a result, cache thrashing is commonly experienced in the GPU's cache memories, particularly in the L1 data caches. To alleviate the cache contention and thrashing problem, I develop an instruction-aware Control-Loop-Based Adaptive Bypassing algorithm, called Ctrl-C. Ctrl-C learns cache reuse behavior and bypasses a portion of memory requests with the help of feedback control loops. The evaluation results show that Ctrl-C can effectively improve cache utilization in GPUs and achieve an average 1.42x speedup for cache-sensitive GPGPU workloads. Finally, GPU workloads and the co-located processes running on the host chip multiprocessor (CMP) in a heterogeneous system can contend for memory resources at multiple levels, resulting in significant performance degradation. To maximize system throughput and balance the performance degradation of all co-located applications, I design a scalable performance degradation predictor specifically for heterogeneous systems, called HeteroPDP. HeteroPDP predicts application execution times and schedules OpenCL workloads to run on different devices based on the optimization goal. The evaluation results show HeteroPDP can improve system fairness from 24% to 65% when an OpenCL application is co-located with other processes, and gain an additional 50% speedup compared with always offloading the OpenCL workload to GPUs. In summary, this dissertation aims to provide insights for future microarchitecture and system architecture designs by identifying, analyzing, and addressing three critical performance problems in modern GPUs.
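
    As a loose illustration of the feedback-control idea behind Ctrl-C (the dissertation's actual controller is not shown in this abstract), the sketch below adjusts a bypass fraction toward a target L1 hit rate. All constants and names are made up for illustration.

        // Hedged sketch of a control-loop-based bypass policy: a controller
        // nudges the fraction of requests that bypass the L1 toward a hit-rate
        // set point. Constants are illustrative, not from the dissertation.
        struct BypassController {
            double bypass_fraction = 0.0;   // portion of requests sent past the L1
            double target_hit_rate = 0.6;   // set point for the feedback loop
            double gain = 0.05;             // proportional gain

            // Called once per sampling interval with the measured L1 hit rate.
            void update(double observed_hit_rate) {
                // If the cache is thrashing (hit rate below target), bypass more;
                // if it is comfortably hitting, let more requests allocate.
                double error = target_hit_rate - observed_hit_rate;
                bypass_fraction += gain * error;
                if (bypass_fraction < 0.0) bypass_fraction = 0.0;
                if (bypass_fraction > 1.0) bypass_fraction = 1.0;
            }

            // Decide per request; 'hash' stands in for a cheap per-request
            // pseudo-random source available in hardware.
            bool should_bypass(unsigned hash) const {
                return (hash % 100) < static_cast<unsigned>(bypass_fraction * 100);
            }
        };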

    HeTM: Transactional Memory for Heterogeneous Systems

    Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. Unfortunately, though, developing applications that can take full advantage of the potential of heterogeneous systems is a notoriously hard task. This work takes a step towards reducing the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions. Besides introducing the abstract semantics and programming model of HeTM, we present the design and evaluation of a concrete implementation of the proposed abstraction, which we named Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages speculative techniques to hide the inherently large communication latency between CPUs and discrete GPUs and to minimize inter-device synchronization overhead. SHeTM is based on a modular and extensible design that allows for easily integrating alternative TM implementations on the CPU and GPU sides, providing the flexibility to adopt, on either side, the TM implementation (e.g., in hardware or software) that best fits the application's workload and the architectural characteristics of the processing unit. We demonstrate the efficiency of SHeTM via an extensive quantitative study based both on synthetic benchmarks and on a port of a popular object caching system. (Accepted at the 28th International Conference on Parallel Architectures and Compilation Techniques, PACT'19.)
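
    The abstract does not specify SHeTM's API; the following is a purely hypothetical sketch of what the HeTM abstraction could look like from a CPU thread, with invented names (hetm_alloc, hetm_tx_begin, and so on) standing in for whatever interface the real system exposes. Declarations only, to convey the programming model.

        #include <cstddef>

        // Hypothetical handles and operations; not the actual SHeTM API.
        struct hetm_region;                        // memory shared by CPU and GPU
        hetm_region* hetm_alloc(std::size_t bytes);
        void hetm_tx_begin(hetm_region* r);        // start a CPU-side transaction
        void hetm_tx_commit(hetm_region* r);       // validate against GPU activity
        template <typename T> T    hetm_read (hetm_region* r, T* addr);
        template <typename T> void hetm_write(hetm_region* r, T* addr, T value);

        // A CPU thread transfers 'amount' between two accounts in the shared
        // region. Concurrent GPU transactions on the same region would be
        // validated speculatively, with conflicting commits rolled back.
        void transfer(hetm_region* r, long* from, long* to, long amount) {
            hetm_tx_begin(r);
            hetm_write(r, from, hetm_read(r, from) - amount);
            hetm_write(r, to,   hetm_read(r, to)   + amount);
            hetm_tx_commit(r);
        }

    The point of the abstraction is that this code never states where the data lives or when it moves; the runtime hides the CPU-GPU transfer latency behind speculation.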

    Performance Modeling, Performance Tuning, and Quantization for GPU Programs

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Jaejin Lee.
    GPUs have played an important role in solving many scientific problems across different domains. Writing GPU programs might be easy, but writing them efficiently is much more difficult. To achieve the best performance, the compiler and runtime need advanced techniques to compile and run the program efficiently. These techniques should be transparent to programmers, sparing them the burden of knowing many details of the underlying architecture. Among the aspects that most improve the performance of a GPU program, we focus on performance modeling, performance tuning, and quantization. Performance modeling estimates the execution time of a program and is useful for analyzing program characteristics or partitioning the workload in a heterogeneous system. Performance tuning finds the optimal solution in an optimization space within a reasonable time. Quantization reduces the precision needed to execute a program without a significant loss of output accuracy. The proposed techniques can be integrated into GPU compilers and runtimes to make them more efficient.
    Contents:
    1 Introduction
      1.1 Introduction
    2 Performance Modeling
      2.1 Introduction
      2.2 Related Work
      2.3 Background (OpenCL Framework; GPU Architecture; Support Vector Regression)
      2.4 Prerequisites to Efficient Profiling: An Insight into Warp Scheduling
      2.5 Performance Estimation (Linear Model; Model Based on Machine Learning)
      2.6 Evaluation (Evaluation Setup; Performance Estimation Results; The ML-based Model on Different Classes of Kernels; The Performance at Different Saturation Points)
      2.7 Conclusions
    3 Performance Auto-tuning
      3.1 Introduction
      3.2 Related Work
      3.3 OpenCL and GPU Architectures
      3.4 Effects of the Work-group Size (Occupancy; Global Memory Coalescing; Cache Contention; Amount of Work; Work-group Scheduling and Barriers; Benchmark Applications)
      3.5 Auto-tuning Work-group Size (Workload Tuner; Non-coalescing Factor Tuner; Concurrency Tuner; Exhaustive-search Tuner)
      3.6 Evaluation (Overall Tuning Quality; Overall Tuning Cost; Effect of the Workload Tuner; Effect of the Non-coalescing Factor Tuner; Effect of the Concurrency Tuner)
      3.7 Conclusions
    4 Quantization for Deep Learning Programs
      4.1 Introduction
      4.2 Related Work
      4.3 Background (Integer Quantization; Standard Techniques Used)
      4.4 Quantization Framework (Inference Phase; Training Phase; Adding Noise to the Scale; Adaptively Adjusting Precisions; Computation of Histogram)
      4.5 Experiments (Image Classification Tasks; Natural Language Processing)
      4.6 Conclusions
    5 Conclusion
    Acknowledgements
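
    Chapter 4 builds on standard integer quantization; as a minimal sketch of that baseline (the thesis's noise-injection and adaptive-precision techniques are not reproduced here), the snippet below performs symmetric per-tensor int8 quantization with a single scale.

        // Symmetric per-tensor int8 quantization: real value is approximately
        // scale * quantized value. A minimal sketch of the standard technique.
        #include <algorithm>
        #include <cmath>
        #include <cstdint>
        #include <vector>

        struct QuantizedTensor {
            std::vector<std::int8_t> data;
            float scale;
        };

        QuantizedTensor quantize(const std::vector<float>& x) {
            float max_abs = 0.f;
            for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
            // Map [-max_abs, max_abs] onto [-127, 127]; guard against all-zero input.
            float scale = (max_abs > 0.f) ? max_abs / 127.f : 1.f;
            QuantizedTensor q{std::vector<std::int8_t>(x.size()), scale};
            for (std::size_t i = 0; i < x.size(); ++i) {
                int v = static_cast<int>(std::lround(x[i] / scale));
                q.data[i] = static_cast<std::int8_t>(std::clamp(v, -127, 127));
            }
            return q;
        }

        // Recover an approximate float value, e.g., for mixed-precision layers.
        float dequantize(const QuantizedTensor& q, std::size_t i) {
            return q.scale * static_cast<float>(q.data[i]);
        }

    The thesis's contribution sits on top of this baseline: perturbing the scale during training and adjusting per-layer precisions adaptively so that accuracy is retained.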