2,173 research outputs found
Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques
Department of Computer Science and EngineeringAs the performance and energy efficiency requirement of GPGPUs have risen, memory management techniques of GPGPUs have improved to meet the requirements by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower latency and higher bandwidth of the memory. However, these methods do not always guarantee improved performance and energy efficiency due to the small cache size and heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques.
In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management techniques. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes based on a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and the cache indexing latency. We also demonstrate that ACI continues to achieve high performance in various settings.
Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (i.e., 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy efficiency gains over the baseline GPGPU architecture even when enhanced with advanced architectural technologies (e.g., higher capacity, associativity).
Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of a application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on the optimal page allocation ratio, BLPP dynamically allocate pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and state-of-the-art technique (i.e., 13.4% and 16.7%) and performs similar to the static-best version (i.e., 1.2% difference), which requires extensive offline profiling.clos
A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems
Recent technological advances have greatly improved the performance and
features of embedded systems. With the number of just mobile devices now
reaching nearly equal to the population of earth, embedded systems have truly
become ubiquitous. These trends, however, have also made the task of managing
their power consumption extremely challenging. In recent years, several
techniques have been proposed to address this issue. In this paper, we survey
the techniques for managing power consumption of embedded systems. We discuss
the need of power management and provide a classification of the techniques on
several important parameters to highlight their similarities and differences.
This paper is intended to help the researchers and application-developers in
gaining insights into the working of power management techniques and designing
even more efficient high-performance embedded systems of tomorrow
Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips
The trend in industry is towards heterogeneous multicore processors (HMCs),
including chips with CPUs and massively-threaded throughput-oriented processors
(MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the
cores with cache-coherent shared virtual memory (CCSVM), this is not the
communication paradigm used by any current HMC. In this paper, we present a
CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads
programming model, called xthreads, for programming this HMC. Our goal is to
evaluate the potential performance benefits of tightly coupling heterogeneous
cores with CCSVM
Towards a Software Transactional Memory for heterogeneous CPU-GPU processors
The heterogeneous Accelerated Processing Units (APUs) integrate a multi-core CPU and a GPU within the same chip. Modern APUs provide the programmer with platform atomics, used to communicate the CPU cores with the GPU using simple atomic datatypes. However, ensuring consistency for complex data types is a task delegated to programmers, who have to implement a mutual exclusion mechanism. Transactional Memory (TM) is an optimistic approach to implement mutual exclusion. With TM, shared data can be accessed by multiple computing threads speculatively, but changes are only visible if a transaction ends with no conflict with others in its memory accesses. TM has been studied and implemented in software and hardware for both CPU and GPU platforms, but an integrated solution has not been provided for APU processors.
In this paper we present APUTM, a software TM designed to work on heterogeneous APU processors. The design of APUTM focuses on minimizing the access to shared metadata in order to reduce the communication overhead via expensive platform atomics. The main objective of APUTM is to help us understand the tradeoffs of implementing a sofware TM on an heterogeneous CPU-GPU platform and to identify the key aspects to be considered in each device. In our experiments, we compare the adaptability of APUTM to execute in one of the devices (CPU or GPU) or in both of them simultaneously. These experiments show that APUTM is able to outperform sequential execution of the applications.This work has been supported by projects TIN2013-42253-P and TIN2016-80920-R, from the Spanish Government, P11-TIC8144 and P12- TIC1470, from Junta de Andalucía, and Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
Dwarfs on Accelerators: Enhancing OpenCL Benchmarking for Heterogeneous Computing Architectures
For reasons of both performance and energy efficiency, high-performance
computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL
framework supports portable programming across a wide range of computing
devices and is gaining influence in programming next-generation accelerators.
To characterize the performance of these devices across a range of applications
requires a diverse, portable and configurable benchmark suite, and OpenCL is an
attractive programming model for this purpose. We present an extended and
enhanced version of the OpenDwarfs OpenCL benchmark suite, with a strong focus
placed on the robustness of applications, curation of additional benchmarks
with an increased emphasis on correctness of results and choice of problem
size. Preliminary results and analysis are reported for eight benchmark codes
on a diverse set of architectures -- three Intel CPUs, five Nvidia GPUs, six
AMD GPUs and a Xeon Phi.Comment: 10 pages, 5 figure
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
Starting from a high-level problem description in terms of partial
differential equations using abstract tensor notation, the Chemora framework
discretizes, optimizes, and generates complete high performance codes for a
wide range of compute architectures. Chemora extends the capabilities of
Cactus, facilitating the usage of large-scale CPU/GPU systems in an efficient
manner for complex applications, without low-level code tuning. Chemora
achieves parallelism through MPI and multi-threading, combining OpenMP and
CUDA. Optimizations include high-level code transformations, efficient loop
traversal strategies, dynamically selected data and instruction cache usage
strategies, and JIT compilation of GPU code tailored to the problem
characteristics. The discretization is based on higher-order finite differences
on multi-block domains. Chemora's capabilities are demonstrated by simulations
of black hole collisions. This problem provides an acid test of the framework,
as the Einstein equations contain hundreds of variables and thousands of terms.Comment: 18 pages, 4 figures, accepted for publication in Scientific
Programmin
- …