24,330 research outputs found

    Performance and power comparisons between Fermi and Cypress GPUs

    Get PDF
    In recent years, modern graphics processing units have been widely adopted in high performance computing areas to solve large scale computation problems. The leading GPU manufacturers Nvidia and AMD have introduced series of products to the market. While sharing many similar design concepts, GPUs from these two manufacturers differ in several aspects on processor cores and the memory subsystem. In this work, we conduct a comprehensive study to characterize and compare the architectural features of Nvidia’s Fermi and AMD’s Cypress GPUs. We first investigate the performance and power consumptions of an AMD Cypress GPU. By employing a rigorous statistical model to analyze the execution behaviors of representative general-purpose GPU (GPGPU) applications, we conduct insightful investigations on the target GPU architecture. Our results demonstrate that the GPU execution throughput and the power dissipation are dependent on different architectural variables. Furthermore, we design a set of micro-benchmarks to study the power consumption features of different function units on the GPU. Based on those results, we derive instructive principles that can guide the design of power-efficient high performance computing systems. We then make the concentration shift to the Nvidia Fermi GPU and compare it with the product from AMD. Our results indicate that these two products have diverse advantages that are reflected in their performance for different sets of applications. In addition, we also compare the energy efficiencies of these two platforms since power/energy consumption is a major concern in the high performance computing system

    Intelligent scheduling for simultaneous CPU-GPU applications

    Get PDF
    Heterogeneous computing systems with both general purpose multicore central processing units (CPU) and specialized accelerators has emerged recently. Graphics processing unit (GPU) is the most widely used accelerator. To fully utilize such a heterogeneous system’s full computing power, coordination between the two distinct devices, CPU and GPU, is necessary. Previous research has addressed this issue of partitioning the workloads between CPU and GPU from various aspects for regular applications which have high parallelism and little data dependent control flows. However, it is still not clear how irregular applications, which behave differently on different inputs, could be efficiently scheduled on such heterogeneous computing systems. Since CPUs and GPUs have different characteristics, task chunks of these irregular applications show preference, or affinity, to a particular device in heterogeneous computing systems. In this work, we show that by using the method of allocating workloads at task chunk granularity based on each chunk’s device affinity, accompanied with work-stealing as the load balancing mechanism, we can achieve a performance improvement of as much as 1.5x over traditional ratio-based allocation, and up to 5x over naive GPU-only allocation on three irregular graph analytics applications

    A Review on Software Architectures for Heterogeneous Platforms

    Full text link
    The increasing demands for computing performance have been a reality regardless of the requirements for smaller and more energy efficient devices. Throughout the years, the strategy adopted by industry was to increase the robustness of a single processor by increasing its clock frequency and mounting more transistors so more calculations could be executed. However, it is known that the physical limits of such processors are being reached, and one way to fulfill such increasing computing demands has been to adopt a strategy based on heterogeneous computing, i.e., using a heterogeneous platform containing more than one type of processor. This way, different types of tasks can be executed by processors that are specialized in them. Heterogeneous computing, however, poses a number of challenges to software engineering, especially in the architecture and deployment phases. In this paper, we conduct an empirical study that aims at discovering the state-of-the-art in software architecture for heterogeneous computing, with focus on deployment. We conduct a systematic mapping study that retrieved 28 studies, which were critically assessed to obtain an overview of the research field. We identified gaps and trends that can be used by both researchers and practitioners as guides to further investigate the topic

    A Modeling Approach based on UML/MARTE for GPU Architecture

    Get PDF
    Nowadays, the High Performance Computing is part of the context of embedded systems. Graphics Processing Units (GPUs) are more and more used in acceleration of the most part of algorithms and applications. Over the past years, not many efforts have been done to describe abstractions of applications in relation to their target architectures. Thus, when developers need to associate applications and GPUs, for example, they find difficulty and prefer using API for these architectures. This paper presents a metamodel extension for MARTE profile and a model for GPU architectures. The main goal is to specify the task and data allocation in the memory hierarchy of these architectures. The results show that this approach will help to generate code for GPUs based on model transformations using Model Driven Engineering (MDE).Comment: Symposium en Architectures nouvelles de machines (SympA'14) (2011

    Architecture-Aware Optimization on a 1600-core Graphics Processor

    Get PDF
    The graphics processing unit (GPU) continues to make significant strides as an accelerator in commodity cluster computing for high-performance computing (HPC). For example, three of the top five fastest supercomputers in the world, as ranked by the TOP500, employ GPUs as accelerators. Despite this increasing interest in GPUs, however, optimizing the performance of a GPU-accelerated compute node requires deep technical knowledge of the underlying architecture. Although significant literature exists on how to optimize GPU performance on the more mature NVIDIA CUDA architecture, the converse is true for OpenCL on the AMD GPU. Consequently, we present and evaluate architecture-aware optimizations for the AMD GPU. The most prominent optimizations include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU optimizations by applying each optimization in isolation as well as in concert to a large-scale, molecular modeling application called GEM. Via these AMD-specific GPU optimizations, the AMD Radeon HD 5870 GPU delivers 65% better performance than with the wellknown NVIDIA-specific optimizations
    • …
    corecore