360 research outputs found

Exploration Into The Performance Of Asymmetric D-Ary Heap-Based Algorithms For The HSA Architecture

Author: Adams Stephen
Publication venue: eGrove
Publication date: 01/01/2014

State of the art baseband DSP platforms for Software Defined Radio: A survey

Author: Claudio Brunelli
Fabio Garzia
Heikki Berg
Jari Nurmi
Omer Anjum
Tapani Ahonen
Publication venue: Springer Nature
Publication date: 01/01/2011

Software Defined Radio (SDR) is an innovative approach that is becoming an increasingly promising technology for future mobile handsets. Universities and industry have introduced several embedded-systems proposals to support SDR applications. This article presents an overview of current platforms and analyzes the related architectural choices, the current issues in SDR, and potential future trends. Peer reviewed.

Springer - Publisher Connector

HiHGNN: Accelerating HGNNs through Parallelism and Data Reusability Exploitation

Author: Fan Dongrui
Han Dengke
Kim John
Li Wenming
Tang Zhimin
Wang Duo
Xue Runzhen
Yan Mingyu
Yang Xiaocheng
Ye Xiaochun
Zou Mo
Publication date: 24/07/2023

Heterogeneous graph neural networks (HGNNs) have emerged as powerful algorithms for processing heterogeneous graphs (HetGs), which are widely used in many critical fields. To capture both structural and semantic information in HetGs, HGNNs first aggregate the neighboring feature vectors for each vertex in each semantic graph and then fuse the aggregated results across all semantic graphs for each vertex. Unfortunately, existing graph neural network accelerators are ill-suited to accelerating HGNNs, because they fail to handle the specific execution patterns efficiently and to exploit the high degree of parallelism and data reusability within and across the processing of semantic graphs. In this work, we first quantitatively characterize a set of representative HGNN models on GPU to disclose the execution bound of each stage, the inter-semantic-graph parallelism, and the inter-semantic-graph data reusability in HGNNs. Guided by our findings, we propose a high-performance HGNN accelerator, HiHGNN, to alleviate the execution bound and exploit the newfound parallelism and data reusability in HGNNs. Specifically, we first propose a bound-aware stage-fusion methodology tailored to HGNN acceleration, which fuses and pipelines the execution stages with awareness of their execution bounds. Second, we design an independency-aware parallel execution design to exploit the inter-semantic-graph parallelism. Finally, we present a similarity-aware execution scheduling to exploit the inter-semantic-graph data reusability. Compared to the state-of-the-art software framework running on NVIDIA GPU T4 and GPU A100, HiHGNN achieves average speedups of 41.5× and 8.6×, respectively, as well as 106× and 73× higher energy efficiency, with a quarter of the memory bandwidth of GPU A100.
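The two-stage pattern described in the abstract (per-semantic-graph neighbor aggregation, then fusion across semantic graphs) can be illustrated with a minimal Python sketch. This is not the HiHGNN design or any specific HGNN model: the function names are hypothetical, and plain mean aggregation and mean fusion are assumed here in place of the learned operators that real models use.

```python
# Toy sketch of the aggregate-then-fuse pattern; names and the use of
# plain averaging are assumptions for illustration only.

def aggregate(features, adjacency):
    """Mean of the neighbors' feature vectors for each vertex of ONE semantic graph."""
    out = {}
    for v, neighbors in adjacency.items():
        dim = len(features[v])
        if not neighbors:
            out[v] = [0.0] * dim  # isolated vertex: zero vector
            continue
        out[v] = [
            sum(features[u][d] for u in neighbors) / len(neighbors)
            for d in range(dim)
        ]
    return out


def fuse(per_graph_results):
    """Average each vertex's aggregated vectors across all semantic graphs."""
    fused = {}
    vertices = set().union(*per_graph_results)
    for v in vertices:
        vecs = [r[v] for r in per_graph_results if v in r]
        fused[v] = [sum(col) / len(vecs) for col in zip(*vecs)]
    return fused


def hgnn_layer(features, semantic_graphs):
    # Each aggregate() call depends only on its own semantic graph, so the
    # calls are independent of one another (the inter-semantic-graph
    # parallelism the abstract refers to).
    return fuse([aggregate(features, g) for g in semantic_graphs])


features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}
graph_a = {0: [1, 2], 1: [0], 2: [0]}  # one semantic graph (adjacency lists)
graph_b = {0: [1], 1: [2], 2: [1]}     # another semantic graph, same vertices
result = hgnn_layer(features, [graph_a, graph_b])
```

Both semantic graphs read the same `features` table, which is the kind of shared data the abstract's reusability argument concerns.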

arXiv.org e-Print Archive

Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

Author: Brunie Nicolas
Collange Caroline
Diamos Gregory
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 21/11/2011

Single-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize the cost of instruction fetch, decode and control logic over multiple execution units. When individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution. We present two complementary techniques that mitigate the impact of thread divergence on SIMT micro-architectures. Both techniques relax the SIMD execution model by allowing two distinct instructions, instead of a single one, to be scheduled to disjoint subsets of the same row of execution units. They increase flexibility by providing more thread grouping opportunities than SIMD, while preserving the affinity between threads to avoid introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads return to lockstep execution at control-flow reconvergence points without hindering their ability to run branches in parallel. We propose a lane shuffling technique to allow solution (2) to benefit from inter-warp correlations in divergence patterns. The combination of all these techniques improves performance by 23% on a set of regular GPGPU applications and by 40% on irregular applications, while maintaining the same instruction-fetch and processing-unit resource requirements as the contemporary Fermi GPU architecture.
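The cost of divergence, and the benefit of co-issuing two paths of the same warp, can be shown with a deliberately simplified Python model. It is not the mechanism from the paper: the function names are hypothetical, and it assumes one instruction per path per cycle and ignores fetch, dependencies, memory, and reconvergence overhead.

```python
# Toy issue-slot model of one warp executing an if/else region.
# Assumptions (illustrative only): every instruction takes one cycle and
# the two paths can be co-issued on disjoint lanes with no overhead.

def issue_cycles(taken_mask, then_len, else_len, co_issue=False):
    """Cycles needed to issue both paths for a single warp."""
    n_then = sum(taken_mask)
    n_else = len(taken_mask) - n_then
    a = then_len if n_then else 0  # skip a path no thread takes
    b = else_len if n_else else 0
    # Baseline SIMT serializes the paths; co-issue overlaps them.
    return max(a, b) if co_issue else a + b


def lane_utilization(taken_mask, then_len, else_len, co_issue=False):
    """Fraction of lane-cycles that do useful work."""
    cycles = issue_cycles(taken_mask, then_len, else_len, co_issue)
    if cycles == 0:
        return 0.0
    n_then = sum(taken_mask)
    n_else = len(taken_mask) - n_then
    useful = n_then * then_len + n_else * else_len
    return useful / (cycles * len(taken_mask))


mask = [True] * 3 + [False] * 5                      # 8-lane warp, 3 lanes take the branch
serialized = issue_cycles(mask, 4, 4)                 # paths run one after the other
overlapped = issue_cycles(mask, 4, 4, co_issue=True)  # paths share the row of lanes
```

In this idealized case the serialized warp needs 8 cycles at 50% lane utilization, while co-issue needs 4 cycles at full utilization. Real gains depend on factors the model leaves out.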

HAL-ENS-LYON

INRIA a CCSD electronic archive server

Hal-Diderot