Search CORE

1,132 research outputs found

CD-Xbar : a converge-diverge crossbar network for high-performance GPUs

Author: Eeckhout Lieven
Jerger Natalie Enright
Ma Sheng
Wang Zhiying
Zhao Xia
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

Modern GPUs feature an increasing number of streaming multiprocessors (SMs) to boost system throughput. How to construct an efficient and scalable network-on-chip (NoC) for future high-performance GPUs is particularly critical. Although a mesh network is a widely used NoC topology in manycore CPUs for scalability and simplicity reasons, it is ill-suited to GPUs because of the many-to-few-to-many traffic pattern observed in GPU-compute workloads. Although a crossbar NoC is a natural fit, it does not scale to large SM counts while operating at high frequency. In this paper, we propose the converge-diverge crossbar (CD-Xbar) network with round-robin routing and topology-aware concurrent thread array (CTA) scheduling. CD-Xbar consists of two types of crossbars, a local crossbar and a global crossbar. A local crossbar converges input ports from the SMs into so-called converged ports; the global crossbar diverges these converged ports to the last-level cache (LLC) slices and memory controllers. CD-Xbar provides routing path diversity through the converged ports. Round-robin routing and topology-aware CTA scheduling balance network traffic among the converged ports within a local crossbar and across crossbars, respectively. Compared to a mesh with the same bisection bandwidth, CD-Xbar reduces NoC active silicon area and power consumption by 52.5 and 48.5 percent, respectively, while at the same time improving performance by 13.9 percent on average. CD-Xbar performs within 2.9 percent of an idealized fully-connected crossbar. We further demonstrate CD-Xbar's scalability, flexibility and improved performance perWatt (by 17.1 percent) over state-of-the-art GPU NoCs which are highly customized and non-scalable

Ghent University Academic Bibliography

The future of computing beyond Moore's Law.

Author: Shalf John
Publication venue: eScholarship, University of California
Publication date: 01/03/2020
Field of study

Moore's Law is a techno-economic model that has enabled the information technology industry to double the performance and functionality of digital electronics roughly every 2 years within a fixed cost, power and area. Advances in silicon lithography have enabled this exponential miniaturization of electronics, but, as transistors reach atomic scale and fabrication costs continue to rise, the classical technological driver that has underpinned Moore's Law for 50 years is failing and is anticipated to flatten by 2025. This article provides an updated view of what a post-exascale system will look like and the challenges ahead, based on our most recent understanding of technology roadmaps. It also discusses the tapering of historical improvements, and how it affects options available to continue scaling of successors to the first exascale machine. Lastly, this article covers the many different opportunities and strategies available to continue computing performance improvements in the absence of historical technology drivers. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'

Ezid

eScholarship - University of California

Simulating heterogeneous behaviours in complex systems on GPUs

Author: Cai
Chakroun
Chen
Chen
Collins
Di Salvo
Diamos
Fung
Gebhart
Han
Holcombe
Jog
Kayiran
Khorasani
Kirk
Laville
Laville
Lee
Lee
Liang
Luke
Lysenko
Meng
Mozhgan K. Chimeh
Narasiman
Nickolls
Paul Richmond
Perumalla
Reynolds
Rhu
Richmond
Richmond
Richmond
Rogers
Rogers
Shen
Siebers
Sun
Tisue
Walker
Xiang
Publication venue: 'Elsevier BV'
Publication date: 01/04/2018
Field of study

Agent Based Modelling (ABM) is an approach for modelling dynamic systems and studying complex and emergent behaviour. ABMs have been widely applied in diverse disciplines including biology, economics, and social sciences. The scalability of ABM simulations is typically limited due to the computationally expensive nature of simulating a large number of individuals. As such, large scale ABM simulations are excellent candidates to apply parallel computing approaches such as Graphics Processing Units (GPUs). In this paper, we present an extension to the FLAME GPU 1 [1] framework which addresses the divergence problem, i.e. the challenge of executing the behaviour of non-homogeneous individuals on vectorised GPU processors. We do this by describing a modelling methodology which exposes inherent parallelism within the model which is exploited by novel additions to the software permitting higher levels of concurrent simulation execution. Moreover, we demonstrate how this extension can be applied to realistic cellular level tissue model by benchmarking the model to demonstrate a measured speedup of over 4x

Crossref

White Rose Research Online

Simple out of order core for GPGPUs

Author: Arnau Montañés José María
González Colás Antonio María
Huerta Gañán Rodrigo
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2023
Field of study

GPU architectures have become popular for executing general-purpose programs which rely on having a large number of threads that run concurrently to hide the latency among dependent instructions. This approach has an important cost/overhead in terms of low data locality due to the increased pressure on the memory hierarchy of the many threads being run concurrently and the extra cost of storing and managing the on-chip state of those many threads. This paper presents SOCGPU (Simple Out-of-order Core for GPU), a simple out-of-order execution mechanism that does not require register renaming nor scoreboards. It uses a small Instruction Buffer and a tiny Dependence matrix to keep track of dependencies among instructions and avoid data hazards. Evaluations for an Nvidia Tesla V100-like GPU show that SOCGPU provides a speed-up of up to 2.3 in some machine learning programs and 1.38 on average for a variety of benchmarks, while it reduces energy consumption by 6.5%, with only 2.4% area overhead.This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, and the ICREA Academia program.Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC

A study of recent contributions on simulation tools for Network-on-Chip (NoC)

Author: Opoku Agyeman Michael
Saleh Alalaki Muthana
Publication venue
Publication date
Field of study

The growth in the number of Intellectual Properties (IPs) or the number of cores on the same chip becomes a critical issue in System-on-Chip (SoC) due to the intra-communication problem between the chip elements. As a result, Network-on-Chip (NoC) has emerged as a new system architecture to overcome intra-communication issues. New approaches and methodologies have been developed by many researchers to improve NoC. Also, many NoC simulation tools have been proposed and adopted by both academia and industry. This paper presents a study of recent contributions on simulation tools for NoC. Furthermore, an overview of NoC is covered as well as a comparison between some NoC simulators to help facilitate research in on-chip communication

NECTAR