1,132 research outputs found

    CD-Xbar : a converge-diverge crossbar network for high-performance GPUs

    Get PDF
    Modern GPUs feature an increasing number of streaming multiprocessors (SMs) to boost system throughput. How to construct an efficient and scalable network-on-chip (NoC) for future high-performance GPUs is particularly critical. Although a mesh network is a widely used NoC topology in manycore CPUs for scalability and simplicity reasons, it is ill-suited to GPUs because of the many-to-few-to-many traffic pattern observed in GPU-compute workloads. Although a crossbar NoC is a natural fit, it does not scale to large SM counts while operating at high frequency. In this paper, we propose the converge-diverge crossbar (CD-Xbar) network with round-robin routing and topology-aware concurrent thread array (CTA) scheduling. CD-Xbar consists of two types of crossbars, a local crossbar and a global crossbar. A local crossbar converges input ports from the SMs into so-called converged ports; the global crossbar diverges these converged ports to the last-level cache (LLC) slices and memory controllers. CD-Xbar provides routing path diversity through the converged ports. Round-robin routing and topology-aware CTA scheduling balance network traffic among the converged ports within a local crossbar and across crossbars, respectively. Compared to a mesh with the same bisection bandwidth, CD-Xbar reduces NoC active silicon area and power consumption by 52.5 and 48.5 percent, respectively, while at the same time improving performance by 13.9 percent on average. CD-Xbar performs within 2.9 percent of an idealized fully-connected crossbar. We further demonstrate CD-Xbar's scalability, flexibility and improved performance perWatt (by 17.1 percent) over state-of-the-art GPU NoCs which are highly customized and non-scalable

    The future of computing beyond Moore's Law.

    Get PDF
    Moore's Law is a techno-economic model that has enabled the information technology industry to double the performance and functionality of digital electronics roughly every 2 years within a fixed cost, power and area. Advances in silicon lithography have enabled this exponential miniaturization of electronics, but, as transistors reach atomic scale and fabrication costs continue to rise, the classical technological driver that has underpinned Moore's Law for 50 years is failing and is anticipated to flatten by 2025. This article provides an updated view of what a post-exascale system will look like and the challenges ahead, based on our most recent understanding of technology roadmaps. It also discusses the tapering of historical improvements, and how it affects options available to continue scaling of successors to the first exascale machine. Lastly, this article covers the many different opportunities and strategies available to continue computing performance improvements in the absence of historical technology drivers. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'

    Simulating heterogeneous behaviours in complex systems on GPUs

    Get PDF
    Agent Based Modelling (ABM) is an approach for modelling dynamic systems and studying complex and emergent behaviour. ABMs have been widely applied in diverse disciplines including biology, economics, and social sciences. The scalability of ABM simulations is typically limited due to the computationally expensive nature of simulating a large number of individuals. As such, large scale ABM simulations are excellent candidates to apply parallel computing approaches such as Graphics Processing Units (GPUs). In this paper, we present an extension to the FLAME GPU 1 [1] framework which addresses the divergence problem, i.e. the challenge of executing the behaviour of non-homogeneous individuals on vectorised GPU processors. We do this by describing a modelling methodology which exposes inherent parallelism within the model which is exploited by novel additions to the software permitting higher levels of concurrent simulation execution. Moreover, we demonstrate how this extension can be applied to realistic cellular level tissue model by benchmarking the model to demonstrate a measured speedup of over 4x

    Simple out of order core for GPGPUs

    Get PDF
    GPU architectures have become popular for executing general-purpose programs which rely on having a large number of threads that run concurrently to hide the latency among dependent instructions. This approach has an important cost/overhead in terms of low data locality due to the increased pressure on the memory hierarchy of the many threads being run concurrently and the extra cost of storing and managing the on-chip state of those many threads. This paper presents SOCGPU (Simple Out-of-order Core for GPU), a simple out-of-order execution mechanism that does not require register renaming nor scoreboards. It uses a small Instruction Buffer and a tiny Dependence matrix to keep track of dependencies among instructions and avoid data hazards. Evaluations for an Nvidia Tesla V100-like GPU show that SOCGPU provides a speed-up of up to 2.3 in some machine learning programs and 1.38 on average for a variety of benchmarks, while it reduces energy consumption by 6.5%, with only 2.4% area overhead.This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, and the ICREA Academia program.Peer ReviewedPostprint (author's final draft

    A study of recent contributions on simulation tools for Network-on-Chip (NoC)

    Get PDF
    The growth in the number of Intellectual Properties (IPs) or the number of cores on the same chip becomes a critical issue in System-on-Chip (SoC) due to the intra-communication problem between the chip elements. As a result, Network-on-Chip (NoC) has emerged as a new system architecture to overcome intra-communication issues. New approaches and methodologies have been developed by many researchers to improve NoC. Also, many NoC simulation tools have been proposed and adopted by both academia and industry. This paper presents a study of recent contributions on simulation tools for NoC. Furthermore, an overview of NoC is covered as well as a comparison between some NoC simulators to help facilitate research in on-chip communication
    • …
    corecore