Fast behavioural RTL simulation of 10B transistor SoC designs with Metro-MPI

Author: Armejach Sanosa Adrià
Balkind Jonathan
Li Brian
López Paradís Guillem
Moreto Planas Miquel
Wallentowitz Stefan
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2023

Chips with tens of billions of transistors have become today's norm. These designs are straining our electronic design automation tools throughout the design process, requiring ever more computational resources. In many tools, parallelisation has improved both latency and throughput for the designer's benefit. However, tools largely remain restricted to a single machine, and in the case of RTL simulation we believe that this leaves much potential performance on the table. We introduce Metro-MPI to improve RTL simulation for modern 10 billion transistor-scale chips. Metro-MPI exploits the natural boundaries present in chip designs to partition RTL simulations and leverages High Performance Computing (HPC) techniques to extract parallelism. For chip designs that scale in size by exploiting latency-insensitive interfaces like networks-on-chip and AXI, Metro-MPI offers a new paradigm for RTL simulation scalability. Our implementation of Metro-MPI in OpenPiton+Ariane delivers 2.7 MIPS of RTL simulation throughput for the first time on a design with more than 10 billion transistors and 1,024 Linux-capable cores, opening new avenues for distributed RTL simulation of emerging system-on-chip designs. Compared to sequential and multithreaded RTL simulations of smaller designs, Metro-MPI achieves speedups of up to 135.98× and 9.29×, respectively. Similarly, for a representative regression run, Metro-MPI reduces energy consumption by up to 2.53× and 2.91×.

This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (contract PID2019-107255GB-C21), by the Generalitat de Catalunya (contract 2017-SGR-1328), by the European Union within the framework of the ERDF of Catalonia 2014-2020 under the DRAC project [001-P-001723], and by the Arm-BSC Center of Excellence. G. Lopez-Paradís has been supported by the Generalitat de Catalunya through a FI fellowship 2021FI-B00994 and GSoC 2021, and M. Moreto by a Ramon y Cajal fellowship no. RYC-2016-21104. A. Armejach is a Serra Hunter Fellow.

Peer reviewed. Postprint (author's final draft).

UPCommons. Portal del coneixement obert de la UPC

AN ARCHITECTURE EVALUATION AND IMPLEMENTATION OF A SOFT GPGPU FOR FPGAs

Author: Andryc Kevin
Publication venue: ScholarWorks@UMass Amherst
Publication date: 25/10/2018

Embedded and mobile systems must be able to execute a variety of different types of code, often with minimal available hardware. Many embedded systems now come with a simple processor and an FPGA, but not more energy-hungry components, such as a GPGPU. In this dissertation we present FlexGrip, a soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. The architecture is optimized for FPGA implementation to effectively support the conditional and thread-based execution characteristics of GPGPU execution without FPGA design recompilation. This architecture supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU. Our architecture is customizable, thus providing the FPGA designer with a selection of GPGPU cores which display performance versus area tradeoffs. This dissertation describes the FlexGrip architecture in detail and showcases the benefits by evaluating the design for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of 23x, on average, versus a MicroBlaze microprocessor are achieved for designs which take advantage of the conditional execution capabilities offered by FlexGrip. We also show FlexGrip can achieve an 80% average reduction of dynamic energy versus the MicroBlaze microprocessor. The dissertation further explores application-customized versions of the soft GPGPU, thus exploiting the overlay architecture. We expand the architecture to multiple processors per GPGPU and optimize away features which are not needed by certain classes of applications. These optimizations, which include the effective use of block RAMs and DSP blocks, are critical to the performance of FlexGrip. By implementing a two-GPGPU design, we show speedups of 44x on average versus a MicroBlaze microprocessor. Application-customized versions of the soft GPGPU can be used to further reduce dynamic energy consumption by an average of 14%. To complete this thesis, we augmented a cycle-accurate GPGPU simulator to emulate FlexGrip and evaluate different levels of cache design spaces. We show performance increases for select benchmarks. However, 64% and 45% of benchmarks exhibited performance decreases when the L1D cache was enabled for the 1-SMP and 2-SMP configurations, respectively, and only one benchmark showed a performance improvement when the L2 cache was enabled.

ScholarWorks@UMass Amherst

Synergistic Timing Speculation for Multi-Threaded Programs

Author: Yasin Atif
Publication venue: DigitalCommons@USU
Publication date: 01/05/2016

Timing speculation is a promising approach to increasing processor performance and energy efficiency. Under timing speculation, an integrated circuit is allowed to operate at a speed faster than its slowest path, the critical path. It is based on the empirical observation, presented later in the thesis, that these critical path delays are rarely manifested during program execution. Consequently, as long as the processor is equipped with an error detection and recovery mechanism, its performance can be increased and/or its energy consumption reduced beyond that achievable by any other conventional operation. While many past works have dealt with timing speculation within a single core, this work uncovers a new direction: timing speculation for a multi-core processor executing a parallel, multi-threaded application. Through a rigorous cross-layer circuit-architectural analysis, it is observed that during the execution of a multi-threaded program, there is a significant variation in circuit delay characteristics across different threads. Synergistic Timing Speculation (SynTS) is proposed to exploit this variation (heterogeneity) in path sensitization delays, to jointly optimize the energy and execution time of the many-core processor. In particular, SynTS uses a sampling-based online error probability estimation technique, coupled with a polynomial-time algorithm, to optimally determine the voltage, frequency and the amount of timing speculation for each thread. The experimental analysis is presented for three pipe stages, namely Decode, SimpleALU and ComplexALU, with a reduction in Energy Delay Product of up to 26%, 25% and 7.5% respectively, compared to an existing per-core timing speculation scheme. The analysis also includes a case study for a General Purpose Graphics Processing Unit.

DigitalCommons@USU

Tackling Choke Point Induced Performance Bottlenecks in a Near-Threshold GPGPU

Author: Shabanian Tahmoures
Publication venue: DigitalCommons@USU
Publication date: 01/08/2018

Over the last decade, General Purpose Graphics Processing Units (GPGPUs) have garnered substantial attention in the research community due to their extensive thread-level parallelism. GPGPUs provide a remarkable performance improvement over Central Processing Units (CPUs) for highly parallel applications. However, GPGPUs typically achieve this extensive thread-level parallelism at the cost of high power consumption. Consequently, Near-Threshold Computing (NTC) provides a promising opportunity for designing energy-efficient GPGPUs (NTC-GPUs). However, NTC-GPUs suffer from a crucial Process Variation (PV)-inflicted performance bottleneck, called a Choke Point. A Choke Point is defined as one gate or a small group of gates affected by PV. A Choke Point is capable of varying the path delay of a circuit and causing different forms of timing violation. In this work, a cross-layer design technique is proposed to tackle the performance impediments caused by choke points in NTC-GPUs.

DigitalCommons@USU

Simty: generalized SIMT execution on RISC-V

Author: Collange Caroline
Publication venue: HAL CCSD
Publication date: 14/10/2017

We present Simty, a massively multi-threaded RISC-V processor core that acts as a proof of concept for dynamic inter-thread vectorization at the micro-architecture level. Simty runs groups of scalar threads executing SPMD code in lockstep, and assembles SIMD instructions dynamically across threads. Unlike existing SIMD or SIMT processors like GPUs or vector processors, Simty vectorizes scalar general-purpose binaries. It does not involve any instruction set extension or compiler change. Simty is described in synthesizable RTL. An FPGA prototype validates its scaling up to 2048 threads per core with 32-wide SIMD units. Simty provides an open platform for research on GPU micro-architecture, on hybrid CPU-GPU micro-architecture, or on heterogeneous platforms with throughput-optimized cores.

INRIA a CCSD electronic archive server
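
The lockstep execution described in the Simty abstract can be illustrated with a toy model. This is a minimal sketch and not Simty's actual micro-architecture: the instruction names (`op`, `br_odd`), the four-instruction program, and the policy of always issuing for the threads at the lowest program counter are all assumptions made for illustration. Each step groups the scalar threads that share a program counter and issues a single instruction for the whole group.

```python
# Toy model of dynamic inter-thread vectorization (hypothetical, not Simty's RTL).
# Threads that sit at the same program counter are executed together as one
# SIMD-style issue; divergent threads are served in separate issues.

# Hypothetical program: ("op",) is any scalar instruction,
# ("br_odd", target) makes odd-numbered threads jump to `target`.
PROGRAM = [
    ("op",),         # 0: executed by every thread
    ("br_odd", 3),   # 1: odd threads skip instruction 2
    ("op",),         # 2: even threads only
    ("op",),         # 3: threads reconverge here
]


def run(program, num_threads):
    """Return the number of active threads for each issued instruction."""
    pcs = [0] * num_threads
    widths = []
    while any(pc < len(program) for pc in pcs):
        # Group the still-running threads by their program counter.
        live = {}
        for tid, pc in enumerate(pcs):
            if pc < len(program):
                live.setdefault(pc, []).append(tid)
        # Assumed policy: serve the group with the lowest program counter.
        pc = min(live)
        active = live[pc]
        instr = program[pc]
        for tid in active:
            if instr[0] == "br_odd" and tid % 2 == 1:
                pcs[tid] = instr[1]
            else:
                pcs[tid] = pc + 1
        widths.append(len(active))
    return widths


widths = run(PROGRAM, num_threads=4)
print("active threads per issue:", widths)
print("issues:", len(widths), "scalar instructions covered:", sum(widths))
```

In this sketch the four threads diverge at the branch, the even threads run alone for one issue, and all threads share an issue again at the reconvergence point, so fewer issues are needed than the total number of scalar instructions executed.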

An extended model to support detailed GPGPU reliability analysis

Author: Du B.
Reorda M. S.
Rodriguez Condia Josie Esteban
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

General Purpose Graphics Processing Units (GPGPUs) have been used in recent decades as accelerators in highly demanding data processing applications, such as multimedia processing and high-performance computing. Nowadays, these devices are becoming popular even in safety-critical applications, such as autonomous and semi-autonomous vehicles. However, these devices can suffer from the effects of transient faults, such as those produced by radiation. These effects can be represented in the system as Single Event Upsets (SEUs) and can generate intolerable application misbehaviors in safety-critical environments. In this work, we extended the capabilities of an open-source VHDL GPGPU model (FlexGrip) in order to study and analyze in much greater detail the effects of SEUs in some critical modules within a GPGPU. Simulation results showed that the scheduler controller has different levels of SEU sensitivity depending on the affected location. Moreover, a reduced number of execution units in the GPGPU can decrease system reliability.

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

A Configurable Shared Scratchpad Memory for GPU-like Processors

Author: Cilardo A.
Donnarumma C.
Gagliardi M.

In recent years, Field Programmable Gate Arrays and Graphics Processing Units have become increasingly important for high-performance computing. In particular, a number of industrial solutions and academic projects are proposing design frameworks based on FPGA-implemented GPU-like compute units. Existing GPU-like core projects provide limited hardware support for shared scratchpad memory and particularly for the problem of bank conflicts, a major source of performance loss with many parallel kernels. In this paper, we present a configurable scratchpad memory oriented to GPU-like processors, with built-in support for bank remapping. The core is fully synthesizable on FPGA with a contained hardware cost. We also validated the presented architecture with a cycle-accurate event-driven emulator written in C++ as well as an RTL simulator tool. Finally, we demonstrated the impact of bank remapping and other parameters available with the proposed configurable shared scratchpad memory by evaluating the performance of two real-world parallelized kernels.

Università degli Studi di Napoli Federico II Open Archive
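
The bank-conflict problem named in the scratchpad abstract can be made concrete with a small sketch. The abstract does not say which remapping function the authors use, so the XOR-based mapping below is a commonly used scheme chosen as an assumption, as are the bank count of 16 and the two access patterns. The sketch counts how many lanes hit the same bank in one access: a value of 1 means no conflict, and a value of 16 means the access is fully serialized.

```python
from collections import Counter

NUM_BANKS = 16  # assumed bank count, not taken from the paper


def modulo_bank(addr):
    """Baseline mapping: consecutive words go to consecutive banks."""
    return addr % NUM_BANKS


def xor_bank(addr):
    """Assumed remapping: XOR the row index into the bank selection."""
    return (addr ^ (addr // NUM_BANKS)) % NUM_BANKS


def conflict_degree(addresses, bank_of):
    """Largest number of simultaneous accesses that land in one bank."""
    counts = Counter(bank_of(addr) for addr in addresses)
    return max(counts.values())


# One word address per lane. A column walk uses a stride equal to the bank
# count, which is the worst case for the baseline mapping.
column_access = [lane * NUM_BANKS for lane in range(NUM_BANKS)]
row_access = list(range(NUM_BANKS))

for name, pattern in (("row", row_access), ("column", column_access)):
    print(
        name,
        "modulo:", conflict_degree(pattern, modulo_bank),
        "xor:", conflict_degree(pattern, xor_bank),
    )
```

With these assumed parameters, the column walk collides on a single bank under the modulo mapping and spreads across all banks under the XOR mapping, while the unit-stride row access is conflict-free in both cases.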

HeteroCore GPU to exploit TLP-resource diversity

Author: Eeckhout Lieven
Wang Zhiying
Zhao Xia
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2019

Ghent University Academic Bibliography

Three-Dimensional Processing-In-Memory-Architectures: A Holistic Tool For Modeling And Simulation

Author: Siegl Patrick Daniel Marcus
Publication date: 01/01/2018

The steadily widening performance gap between processor and memory architectures, commonly known as the Memory Wall, requires novel concepts to achieve further scaling in processing performance. As memories were identified as the limitation within a Von Neumann architecture, this work addresses this constraint. Although three-dimensional memories alleviate the effects of the Memory Wall, such memories alone are insufficient for future scaling. Due to its higher efficiency, the integration of processing capacity into memories (Processing-In-Memory, PIM) is a promising alternative. However, there is a lack of PIM simulation models. Therefore, a flexible simulation tool for three-dimensional stacked memories was created and then extended to model three-dimensional PIM architectures. This tool can simulate stacked memories such as Hybrid Memory Cube in a standard-compliant manner and at the same time offers high accuracy by modeling at the level of elementary data packets (FLITs) in combination with the hardware-validated BOBSim simulator. In addition, a specifically designed simulation clock tree enables fast simulation. Measurements show a 100x speedup in functional mode, whereas a 2x speedup is achieved in clock-cycle-accurate mode.

With the aid of a specifically implemented, binary-compatible GPU accelerator and the established tool, the modeling of a complete three-dimensional PIM architecture is demonstrated within this work. The maximum hardware resources were constrained by a PIM accelerator from the literature. A representative, memory-bound geophysical imaging algorithm was used to evaluate the GPU simulation model on its own as well as the combined PIM simulation model. Considered alone, the GPU simulation model shows a significantly improved simulation speed with a deviation of 6% compared to a Verilator model. Subsequently, various configurations of the integrated PIM accelerator were evaluated. Depending on the chosen configuration, the algorithm achieves either up to 140 GFLOPS of actual processing performance or a maximum computing efficiency of 30% (synthetic) or 24.5% (real). The latter is a 2x improvement over the state of the art. A concluding discussion examines the results in depth.

Digitale Bibliothek Braunschweig