Design and implementation of configuration modules in a programmable hardware-assisted cache emulator (PHA$E)
Memory hierarchy design is becoming more important as the speed gap between processor and memory continues to grow. Investigations of memory performance have typically been conducted using trace-driven emulation, which can require tremendous resources (e.g., long emulation time, large storage requirements for traces, and high overall cost). Recent research has proposed the use of hardware for performing cache emulations. Such an approach is advantageous because it runs in real time, which eliminates the need for large trace storage, reduces emulation time, and improves the accuracy of the results. This thesis discusses preliminary work on the Programmable Hardware-Assisted Cache Emulator (PHA$E). PHA$E is implemented using Field Programmable Gate Array (FPGA) chips, so it is flexible and configurable. Configuration modules extend PHA$E to make it capable of emulating off-chip shared level 3 caches with varying sizes and set-associativities. Furthermore, emulation results from SPEC benchmarks (SPECcpu2000 and SPECjAppServer2002 [13]) and a large vocabulary continuous speech recognition (LVCSR) system [24] are presented to verify the functionality of PHA$E. Lastly, future research directions are identified.
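The abstract gives no implementation detail, but the kind of cache PHA$E emulates is easy to illustrate in software. Below is a minimal sketch, not the thesis's actual design, of a configurable set-associative cache with LRU replacement; the class and parameter names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of a configurable set-associative cache model with
// LRU replacement. Names and structure are illustrative, not PHA$E's.
class CacheModel {
public:
    // cache_bytes and line_bytes must be powers of two.
    CacheModel(uint64_t cache_bytes, uint64_t line_bytes, unsigned ways)
        : ways_(ways),
          sets_(cache_bytes / line_bytes / ways),
          line_bits_(log2u(line_bytes)),
          tags_(sets_ * ways, 0),
          valid_(sets_ * ways, false),
          lru_(sets_ * ways, 0),
          tick_(0) {}

    // Returns true on a hit; on a miss, fills the LRU way of the set.
    bool access(uint64_t addr) {
        uint64_t line = addr >> line_bits_;
        uint64_t set  = line % sets_;
        uint64_t tag  = line / sets_;
        unsigned victim = 0;
        uint64_t oldest = UINT64_MAX;
        for (unsigned w = 0; w < ways_; ++w) {
            size_t i = set * ways_ + w;
            if (valid_[i] && tags_[i] == tag) { lru_[i] = ++tick_; return true; }
            if (lru_[i] < oldest) { oldest = lru_[i]; victim = w; }
        }
        size_t i = set * ways_ + victim;  // miss: replace the LRU way
        tags_[i] = tag; valid_[i] = true; lru_[i] = ++tick_;
        return false;
    }

private:
    static unsigned log2u(uint64_t v) { unsigned b = 0; while (v >>= 1) ++b; return b; }
    unsigned ways_;
    uint64_t sets_;
    unsigned line_bits_;
    std::vector<uint64_t> tags_;
    std::vector<bool> valid_;
    std::vector<uint64_t> lru_;
    uint64_t tick_;
};
```

Sweeping cache_bytes and ways over such a model against an address trace is exactly what trace-driven simulation does; PHA$E's contribution is performing the equivalent bookkeeping in FPGA hardware in real time.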
Makinote: An FPGA-Based HW/SW Platform for Pre-Silicon Emulation of RISC-V Designs
Emulating chip functionality before silicon production is crucial, especially
with the increasing prevalence of RISC-V-based designs. FPGAs are promising candidates for such purposes due to their high speed and reconfigurable architecture. In this paper, we introduce Makinote, an FPGA-based cluster platform hosted at the Barcelona Supercomputing Center (BSC-CNS), which is composed of a large number of FPGAs (in total, 96 AMD/Xilinx Alveo U55C) to emulate massive RTL designs (up to 750M ASIC cells). In addition, we introduce our FPGA shell, a tool that facilitates the utilization of such a large FPGA cluster with minimal effort from the designers. The proposed FPGA shell provides an easy-to-use interface for RTL developers to rapidly port their designs onto several FPGAs by automatically connecting them to the necessary ports, e.g., PCIe Gen4, DRAM (DDR4 and HBM), and ETH 10G/100G. Moreover, specific drivers for exploiting RISC-V-based architectures are provided within the set of tools associated with the FPGA shell. We release the tool online for further extensions.
We validate the efficiency of our hardware platform (i.e., the FPGA cluster) and the software tool (i.e., the FPGA shell) by emulating a RISC-V processor and running the HPC Challenge application on 32 FPGAs. Our results demonstrate that performance improves by 8x over the single-FPGA case.
Comment: 7 pages, 5 figures; presented at Rapid Simulation and Performance Evaluation for Design 2024 (RAPIDO24) and published in the ACM Proceedings of Rapid Simulation and Performance Evaluation for Design.
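As a quick sanity check on the reported scaling (my arithmetic, not a figure from the paper): an 8x speedup on 32 FPGAs corresponds to a parallel efficiency of \(E = S/N = 8/32 = 25\%\), i.e., each FPGA delivers about a quarter of its single-FPGA throughput once the design is scaled out.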
HyperFPGA: SoC-FPGA Cluster Architecture for Supercomputing and Scientific Applications
Since their inception, supercomputers have addressed problems that far exceed those of a single computing device.
Modern supercomputers are made up of tens of thousands of CPUs and GPUs in racks that are interconnected via elaborate, often ad hoc networks.
These large facilities provide scientists with unprecedented and ever-growing computing power capable of tackling more complex and larger problems.
In recent years, the most powerful supercomputers have already reached megawatt power consumption levels, an important issue that challenges sustainability and shows that this trend cannot be maintained.
With more pressure on energy efficiency, an alternative to traditional architectures is needed.
Reconfigurable hardware, such as FPGAs, has repeatedly been shown to offer substantial advantages over the traditional supercomputing approach with respect to performance and power consumption.
In fact, several works that advanced the field of heterogeneous supercomputing using FPGAs are described in this thesis \cite{survey-2002}.
Each cluster and its architectural characteristics can be studied from three interconnected domains: network, hardware, and software tools, resulting in intertwined challenges that designers must take into account.
The classification and study of these architectures illustrate the trade-offs of the solutions and help identify open problems and research lines, which in turn serve as inspiration and background for the HyperFPGA.
In this thesis, the HyperFPGA cluster is presented as a way to build scalable SoC-FPGA platforms to explore new architectures for improved performance and energy efficiency in high-performance computing, focusing on flexibility and openness.
The HyperFPGA is a modular platform based on a system-on-module (SoM) that includes power-monitoring tools and high-speed general-purpose interconnects to offer a high degree of flexibility and introspection.
By exploiting the reconfigurability and programmability offered by the HyperFPGA infrastructure, which combines FPGAs and CPUs with high-speed general-purpose connectors, novel computing paradigms can be implemented.
A custom Linux OS and drivers, along with a custom script for hardware definition, provide a uniform interface from application to platform for a programmable framework that integrates existing tools.
The development environment is demonstrated using the N-Queens problem, which is a classic benchmark for evaluating the performance of parallel computing systems.
Overall, the results of the HyperFPGA on the N-Queens problem highlight the platform's ability to handle computationally intensive tasks and demonstrate its suitability for supercomputing experiments.
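The abstract does not reproduce the N-Queens kernel; as a point of reference, here is a minimal sequential backtracking counter, a software baseline of my own, not the HyperFPGA implementation. The usual parallelization, and the one a cluster exploits naturally, is to fix the queen in the first row and count each subtree independently.

```cpp
#include <cstdint>
#include <iostream>

// Count N-Queens solutions by bitmask backtracking. cols/diag1/diag2
// mark columns and diagonals already attacked in the current row.
static uint64_t solve(int n, uint32_t cols, uint32_t diag1, uint32_t diag2) {
    if (cols == (1u << n) - 1) return 1;  // all rows placed
    uint64_t count = 0;
    uint32_t free_cells = ~(cols | diag1 | diag2) & ((1u << n) - 1);
    while (free_cells) {
        uint32_t bit = free_cells & (~free_cells + 1);  // lowest free column
        free_cells ^= bit;
        count += solve(n, cols | bit, (diag1 | bit) << 1, (diag2 | bit) >> 1);
    }
    return count;
}

int main() {
    int n = 12;
    // Independent subproblems (one per first-row column) can be farmed
    // out to separate workers -- or, in HyperFPGA's setting, FPGAs.
    uint64_t total = 0;
    for (int c = 0; c < n; ++c) {
        uint32_t bit = 1u << c;
        total += solve(n, bit, bit << 1, bit >> 1);
    }
    std::cout << n << "-queens solutions: " << total << "\n";  // 14200 for n=12
}
```

Because the subtrees share no state, the per-column loop is embarrassingly parallel, which is what makes N-Queens a convenient benchmark for a multi-FPGA development environment.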
Performance Debugging Frameworks for FPGA High-Level Synthesis
Using high-level synthesis (HLS) tools for field-programmable gate array (FPGA) design is becoming an increasingly popular choice because HLS tools can generate a high-quality design in a short development time. However, current HLS tools still cannot adequately support users in understanding and fixing the performance issues of a design; that is, they lack performance debugging capability. Previous work on performance debugging automates the insertion of hardware monitors at the register-transfer level (RTL), which limits the comprehensibility of the obtained results. Instead, our HLS-based flows offer analysis at the function or loop level and provide more intuitive feedback that can be used to pinpoint the performance bottleneck of a design. In this dissertation, we present a collection of HLS-based debugging frameworks for various purposes and design characteristics. First, we address a problem in the HLS synthesis step, where an inaccurate cycle estimate is produced if the program has input-dependent behavior. We propose a new performance estimator that automatically instruments code to model the hardware execution behavior and interprets information from the HLS software simulation. However, the estimate from this flow may be inaccurate for a class of designs that cannot be simulated correctly by existing HLS software simulators. To handle such cases, we propose a new software simulator that provides cycle-accurate results based on the HLS scheduling information. If no input dataset is available for software simulation, or high-level models do not exist for all components of the FPGA design, we also present an on-board monitoring flow for automated cycle extraction and stall analysis. Finally, we address the need of HLS programmers to automatically find the best set of directives for FPGA designs. We propose a design space exploration (DSE) framework to optimize applications with variable loop bounds in the Polybench benchmark suite. A quantitative comparison among the proposed frameworks is shown using the sparse matrix-vector multiplication benchmark.
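To make the estimator idea concrete, here is a minimal sketch of my own, not the dissertation's tool. The standard latency model for a pipelined HLS loop is depth + II x (trips - 1), where II is the initiation interval and depth the pipeline depth; input-dependent trip counts, which the HLS tool cannot know statically, can be recovered by instrumenting the software model with counters.

```cpp
#include <cstdint>
#include <vector>

// Per-loop schedule facts exported by an HLS tool (values below are made up).
struct LoopSchedule {
    uint64_t ii;     // initiation interval: cycles between iteration starts
    uint64_t depth;  // pipeline depth of one iteration
};

// Counter bumped by instrumentation inserted into the software model.
static uint64_t trip_count = 0;

// Input-dependent loop: the trip count depends on where `key` occurs,
// so only an instrumented software simulation can count iterations.
static int find_first(const std::vector<int>& v, int key) {
    for (size_t i = 0; i < v.size(); ++i) {
        ++trip_count;  // instrumentation
        if (v[i] == key) return static_cast<int>(i);
    }
    return -1;
}

// Pipelined-loop latency model: depth + II * (trips - 1).
static uint64_t estimate_cycles(const LoopSchedule& s, uint64_t trips) {
    return trips == 0 ? 0 : s.depth + s.ii * (trips - 1);
}

int main() {
    std::vector<int> data = {7, 3, 9, 4, 1};
    find_first(data, 9);                        // runs 3 iterations
    LoopSchedule sched{/*ii=*/1, /*depth=*/5};  // hypothetical schedule
    uint64_t cycles = estimate_cycles(sched, trip_count);
    // cycles == 5 + 1 * (3 - 1) == 7 for this input
    (void)cycles;
}
```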
Active Buffer Development in the CBM Experiment
The requirements on the data acquisition (DAQ) system of the CBM experiment at GSI, with a data rate of 1 TB/s and an event rate of 100 kHz, are very high and pose a challenge even in comparison with other high-energy physics experiments. During data taking, an active buffer is therefore employed, which supports the construction of the data structures for event processing by pre-sorting data fragments and transferring them intelligently to the host computer. The project requires a modular framework, and this work covers the development, verification, and testing of FPGA modules for efficient data transfer, buffering, and reconfiguration, as well as software for the automatic transformation of HDL descriptions. The central components of the active buffer are a powerful FPGA for data-flow control and a DDR2 SDRAM module with a capacity of 512 MB. A special access scheme allows the memory module, together with the FPGA-internal memory elements, to be operated as a large, high-performance FIFO. Data transfer from the buffer to the PC is handled by a dedicated DMA unit attached to the PCIe core in the FPGA. The two DMA channels operate with scatter-gather support and achieve 543 MB/s for transfers to the PC and 790 MB/s in the opposite direction. The transfer of the time stamps ("epoch markers"), which is important for pre-sorting, is likewise handled by a DMA channel. Verification is an important stage in the development of a large FPGA application such as the active buffer, so the HDL modules implementing the PCI Express transaction-layer functions were verified with a number of different simulation environments. On this basis, improvements to the functionality can be implemented quickly and reliably, ensuring consistent further development. Owing to the typical PC architecture, the PCIe unit in the FPGA must already be operational during the boot process, whereas the actual active-buffer function only needs to be available together with the corresponding application software. Strict modularization combined with dynamic partial reconfiguration (DPR) allows the buffer function to be changed at runtime. A further reason for using DPR is the licensing conditions of the PCIe core implementation on Virtex-4 FPGAs. DPR can be used with the Virtex-4, -5, and -6 FPGA families within Xilinx's PlanAhead software. In this project, DPR is used in the manner of a general coprocessor, with the FPGA configuration reloaded via PCIe and the FPGA's internal configuration access port (ICAP). To use DPR at high clock speeds, the interface logic between the static and dynamic modules must meet special requirements. Since manually adapting existing modules to these requirements is laborious and error-prone, the program "Logro" was developed, which automatically transforms HDL descriptions by means of a special pipeline restructuring so that the DPR requirements are met. Good results were achieved with Logro V1.0, and they are presented here.
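The abstract mentions scatter-gather DMA without detail. As a generic illustration, not the CBM active buffer's actual descriptor format, a scatter-gather engine walks a chain of descriptors, each pointing at one physically contiguous buffer fragment, so a single logical transfer can fill non-contiguous host memory:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Generic scatter-gather descriptor: one entry per physically
// contiguous buffer fragment. The field layout is illustrative only.
struct SgDescriptor {
    uint64_t phys_addr;  // physical address of the fragment
    uint32_t length;     // bytes in this fragment
    bool     last;       // end-of-chain flag
};

// Software model of what the DMA engine does in hardware: walk the
// chain and move `length` bytes per descriptor until `last` is set.
uint64_t run_dma_chain(const std::vector<SgDescriptor>& chain) {
    uint64_t total = 0;
    for (const auto& d : chain) {
        // (hardware would issue PCIe writes to d.phys_addr here)
        total += d.length;
        if (d.last) break;
    }
    return total;
}

int main() {
    // Three non-contiguous 4 KiB pages gathered into one logical transfer.
    std::vector<SgDescriptor> chain = {
        {0x10000000, 4096, false},
        {0x10003000, 4096, false},
        {0x10007000, 4096, true},
    };
    std::cout << "transferred " << run_dma_chain(chain) << " bytes\n";
}
```

Scatter-gather support is what lets the reported 543 MB/s and 790 MB/s transfers proceed without the host first copying fragments into one contiguous staging buffer.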