Design and implementation of configuration modules in a programmable hardware-assisted cache emulator (PHA$E)
Memory hierarchy design is becoming more important as the speed gap between processor and memory continues to grow. Investigations of memory performance have typically been conducted using trace-driven emulation, which can require tremendous resources (e.g., long emulation time, large storage requirements for traces, and high overall cost). Recent research has proposed the use of hardware for performing cache emulations. Such an approach is advantageous because it runs in real time, which eliminates the need for large trace storage, reduces emulation time, and improves the accuracy of the results. This thesis discusses preliminary work on the Programmable Hardware-Assisted Cache Emulator (PHA$E). PHA$E is implemented using Field Programmable Gate Array (FPGA) chips, so it is flexible and configurable. Configuration modules extend PHA$E to make it capable of emulating off-chip shared level 3 caches with varying sizes and set-associativities. Furthermore, emulation results from SPEC benchmarks (SPECcpu2000 and SPECjAppServer2002 [13]) and a large vocabulary continuous speech recognition (LVCSR) system [24] are presented to verify the functionality of PHA$E. Lastly, future research directions are identified.
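The abstract gives no implementation detail, but the kind of cache PHA$E emulates is easy to illustrate in software. Below is a minimal sketch, not the thesis's actual design, of a configurable set-associative cache with LRU replacement; the class and parameter names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of a configurable set-associative cache model with
// LRU replacement. Names and structure are illustrative, not PHA$E's.
class CacheModel {
public:
    // cache_bytes and line_bytes must be powers of two.
    CacheModel(uint64_t cache_bytes, uint64_t line_bytes, unsigned ways)
        : ways_(ways),
          sets_(cache_bytes / line_bytes / ways),
          line_bits_(log2u(line_bytes)),
          tags_(sets_ * ways, 0),
          valid_(sets_ * ways, false),
          lru_(sets_ * ways, 0),
          tick_(0) {}

    // Returns true on a hit; on a miss, fills the LRU way of the set.
    bool access(uint64_t addr) {
        uint64_t line = addr >> line_bits_;
        uint64_t set  = line % sets_;
        uint64_t tag  = line / sets_;
        unsigned victim = 0;
        uint64_t oldest = UINT64_MAX;
        for (unsigned w = 0; w < ways_; ++w) {
            size_t i = set * ways_ + w;
            if (valid_[i] && tags_[i] == tag) { lru_[i] = ++tick_; return true; }
            if (lru_[i] < oldest) { oldest = lru_[i]; victim = w; }
        }
        size_t i = set * ways_ + victim;  // miss: replace the LRU way
        tags_[i] = tag; valid_[i] = true; lru_[i] = ++tick_;
        return false;
    }

private:
    static unsigned log2u(uint64_t v) { unsigned b = 0; while (v >>= 1) ++b; return b; }
    unsigned ways_;
    uint64_t sets_;
    unsigned line_bits_;
    std::vector<uint64_t> tags_;
    std::vector<bool> valid_;
    std::vector<uint64_t> lru_;
    uint64_t tick_;
};
```

Sweeping cache_bytes and ways over such a model against an address trace is exactly what trace-driven simulation does; PHA$E's contribution is performing the equivalent bookkeeping in FPGA hardware in real time.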
Makinote: An FPGA-Based HW/SW Platform for Pre-Silicon Emulation of RISC-V Designs
Emulating chip functionality before silicon production is crucial, especially
with the increasing prevalence of RISC-V-based designs. FPGAs are promising candidates for such purposes due to their high speed and reconfigurable architecture. In this paper, we introduce Makinote, an FPGA-based cluster platform hosted at the Barcelona Supercomputing Center (BSC-CNS), which is composed of a large number of FPGAs (in total, 96 AMD/Xilinx Alveo U55C) to emulate massive RTL designs (up to 750M ASIC cells). In addition, we introduce our FPGA shell, a tool that facilitates the utilization of such a large FPGA cluster with minimal effort from the designers. The proposed FPGA shell provides an easy-to-use interface for RTL developers to rapidly port their designs onto several FPGAs by automatically connecting them to the necessary ports, e.g., PCIe Gen4, DRAM (DDR4 and HBM), and ETH 10G/100G. Moreover, specific drivers for exploiting RISC-V-based architectures are provided within the set of tools associated with the FPGA shell. We release the tool online for further extensions.
We validate the efficiency of our hardware platform (i.e., the FPGA cluster) and the software tool (i.e., the FPGA shell) by emulating a RISC-V processor and running the HPC Challenge application on 32 FPGAs. Our results demonstrate that performance improves by 8x over the single-FPGA case.
Comment: 7 pages, 5 figures; presented at Rapid Simulation and Performance Evaluation for Design 2024 (RAPIDO24) and published in the ACM Proceedings of Rapid Simulation and Performance Evaluation for Design.
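As a quick sanity check on the reported scaling (my arithmetic, not a figure from the paper): an 8x speedup on 32 FPGAs corresponds to a parallel efficiency of \(E = S/N = 8/32 = 25\%\), i.e., each FPGA delivers about a quarter of its single-FPGA throughput once the design is scaled out.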
HyperFPGA: SoC-FPGA Cluster Architecture for Supercomputing and Scientific Applications
Since their inception, supercomputers have addressed problems that far exceed those of a single computing device.
Modern supercomputers are made up of tens of thousands of CPUs and GPUs in racks that are interconnected via elaborate, often ad hoc networks.
These large facilities provide scientists with unprecedented and ever-growing computing power capable of tackling more complex and larger problems.
In recent years, the most powerful supercomputers have already reached megawatt power consumption levels, an important issue that challenges sustainability and shows that this trend cannot be maintained.
With more pressure on energy efficiency, an alternative to traditional architectures is needed.
Reconfigurable hardware, such as FPGAs, has repeatedly been shown to offer substantial advantages over the traditional supercomputing approach with respect to performance and power consumption.
In fact, several works that advanced the field of heterogeneous supercomputing using FPGAs are described in this thesis \cite{survey-2002}.
Each cluster and its architectural characteristics can be studied from three interconnected domains: network, hardware, and software tools, resulting in intertwined challenges that designers must take into account.
The classification and study of these architectures illustrate the trade-offs of the solutions and help identify open problems and research lines, which in turn serve as inspiration and background for the HyperFPGA.
In this thesis, the HyperFPGA cluster is presented as a way to build scalable SoC-FPGA platforms to explore new architectures for improved performance and energy efficiency in high-performance computing, focusing on flexibility and openness.
The HyperFPGA is a modular platform based on a system-on-module (SoM) that includes power-monitoring tools and high-speed general-purpose interconnects to offer a high degree of flexibility and introspection.
By exploiting the reconfigurability and programmability offered by the HyperFPGA infrastructure, which combines FPGAs and CPUs with high-speed general-purpose connectors, novel computing paradigms can be implemented.
A custom Linux OS and drivers, along with a custom script for hardware definition, provide a uniform interface from application to platform for a programmable framework that integrates existing tools.
The development environment is demonstrated using the N-Queens problem, which is a classic benchmark for evaluating the performance of parallel computing systems.
Overall, the results of the HyperFPGA on the N-Queens problem highlight the platform's ability to handle computationally intensive tasks and demonstrate its suitability for supercomputing experiments.
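The abstract does not reproduce the N-Queens kernel; as a point of reference, here is a minimal sequential backtracking counter, a software baseline of my own, not the HyperFPGA implementation. The usual parallelization, and the one a cluster exploits naturally, is to fix the queen in the first row and count each subtree independently.

```cpp
#include <cstdint>
#include <iostream>

// Count N-Queens solutions by bitmask backtracking. cols/diag1/diag2
// mark columns and diagonals already attacked in the current row.
static uint64_t solve(int n, uint32_t cols, uint32_t diag1, uint32_t diag2) {
    if (cols == (1u << n) - 1) return 1;  // all rows placed
    uint64_t count = 0;
    uint32_t free_cells = ~(cols | diag1 | diag2) & ((1u << n) - 1);
    while (free_cells) {
        uint32_t bit = free_cells & (~free_cells + 1);  // lowest free column
        free_cells ^= bit;
        count += solve(n, cols | bit, (diag1 | bit) << 1, (diag2 | bit) >> 1);
    }
    return count;
}

int main() {
    int n = 12;
    // Independent subproblems (one per first-row column) can be farmed
    // out to separate workers -- or, in HyperFPGA's setting, FPGAs.
    uint64_t total = 0;
    for (int c = 0; c < n; ++c) {
        uint32_t bit = 1u << c;
        total += solve(n, bit, bit << 1, bit >> 1);
    }
    std::cout << n << "-queens solutions: " << total << "\n";  // 14200 for n=12
}
```

Because the subtrees share no state, the per-column loop is embarrassingly parallel, which is what makes N-Queens a convenient benchmark for a multi-FPGA development environment.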
Performance Debugging Frameworks for FPGA High-Level Synthesis
Using high-level synthesis (HLS) tools for field-programmable gate array (FPGA) design is becoming an increasingly popular choice because HLS tools can generate a high-quality design in a short development time. However, current HLS tools still cannot adequately support users in understanding and fixing the performance issues of a design; that is, they lack performance debugging capability. Previous work on performance debugging automates the insertion of hardware monitors at the register-transfer level (RTL), which limits the comprehensibility of the obtained results. Instead, our HLS-based flows offer analysis at the function or loop level and provide more intuitive feedback that can be used to pinpoint the performance bottleneck of a design. In this dissertation, we present a collection of HLS-based debugging frameworks for various purposes and design characteristics. First, we address a problem in the HLS synthesis step, where an inaccurate cycle estimate is produced if the program has input-dependent behavior. We propose a new performance estimator that automatically instruments code to model the hardware execution behavior and interprets information from the HLS software simulation. However, the estimate from this flow may be inaccurate for a class of designs that cannot be simulated correctly by existing HLS software simulators. To handle such cases, we propose a new software simulator that provides cycle-accurate results based on the HLS scheduling information. If no input dataset is available for software simulation, or high-level models do not exist for all components of the FPGA design, we also present an on-board monitoring flow for automated cycle extraction and stall analysis. Finally, we address the need of HLS programmers to automatically find the best set of directives for FPGA designs. We propose a design space exploration (DSE) framework to optimize applications with variable loop bounds in the Polybench benchmark suite. A quantitative comparison among the proposed frameworks is shown using the sparse matrix-vector multiplication benchmark.
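To make the estimator idea concrete, here is a minimal sketch of my own, not the dissertation's tool. The standard latency model for a pipelined HLS loop is depth + II x (trips - 1), where II is the initiation interval and depth the pipeline depth; input-dependent trip counts, which the HLS tool cannot know statically, can be recovered by instrumenting the software model with counters.

```cpp
#include <cstdint>
#include <vector>

// Per-loop schedule facts exported by an HLS tool (values below are made up).
struct LoopSchedule {
    uint64_t ii;     // initiation interval: cycles between iteration starts
    uint64_t depth;  // pipeline depth of one iteration
};

// Counter bumped by instrumentation inserted into the software model.
static uint64_t trip_count = 0;

// Input-dependent loop: the trip count depends on where `key` occurs,
// so only an instrumented software simulation can count iterations.
static int find_first(const std::vector<int>& v, int key) {
    for (size_t i = 0; i < v.size(); ++i) {
        ++trip_count;  // instrumentation
        if (v[i] == key) return static_cast<int>(i);
    }
    return -1;
}

// Pipelined-loop latency model: depth + II * (trips - 1).
static uint64_t estimate_cycles(const LoopSchedule& s, uint64_t trips) {
    return trips == 0 ? 0 : s.depth + s.ii * (trips - 1);
}

int main() {
    std::vector<int> data = {7, 3, 9, 4, 1};
    find_first(data, 9);                        // runs 3 iterations
    LoopSchedule sched{/*ii=*/1, /*depth=*/5};  // hypothetical schedule
    uint64_t cycles = estimate_cycles(sched, trip_count);
    // cycles == 5 + 1 * (3 - 1) == 7 for this input
    (void)cycles;
}
```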
Active Buffer Development in the CBM Experiment
The requirements on the data acquisition (DAQ) system of the CBM experiment at GSI, with a data rate of 1 TB/s and an event rate of 100 kHz, are very high and pose a challenge even in comparison with other high-energy physics experiments. During data taking, an active buffer is therefore employed, which supports the construction of the data structures for event processing by pre-sorting data fragments and transferring them intelligently to the host computer. The project requires a modular framework, and this work covers the development, verification, and testing of FPGA modules for efficient data transfer, buffering, and reconfiguration, as well as software for the automatic transformation of HDL descriptions. The central components of the active buffer are a powerful FPGA for data-flow control and a DDR2 SDRAM module with a capacity of 512 MB. A special access scheme allows the memory module, together with the FPGA-internal memory elements, to be operated as a large, high-performance FIFO. Data transfer from the buffer to the PC is handled by a dedicated DMA unit attached to the PCIe core in the FPGA. The two DMA channels operate with scatter-gather support and achieve 543 MB/s for transfers to the PC and 790 MB/s in the opposite direction. The transfer of the time stamps ("epoch markers"), which is important for pre-sorting, is likewise handled by a DMA channel. Verification is an important stage in the development of a large FPGA application such as the active buffer, so the HDL modules implementing the PCI Express transaction-layer functions were verified with a number of different simulation environments. On this basis, improvements to the functionality can be implemented quickly and reliably, ensuring consistent further development. Owing to the typical PC architecture, the PCIe unit in the FPGA must already be operational during the boot process, whereas the actual active-buffer function only needs to be available together with the corresponding application software. Strict modularization combined with dynamic partial reconfiguration (DPR) allows the buffer function to be changed at runtime. A further reason for using DPR is the licensing conditions of the PCIe core implementation on Virtex-4 FPGAs. DPR can be used with the Virtex-4, -5, and -6 FPGA families within Xilinx's PlanAhead software. In this project, DPR is used in the manner of a general coprocessor, with the FPGA configuration reloaded via PCIe and the FPGA's internal configuration access port (ICAP). To use DPR at high clock speeds, the interface logic between the static and dynamic modules must meet special requirements. Since manually adapting existing modules to these requirements is laborious and error-prone, the program "Logro" was developed, which automatically transforms HDL descriptions by means of a special pipeline restructuring so that the DPR requirements are met. Good results were achieved with Logro V1.0, and they are presented here.
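The abstract mentions scatter-gather DMA without detail. As a generic illustration, not the CBM active buffer's actual descriptor format, a scatter-gather engine walks a chain of descriptors, each pointing at one physically contiguous buffer fragment, so a single logical transfer can fill non-contiguous host memory:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Generic scatter-gather descriptor: one entry per physically
// contiguous buffer fragment. The field layout is illustrative only.
struct SgDescriptor {
    uint64_t phys_addr;  // physical address of the fragment
    uint32_t length;     // bytes in this fragment
    bool     last;       // end-of-chain flag
};

// Software model of what the DMA engine does in hardware: walk the
// chain and move `length` bytes per descriptor until `last` is set.
uint64_t run_dma_chain(const std::vector<SgDescriptor>& chain) {
    uint64_t total = 0;
    for (const auto& d : chain) {
        // (hardware would issue PCIe writes to d.phys_addr here)
        total += d.length;
        if (d.last) break;
    }
    return total;
}

int main() {
    // Three non-contiguous 4 KiB pages gathered into one logical transfer.
    std::vector<SgDescriptor> chain = {
        {0x10000000, 4096, false},
        {0x10003000, 4096, false},
        {0x10007000, 4096, true},
    };
    std::cout << "transferred " << run_dma_chain(chain) << " bytes\n";
}
```

Scatter-gather support is what lets the reported 543 MB/s and 790 MB/s transfers proceed without the host first copying fragments into one contiguous staging buffer.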