274 research outputs found

    High performance communication on reconfigurable clusters

    Get PDF
    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the us/day range with a commodity cloud

    Automated Hardware Prototyping for 3D Network on Chips

    Get PDF
    Vor mehr als 50 Jahren stellte IntelÂź MitbegrĂŒnder Gordon Moore eine Prognose zum Entwicklungsprozess der Transistortechnologie auf. Er prognostizierte, dass sich die Zahl der Transistoren in integrierten Schaltungen alle zwei Jahre verdoppeln wird. Seine Aussage ist immer noch gĂŒltig, aber ein Ende von Moores Gesetz ist in Sicht. Mit dem Ende von Moore’s Gesetz mĂŒssen neue Aspekte untersucht werden, um weiterhin die Leistung von integrierten Schaltungen zu steigern. Zwei mögliche AnsĂ€tze fĂŒr "More than Moore” sind 3D-Integrationsverfahren und heterogene Systeme. Gleichzeitig entwickelt sich ein Trend hin zu Multi-Core Prozessoren, basierend auf Networks on chips (NoCs). Neben dem Ende des Mooreschen Gesetzes ergeben sich bei immer kleiner werdenden TechnologiegrĂ¶ĂŸen, vor allem jenseits der 60 nm, neue Herausforderungen. Eine Schwierigkeit ist die WĂ€rmeableitung in großskalierten integrierten Schaltkreisen und die daraus resultierende Überhitzung des Chips. Um diesem Problem in modernen Multi-Core Architekturen zu begegnen, muss auch die Verlustleistung der Netzwerkressourcen stark reduziert werden. Diese Arbeit umfasst eine durch Hardware gesteuerte Kombination aus Frequenzskalierung und Power Gating fĂŒr 3D On-Chip Netzwerke, einschließlich eines FPGA Prototypen. DafĂŒr wurde ein Takt-synchrones 2D Netzwerk auf ein dreidimensionales asynchrones Netzwerk mit mehreren Frequenzbereichen erweitert. ZusĂ€tzlich wurde ein skalierbares Online-Power-Management System mit geringem Ressourcenaufwand entwickelt. Die Verifikation neuer Hardwarekomponenten ist einer der zeitaufwendigsten Schritte im Entwicklungsprozess hochintegrierter digitaler Schaltkreise. Um diese Aufgabe zu beschleunigen und um eine parallele Softwareentwicklung zu ermöglichen, wurde im Rahmen dieser Arbeit ein automatisiertes und benutzerfreundliches Tool fĂŒr den Entwurf neuer Hardware Projekte entwickelt. Eine grafische BenutzeroberflĂ€che zum Erstellen des gesamten Designablaufs, vom Erstellen der Architektur, Parameter Deklaration, Simulation, Synthese und Test ist Teil dieses Werkzeugs. Zudem stellt die GrĂ¶ĂŸe der Architektur fĂŒr die Erstellung eines Prototypen eine besondere Herausforderung dar. FrĂŒhere Arbeiten haben es versĂ€umt, eine schnelles und unkompliziertes Prototyping, insbesondere von Architekturen mit mehr als 50 Prozessorkernen, zu realisieren. Diese Arbeit umfasst eine Design Space Exploration und FPGA-basierte Prototypen von verschiedenen 3D-NoC Implementierungen mit mehr als 80 Prozessoren

    Interconnect architectures for dynamically partially reconfigurable systems

    Get PDF
    Dynamically partially reconfigurable FPGAs (Field-Programmable Gate Arrays) allow hardware modules to be placed and removed at runtime while other parts of the system keep working. With their potential benefits, they have been the topic of a great deal of research over the last decade. To exploit the partial reconfiguration capability of FPGAs, there is a need for efficient, dynamically adaptive communication infrastructure that automatically adapts as modules are added to and removed from the system. Many bus and network-on-chip (NoC) architectures have been proposed to exploit this capability on FPGA technology. However, few realizations have been reported in the public literature to demonstrate or compare their performance in real world applications. While partial reconfiguration can offer many benefits, it is still rarely exploited in practical applications. Few full realizations of partially reconfigurable systems in current FPGA technologies have been published. More application experiments are required to understand the benefits and limitations of implementing partially reconfigurable systems and to guide their further development. The motivation of this thesis is to fill this research gap by providing empirical evidence of the cost and benefits of different interconnect architectures. The results will provide a baseline for future research and will be directly useful for circuit designers who must make a well-reasoned choice between the alternatives. This thesis contains the results of experiments to compare different NoC and bus interconnect architectures for FPGA-based designs in general and dynamically partially reconfigurable systems. These two interconnect schemes are implemented and evaluated in terms of performance, area and power consumption using FFT (Fast Fourier Transform) andANN(Artificial Neural Network) systems as benchmarks. Conclusions drawn from these results include recommendations concerning the interconnect approach for different kinds of applications. It is found that a NoC provides much better performance than a single channel bus and similar performance to a multi-channel bus in both parallel and parallel-pipelined FFT systems. This suggests that a NoC is a better choice for systems with multiple simultaneous communications like the FFT. Bus-based interconnect achieves better performance and consume less area and power than NoCbased scheme for the fully-connected feed-forward NN system. This suggests buses are a better choice for systems that do not require many simultaneous communications or systems with broadcast communications like a fully-connected feed-forward NN. Results from the experiments with dynamic partial reconfiguration demonstrate that buses have the advantages of better resource utilization and smaller reconfiguration time and memory than NoCs. However, NoCs are more flexible and expansible. They have the advantage of placing almost all of the communication infrastructure in the dynamic reconfiguration region. This means that different applications running on the FPGA can use different interconnection strategies without the overhead of fixed bus resources in the static region. Another objective of the research is to examine the partial reconfiguration process and reconfiguration overhead with current FPGA technologies. Partial reconfiguration allows users to efficiently change the number of running PEs to choose an optimal powerperformance operating point at the minimum cost of reconfiguration. However, this brings drawbacks including resource utilization inefficiency, power consumption overhead and decrease in system operating frequency. The experimental results report a 50% of resource utilization inefficiency with a power consumption overhead of less than 5% and a decrease in frequency of up to 32% compared to a static implementation. The results also show that most of the drawbacks of partial reconfiguration implementation come from the restrictions and limitations of partial reconfiguration design flow. If these limitations can be addressed, partial reconfiguration should still be considered with its potential benefits.Thesis (Ph.D.) -- University of Adelaide, School of Electrical and Electronic Engineering, 201

    NoC Emulation based on Partial Reconfiguration

    Full text link
    This paper studies the possibly of using partial reconfiguration for achieving acceleration of the emulation process of Systems on Chip based on Networks on Chip (NoC). The main advantage of using partial reconfiguration is that re-synthesis of the systems is not required and thus the emulation process can be accelerated. The paper focuses on the description of a method for building partial runtime reconfigurable systems and its application for building a NoC emulation framework. The paper also includes a brief description of all the building elements of the emulation framework and a use case that demonstrates the advantages of the use of partial reconfiguration for emulation

    Enabling Dynamic Communication for Runtime Circuit Relocation

    Get PDF
    • 

    corecore