1,298 research outputs found

    High performance communication on reconfigurable clusters

    Get PDF
    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the us/day range with a commodity cloud

    From MARTE to Reconfigurable NoCs: A model driven design methodology

    Get PDF
    Due to the continuous exponential rise in SoC's design complexity, there is a critical need to find new seamless methodologies and tools to handle the SoC co-design aspects. We address this issue and propose a novel SoC co-design methodology based on Model Driven Engineering and the MARTE (Modeling and Analysis of Real-Time and Embedded Systems) standard proposed by Object Management Group, to raise the design abstraction levels. Extensions of this standard have enabled us to move from high level specifications to execution platforms such as reconfigurable FPGAs. In this paper, we present a high level modeling approach that targets modern Network on Chips systems. The overall objective: to perform system modeling at a high abstraction level expressed in Unified Modeling Language (UML); and afterwards, transform these high level models into detailed enriched lower level models in order to automatically generate the necessary code for final FPGA synthesis

    FPGA dynamic and partial reconfiguration : a survey of architectures, methods, and applications

    Get PDF
    Dynamic and partial reconfiguration are key differentiating capabilities of field programmable gate arrays (FPGAs). While they have been studied extensively in academic literature, they find limited use in deployed systems. We review FPGA reconfiguration, looking at architectures built for the purpose, and the properties of modern commercial architectures. We then investigate design flows, and identify the key challenges in making reconfigurable FPGA systems easier to design. Finally, we look at applications where reconfiguration has found use, as well as proposing new areas where this capability places FPGAs in a unique position for adoption

    Mapping applications onto FPGA-centric clusters

    Full text link
    High Performance Computing (HPC) is becoming increasingly important throughout science and engineering as ever more complex problems must be solved through computational simulations. In these large computational applications, the latency of communication between processing nodes is often the key factor that limits performance. An emerging alternative computer architecture that addresses the latency problem is the FPGA-centric cluster (FCC); in these systems, the devices (FPGAs) are directly interconnected and thus many layers of hardware and software are avoided. The result can be scalability not currently achievable with other technologies. In FCCs, FPGAs serve multiple functions: accelerator, network interface card (NIC), and router. Moreover, because FPGAs are configurable, there is substantial opportunity to tailor the router hardware to the application; previous work has demonstrated that such application-aware configuration can effect a substantial improvement in hardware efficiency. One constraint of FCCs is that it is convenient for their interconnect to be static, direct, and have a two or three dimensional mesh topology. Thus, applications that are naturally of a different dimensionality (have a different logical topology) from that of the FCC must be remapped to obtain optimal performance. In this thesis we study various aspects of the mapping problem for FCCs. There are two major research thrusts. The first is finding the optimal mapping of logical to physical topology. This problem has received substantial attention by both the theory community, where topology mapping is referred to as graph embedding, and by the High Performance Computing (HPC) community, where it is a question of process placement. We explore the implications of the different mapping strategies on communication behavior in FCCs, especially on resulting load imbalance. The second major research thrust is built around the hypothesis that applications that need to be remapped (due to differing logical and physical topologies) will have different optimal router configurations from those applications that do not. For example, due to remapping, some virtual or physical communication links may have little occupancy; therefore fewer resources should be allocated to them. Critical here is the creation of a new set of parameterized hardware features that can be configured to best handle load imbalances caused by remapping. These two thrusts form a codesign loop: certain mapping algorithms may be differentially optimal due to application-aware router reconfiguration that accounts for this mapping. This thesis has four parts. The first part introduces the background and previous work related to communication in general and, in particular, how it is implemented in FCCs. We build on previous work on application-aware router configuration. The second part introduces topology mapping mechanisms including those derived from graph embeddings and a greedy algorithm commonly used in HPC. In the third part, topology mappings are evaluated for performance and imbalance; we note that different mapping strategies lead to different imbalances both in the overall network and in each node. The final part introduces reconfigure router design that allocates resources based on different imbalance situations caused by different mapping behaviors

    Compiler and Architecture Design for Coarse-Grained Programmable Accelerators

    Get PDF
    abstract: The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease. On the other hand, the transistor density increases significantly by technology scaling. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, increasing energy-efficiency must be addressed at all design levels from circuit level to application and algorithm levels. At architectural level, one promising approach is to populate the system with hardware accelerators each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable. Therefore, their utilization can be low as they perform one specific function. Using software programmable accelerators is an alternative approach to achieve high energy-efficiency and programmability. Due to intrinsic characteristics of software accelerators, they can exploit both instruction level parallelism and data level parallelism. Coarse-Grained Reconfigurable Architecture (CGRA) is a software programmable accelerator consists of a number of word-level functional units. Motivated by promising characteristics of software programmable accelerators, the potentials of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework consists of three different aspects: CGRA architectural design, integration in a computing system, and CGRA compiler. First, the design and implementation of a CGRA and its instruction set is presented. This design is then modeled in a cycle accurate system simulator. The simulation platform enables us to investigate several problems associated with a CGRA when it is deployed as an accelerator in a computing system. Next, the problem of mapping a compute intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed which effectively utilize CGRA scarce resources very well to minimize the running time of input applications. Finally, these mapping algorithms are integrated in a compiler framework to construct a compiler for CGRADissertation/ThesisDoctoral Dissertation Computer Science 201

    Optimising and evaluating designs for reconfigurable hardware

    No full text
    Growing demand for computational performance, and the rising cost for chip design and manufacturing make reconfigurable hardware increasingly attractive for digital system implementation. Reconfigurable hardware, such as field-programmable gate arrays (FPGAs), can deliver performance through parallelism while also providing flexibility to enable application builders to reconfigure them. However, reconfigurable systems, particularly those involving run-time reconfiguration, are often developed in an ad-hoc manner. Such an approach usually results in low designer productivity and can lead to inefficient designs. This thesis covers three main achievements that address this situation. The first achievement is a model that captures design parameters of reconfigurable hardware and performance parameters of a given application domain. This model supports optimisations for several design metrics such as performance, area, and power consumption. The second achievement is a technique that enhances the relocatability of bitstreams for reconfigurable devices, taking into account heterogeneous resources. This method increases the flexibility of modules represented by these bitstreams while reducing configuration storage size and design compilation time. The third achievement is a technique to characterise the power consumption of FPGAs in different activity modes. This technique includes the evaluation of standby power and dedicated low-power modes, which are crucial in meeting the requirements for battery-based mobile devices
    • …
    corecore